broadinstitute / viral-core

viral-ngs: read QC, barcode metrics, spike-in metrics, Illumina metrics and demux

inference of sequencer model needs updating #108

Open dpark01 opened 1 month ago

dpark01 commented 1 month ago

As of 2024, illumina_demux fails to infer the sequencer model for recent NextSeq 2000 runs (not sure if they're XLEAP kits or just normal ones) and instead emits UNKNOWN as the sequencer model in its runinfo.json output. We probably just need to update the heuristics and tables here. This has been observed at both Broad and ACEGID.

tomkinsc commented 1 month ago

It seems that all NextSeq 2000 run directories contain a file called RunParameters.xml with various helpful values, including InstrumentType, so we may not need to resort to regex matching to sleuth out the model of newer sequencers. For example:

<InstrumentType>NextSeq 2000</InstrumentType>

We can obtain that value directly in Python like this:

python3 -c "import xml.etree.ElementTree as ET; tree = ET.parse('RunParameters.xml'); root = tree.getroot(); print(root.find('.//InstrumentType').text)"

(perhaps falling back to the old regex approach if the RunParameters.xml file does not exist)
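A minimal sketch of that lookup-with-fallback idea, in case it helps the discussion. The function name `infer_instrument_type` and the `fallback` callable are hypothetical and are not part of illumina_demux; the sketch only assumes that `InstrumentType` appears somewhere under the root element of RunParameters.xml, as in the example above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def infer_instrument_type(run_dir, fallback=None):
    """Return the InstrumentType text from RunParameters.xml in run_dir.

    If the file is missing, unparseable, or has no InstrumentType element,
    return fallback(run_dir) when a fallback callable is given, else None.
    `fallback` stands in for the existing regex-based heuristics
    (hypothetical hook, not an existing viral-core function).
    """
    params_path = Path(run_dir) / "RunParameters.xml"
    if params_path.is_file():
        try:
            root = ET.parse(str(params_path)).getroot()
        except ET.ParseError:
            root = None  # malformed XML: treat like a missing file
        if root is not None:
            value = root.findtext(".//InstrumentType")
            if value and value.strip():
                return value.strip()
    return fallback(run_dir) if fallback is not None else None
```

With this shape, the current regex heuristics could be passed in as `fallback`, so older run directories without RunParameters.xml would keep behaving as they do now.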

Example of other values that may be interesting to parse out and/or use:

  <FlowCellLotNumber>20688106</FlowCellLotNumber>
  <FlowCellExpirationDate>2023-09-03</FlowCellExpirationDate>
  <FlowCellVersion>2</FlowCellVersion>
  <FlowCellMode>NextSeq 1000/2000 P2 Flow Cell Cartridge</FlowCellMode>
  <CartridgeSerialNumber>EC1194950-EC11</CartridgeSerialNumber>
  <CartridgePartNumber>20044466</CartridgePartNumber>
  <CartridgeLotNumber>20668878</CartridgeLotNumber>
  <CartridgeExpirationDate>2023-08-28</CartridgeExpirationDate>
  <CartridgeVersion>3</CartridgeVersion>
  <CartridgeMode>NextSeq 1000/2000 P2 Reagent Cartridge (338 Cycles)</CartridgeMode>

I'm curious if we can use CartridgeLotNumber to find any lot-related effects in the data, or if we can relate any data quality metrics to CartridgeExpirationDate.
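As a rough starting point for that kind of analysis, the sketch below pulls the lot and expiration fields into a dict and computes how many days remained before expiration on the day of the run. The names `CONSUMABLE_TAGS`, `consumable_info`, and `days_until_expiration` are hypothetical, and the run date is assumed to be supplied by the caller from elsewhere, since it is not among the values listed above.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Tags taken from the RunParameters.xml excerpt above.
CONSUMABLE_TAGS = (
    "FlowCellLotNumber",
    "FlowCellExpirationDate",
    "FlowCellMode",
    "CartridgeLotNumber",
    "CartridgeExpirationDate",
    "CartridgeMode",
)


def consumable_info(xml_text):
    """Map each tag in CONSUMABLE_TAGS to its stripped text, or None if absent/empty."""
    root = ET.fromstring(xml_text)
    info = {}
    for tag in CONSUMABLE_TAGS:
        text = (root.findtext(f".//{tag}") or "").strip()
        info[tag] = text or None
    return info


def days_until_expiration(expiration_iso, run_date):
    """Days between run_date and a YYYY-MM-DD expiration date.

    A negative result means the run happened after the expiration date.
    """
    return (date.fromisoformat(expiration_iso) - run_date).days
```

The resulting values could then be joined against per-run quality metrics, grouping by `CartridgeLotNumber` or binning by days until expiration.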