10XGenomics / cellranger

10x Genomics Single Cell Analysis
https://www.10xgenomics.com/support/software/cell-ranger

Update mkfastq --qc to reliably parse NovaSeq metrics #63

Closed jfx319 closed 10 months ago

jfx319 commented 4 years ago

Currently, cellranger mkfastq relies on the open source illuminate Python module, originally created by Invitae (a third-party diagnostics company) many years ago, before the NovaSeq existed. The most recent version of illuminate does not explicitly mention NovaSeq among its supported sequencers.

However, for my NovaSeq run, the mkfastq --qc subroutine does not reproduce the metrics shown in SAV. For example, it reports twice as many raw clusters as Illumina does. Additionally, many fields are "null" in mkfastq's outs/qc_summary.json. Taken together, something is off about the metrics parsing for a NovaSeq run. Given that the illuminate module does not list NovaSeq as supported, that may be the root cause.
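To make the "null" observation easy to reproduce, here is a minimal sketch that lists every null field in a JSON file such as outs/qc_summary.json. It assumes nothing about the file's schema and simply walks whatever structure it finds; the function names are hypothetical helpers, not part of cellranger.

```python
import json


def find_null_paths(node, prefix=""):
    """Return the dotted paths of every null (None) value in parsed JSON."""
    paths = []
    if node is None:
        paths.append(prefix or "<root>")
    elif isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            paths.extend(find_null_paths(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            paths.extend(find_null_paths(value, f"{prefix}[{index}]"))
    return paths


def null_fields_in_file(path):
    """Load a JSON file and list the paths of its null fields."""
    with open(path) as handle:
        return find_null_paths(json.load(handle))
```

For example, `null_fields_in_file("outs/qc_summary.json")` returns a list of paths that can be attached to a bug report alongside the corresponding SAV values.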

Fortunately, Illumina has since released its own open source interop tool for parsing a run's InterOp folder (where the per-lane, per-tile, and per-cycle metrics are stored). In addition to explicitly supporting NovaSeq, it is backwards compatible with earlier sequencers, except for the ancient Genome Analyzer machines. Python bindings also exist and are likewise installable via pip. There are also some ipynb tutorials for a quick start: https://github.com/Illumina/interop/tree/master/docs/src

Would it be worth porting mkfastq's qc.py subroutine to Illumina's interop for future NovaSeq compatibility? The alternative would be to rewrite illuminate to support NovaSeq, but that is a third-party tool, so why not use Illumina's own?
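As a rough illustration of what such a port might involve, the sketch below reads per-lane summary metrics with the interop Python bindings, following the pattern in Illumina's tutorial notebooks. It is not how qc.py works today. The helper names and the choice of fields are assumptions, and it presumes the `interop` package is installed and the run folder contains an InterOp directory.

```python
def lane_to_dict(lane_summary):
    """Flatten one interop lane summary into plain, JSON-friendly values."""
    return {
        "lane": lane_summary.lane(),
        "reads": lane_summary.reads(),
        "reads_pf": lane_summary.reads_pf(),
        "density_mean": lane_summary.density().mean(),
        "percent_pf_mean": lane_summary.percent_pf().mean(),
    }


def summarize_run(run_folder):
    """Return per-lane metrics for the first read of a run (hypothetical helper)."""
    # Imported here so the module loads even where interop is not installed.
    from interop import py_interop_run, py_interop_run_metrics, py_interop_summary

    run_metrics = py_interop_run_metrics.run_metrics()
    # Load only the InterOp files that the summary calculation needs.
    valid_to_load = py_interop_run.uchar_vector(py_interop_run.MetricCount, 0)
    py_interop_run_metrics.list_summary_metrics_to_load(valid_to_load)
    run_metrics.read(run_folder, valid_to_load)

    summary = py_interop_summary.run_summary()
    py_interop_summary.summarize_run_metrics(run_metrics, summary)

    # summary.at(read_index).at(lane_index) is the lane summary for one read.
    first_read = summary.at(0)
    return [lane_to_dict(first_read.at(i)) for i in range(summary.lane_count())]
```

Comparing the `reads` and `reads_pf` values from `summarize_run` against both SAV and outs/qc_summary.json for the same run would show whether the discrepancy originates in the parsing step.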