Closed tomkinsc closed 5 years ago
@yesimon Yeah, it would be great to handle non-even distributions once we have relevant test data. This PR is mostly intended to add an automated check on demultiplexed data, so pool quants are not available at the time of demux since they're not part of the sample sheet. It's certainly worth revisiting in the future. We could standardize the practice of adding pool fraction information to the sample sheet. I hope to have a better handle on paths forward after some time in the lab.
I haven't benchmarked it, but the additional runtime should be minimal since it compares the relatively few rows in the picard metrics file against a truncated and already-sorted list of (the top 1000) observed barcodes. I don't think it makes sense to spin it out into a separate task considering the additional overhead.
Sounds good, yeah that's going to be quite quick, since it's only processing two O(1) input sources.
This adds a function, `guess_barcodes`, to `illumina.py` to help identify barcodes that are outliers by read count, along with potential alternative barcode pairs that may make sense. The heuristic looks for a barcode pair that is not used by another sample and that 1) has one index match (assuming a laboratory swap affected only one of the indices) and 2) has a higher read count. If no single-barcode match works but there is a higher-read-count option with two different barcodes, that pair is suggested instead. Where colliding pairs exist, the output is cautious and does not suggest alternatives, though outliers are still identified. Outliers are identified based on variance from the assumption of a balanced pool with one negative control, though the threshold and the number of controls can be set. As an alternative to finding outliers, the user can specify a sample name explicitly or define a read count threshold below which barcodes will be reassessed. An error is issued if the number of assigned reads is <70% of the pool (configurable).

A call to `guess_barcodes` has been incorporated into the demultiplexing workflows, both Snakemake and WDL. A separate WDL task to call only `guess_barcodes` is not included in this PR. Basic tests are included.
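
For readers who want the heuristic in concrete form, here is a minimal Python sketch of the logic described above. It is not the code in this PR: the function names (`find_outliers`, `suggest_pair`, `check_assigned_fraction`), their parameters, the data layout, and the specific outlier rule (flag samples below a fraction of the expected per-sample share) are all assumptions made for illustration. Collision handling is left out.

```python
def find_outliers(assigned, n_controls=1, min_fraction=0.1):
    """Flag samples whose read count is far below a balanced-pool share.

    assigned: dict of sample name -> assigned read count.
    Assumption: the expected share is total reads / (samples - controls),
    and anything under min_fraction of that share is an outlier.
    Negative controls are not known here, so they will be flagged too.
    """
    n_expected = max(len(assigned) - n_controls, 1)
    expected = sum(assigned.values()) / n_expected
    return sorted(s for s, n in assigned.items() if n < min_fraction * expected)


def suggest_pair(sample_pair, sample_reads, used_pairs, observed):
    """Suggest an alternative barcode pair for one outlier sample.

    sample_pair: (barcode_1, barcode_2) from the sample sheet.
    used_pairs:  set of pairs assigned to the other samples.
    observed:    list of (barcode_1, barcode_2, reads), sorted by reads
                 in descending order (e.g. the top observed barcodes).
    """
    candidates = [
        (b1, b2) for b1, b2, n in observed
        if (b1, b2) not in used_pairs and n > sample_reads
    ]
    # First preference: exactly one of the two indices matches.
    for b1, b2 in candidates:
        if (b1 == sample_pair[0]) != (b2 == sample_pair[1]):
            return (b1, b2)
    # Fallback: a higher-count pair where both indices differ.
    for b1, b2 in candidates:
        if b1 != sample_pair[0] and b2 != sample_pair[1]:
            return (b1, b2)
    return None


def check_assigned_fraction(assigned_reads, total_reads, min_fraction=0.7):
    """Raise if too few of the pool's reads were assigned to samples."""
    if total_reads > 0 and assigned_reads / total_reads < min_fraction:
        raise ValueError("assigned reads below %.0f%% of pool" % (min_fraction * 100))


# Made-up example data, for illustration only.
sheet = {
    "s1": ("AAAA", "CCCC"),
    "s2": ("GGGG", "TTTT"),
    "s3": ("ACGT", "TGCA"),
    "neg": ("TTTT", "AAAA"),
}
assigned = {"s1": 1000, "s2": 950, "s3": 5, "neg": 3}
observed = [
    ("AAAA", "CCCC", 1000),
    ("GGGG", "TTTT", 950),
    ("ACGT", "GGCC", 900),
    ("TTTT", "CCCC", 40),
    ("ACGT", "TGCA", 5),
]

outliers = find_outliers(assigned)
others = {pair for name, pair in sheet.items() if name != "s3"}
suggestion = suggest_pair(sheet["s3"], assigned["s3"], others, observed)
```

In this toy data, `s3` shares its first index with a high-count unassigned pair, which is the single-index swap case the description targets. The negative control is also flagged as low, so a caller would need to exclude known controls before acting on the result.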