broadinstitute / viral-ngs

Viral genomics analysis pipelines
Other
190 stars 67 forks source link

add function to assist Illumina index correction #917

Closed tomkinsc closed 5 years ago

tomkinsc commented 5 years ago

This adds a function, guess_barcodes, to illumina.py to assist identifying barcodes that are outliers by read count and potential alternative barcode pairs that may make sense. The heuristic followed tries to find a barcode pair that is not used by another sample that 1) has one index match (assuming a laboratory swap impacted only one of the indices) and 2) has a higher read count. If single-barcode matches do not work but there is a higher read count option with two different barcode this is suggested instead. Where colliding pairs exist, the output is cautious and does not suggest alternatives, though outliers are still identified. Outliers are identified based on variance from the assumption of a balanced pool with one negative control, though threshold, number of controls can be set. As an alternative to finding outliers, the user can specify a sample name explicitly or define a readcount threshold below which barcodes will be reassessed. An error is issued if the number of assigned reads is <70% of the pool (configurable). A call to guess_barcodes has been incorporated into the demultiplexing workflows, both Snakemake and WDL. A separate WDL task to call only guess_barcodes is not included in this PR. Basic tests are included.

usage: illumina.py subcommand guess_barcodes [-h]
                                             [--readcount_threshold READCOUNT_THRESHOLD | --sample_names [SAMPLE_NAMES [SAMPLE_NAMES ...]]]
                                             [--outlier_threshold OUTLIER_THRESHOLD]
                                             [--expected_assigned_fraction EXPECTED_ASSIGNED_FRACTION]
                                             [--number_of_negative_controls NUMBER_OF_NEGATIVE_CONTROLS]
                                             [--rows_limit ROWS_LIMIT]
                                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                             [--version] [--tmp_dir TMP_DIR]
                                             [--tmp_dirKeep]
                                             in_barcodes in_picard_metrics
                                             out_summary_tsv

Guess the barcode value for a sample name, based on the following: - a list is
made of novel barcode pairs seen in the data, but not in the picard metrics -
for the sample in question, get the most abundant novel barcode pair where one
of the barcodes seen in the data matches one of the barcodes in the picard
metrics (partial match) - if there are no partial matches, get the most
abundant novel barcode pair Limitations: - If multiple samples share a barcode
with multiple novel barcodes, disentangling them is difficult or impossible.
The names of samples to guess are selected: - explicitly by name, passed via
argument, OR - explicitly by read count threshold, OR - automatically (if
names or count threshold are omitted) based on basic outlier detection of
deviation from an assumed-balanced pool with some number of negative controls

positional arguments:
  in_barcodes           The barcode counts file produced by common_barcodes.
  in_picard_metrics     The demultiplexing read metrics produced by Picard.
  out_summary_tsv       Path to the summary file (.tsv format). It includes
                        several columns: (sample_name, expected_barcode_1,
                        expected_barcode_2, expected_barcode_1_name,
                        expected_barcode_2_name, expected_barcodes_read_count,
                        guessed_barcode_1, guessed_barcode_2,
                        guessed_barcode_1_name, guessed_barcode_2_name,
                        guessed_barcodes_read_count, match_type), where the
                        expected values are those used by Picard during
                        demultiplexing and the guessed values are based on the
                        barcodes seen among the data.

optional arguments:
  -h, --help            show this help message and exit
  --readcount_threshold READCOUNT_THRESHOLD
                        If specified, guess barcodes for samples with fewer
                        than this many reads. (default: None)
  --sample_names [SAMPLE_NAMES [SAMPLE_NAMES ...]]
                        If specified, only guess barcodes for these sample
                        names. (default: None)
  --outlier_threshold OUTLIER_THRESHOLD
                        threshold of how far from unbalanced a sample must be
                        to be considered an outlier. (default: 0.675)
  --expected_assigned_fraction EXPECTED_ASSIGNED_FRACTION
                        The fraction of reads expected to be assigned. An
                        exception is raised if fewer than this fraction are
                        assigned. (default: 0.7)
  --number_of_negative_controls NUMBER_OF_NEGATIVE_CONTROLS
                        The number of negative controls in the pool, for
                        calculating expected number of reads in the rest of
                        the pool. (default: 1)
  --rows_limit ROWS_LIMIT
                        The number of rows to use from the in_barcodes.
                        (default: 1000)
  --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}
                        Verboseness of output. [default: INFO]
  --version, -V         show program's version number and exit
  --tmp_dir TMP_DIR     Base directory for temp files. [default:
                        /var/folders/bx/90px0g1n1v122slzjnyvrsk5gn42vm/T/]
  --tmp_dirKeep         Keep the tmp_dir if an exception occurs while running.
                        Default is to delete all temp files at the end, even
                        if there's a failure. (default: False)
tomkinsc commented 5 years ago

@yesimon Yeah, it would be great to handle non-even distributions once we have relevant test data. This PR is mostly intended to add an automated check on demultiplexed data, so pool quants are not available at the time of demux since they're not part of the sample sheet. It's certainly worth revisiting in the future. We could standardize the practice of adding pool fraction information to the sample sheet. I hope to have a better handle on paths forward after some time in the lab.

tomkinsc commented 5 years ago

I haven't benchmarked it, but the additional runtime should be minimal since it compares the relatively few rows in the picard metrics file against a truncated and already-sorted list of (the top 1000) observed barcodes. I don't think it makes sense to spin it out into a separate task considering the additional overhead.

dpark01 commented 5 years ago

Sounds good, yeah that's going to be quite quick, since it's only processing two O(1) input sources.