broadinstitute / viral-ngs

Viral genomics analysis pipelines

add krakenuniq summary kmer filter #998

Open tomkinsc opened 4 years ago

tomkinsc commented 4 years ago

To address https://github.com/broadinstitute/viral-classify/issues/1, this adds a new command, krakenuniq_report_filter, to metagenomics.py:

usage: metagenomics.py subcommand krakenuniq_report_filter [-h]
                                                           [--fieldToFilterOn {num_reads,uniq_kmers}]
                                                           [--fieldToAdjust {num_reads,uniq_kmers} [{num_reads,uniq_kmers} ...]]
                                                           [--keepAboveN KEEP_THRESHOLD]
                                                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                                           [--version]
                                                           [--tmp_dir TMP_DIR]
                                                           [--tmp_dirKeep]
                                                           summary_file_in
                                                           summary_file_out

Filter a krakenuniq report by field to include rows above some threshold,
where contributions to the value from subordinate levels are first removed
from the value.

positional arguments:
  summary_file_in       Input KrakenUniq-format summary text file with tab-
                        delimited fields and indented taxonomic levels.
  summary_file_out      Output KrakenUniq-format summary text file with tab-
                        delimited fields and indented taxonomic levels.

optional arguments:
  -h, --help            show this help message and exit
  --fieldToFilterOn {num_reads,uniq_kmers}
                        The field to filter on (default: uniq_kmers).
  --fieldToAdjust {num_reads,uniq_kmers} [{num_reads,uniq_kmers} ...]
                        The field to adjust along with the --fieldToFilterOn
                        (default: ['num_reads']).
  --keepAboveN KEEP_THRESHOLD
                        Only taxa with values above this will be kept. Higher
                        taxonomic ranks will have their values reduced by the
                        values of removed sub-ranks (default: 100)
  --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}
                        Verboseness of output. [default: INFO]
  --version, -V         show program's version number and exit
  --tmp_dir TMP_DIR     Base directory for temp files. [default:
                        /var/folders/bx/90px0g1n1v122slzjnyvrsk5gn42vm/T/]
  --tmp_dirKeep         Keep the tmp_dir if an exception occurs while running.
                        Default is to delete all temp files at the end, even
                        if there's a failure. (default: False)

This command performs a depth-first traversal of the report: the lowest rows have the value of the field specified by --fieldToFilterOn (default: uniq_kmers) zeroed out if it falls below the threshold given by --keepAboveN (default: 100). Under the assumption that higher taxonomic levels carry cumulative values that include contributions from the zeroed-out rows, the selected field at higher levels is reduced by the amount contributed by lower levels that fell below the threshold, with the subtraction propagated up the tree to the root node. Because the traversal is depth-first, each higher level is then re-evaluated to see whether it still meets the threshold once the contributions of its pruned lower levels have been subtracted.
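
For illustration, here's a minimal sketch of that pruning pass; the Node class, prune() helper, and field handling below are stand-ins, not the actual code in metagenomics.py, and the strict "keep only if above N" boundary is assumed from the flag's help text:

```python
# Illustrative sketch only -- not the implementation in metagenomics.py.
class Node:
    def __init__(self, name, values, children=None):
        self.name = name
        self.values = dict(values)       # e.g. {"uniq_kmers": 150, "num_reads": 40}
        self.children = list(children or [])

def prune(node, filter_field="uniq_kmers", adjust_fields=("num_reads",), threshold=100):
    """Depth-first pass: zero out sub-threshold nodes and return the amounts
    removed from this subtree so callers can subtract them from ancestors."""
    removed = {f: 0 for f in (filter_field, *adjust_fields)}
    for child in node.children:
        child_removed = prune(child, filter_field, adjust_fields, threshold)
        for field, amount in child_removed.items():
            # cumulative counts at this level lose the pruned contribution
            node.values[field] -= amount
            removed[field] += amount
    # re-evaluate this node after its lower levels have been subtracted;
    # --keepAboveN is assumed to mean "keep only if strictly above N"
    if node.values[filter_field] <= threshold:
        for field in removed:
            removed[field] += node.values[field]
            node.values[field] = 0
    # drop children that were zeroed out so they are excluded from the output
    node.children = [c for c in node.children if c.values[filter_field] > 0]
    return removed
```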

The hierarchy of rows is inferred from the indentation of the taxName column, since many rows do not have a formal taxonomic rank assigned (i.e. their rank is "no rank").
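
A rough sketch of how that indentation can be turned into a tree, reusing the hypothetical Node class above (the two-spaces-per-level indent width is an assumption about the report format, not something confirmed here):

```python
def infer_depth(tax_name_field, indent_unit=2):
    # depth = number of leading indent units on the raw (un-stripped) taxName column
    stripped = tax_name_field.lstrip(" ")
    return (len(tax_name_field) - len(stripped)) // indent_unit

def build_tree(rows):
    """rows: (raw taxName column, values dict) tuples in report order.
    Returns the top-level nodes, using a stack of (depth, node) pairs."""
    roots, stack = [], []
    for tax_name_field, values in rows:
        node = Node(tax_name_field.strip(), values)
        depth = infer_depth(tax_name_field)
        while stack and stack[-1][0] >= depth:
            stack.pop()
        (stack[-1][1].children if stack else roots).append(node)
        stack.append((depth, node))
    return roots
```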

Secondarily, the fields specified by --fieldToAdjust (default: num_reads) are adjusted in the same way, using the threshold decision made via --keepAboveN and --fieldToFilterOn: whenever a row is zeroed out, these fields are also zeroed and their values subtracted from higher levels, with propagation up the tree.

After these counts are adjusted across the entire tree, the part-of-whole percentages for the rows are recomputed to reflect the new read counts. The resulting tree is then written out to a new KrakenUniq-format report, filtered to include only those rows meeting the initial threshold criterion.
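
As a sketch of that final step, assuming the report's percentage column is reads-in-clade as a fraction of the total (the pct field name and helper below are illustrative, not the real column handling):

```python
def iter_nodes(node):
    # pre-order walk over a subtree
    yield node
    for child in node.children:
        yield from iter_nodes(child)

def recompute_percentages(roots, field="num_reads", pct_field="pct"):
    # assumes the top-level cumulative read counts sum to the report total
    total = sum(r.values[field] for r in roots) or 1
    for root in roots:
        for node in iter_nodes(root):
            node.values[pct_field] = 100.0 * node.values[field] / total
```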

dpark01 commented 4 years ago

Fun, you did it!

I think the whole time I was thinking of this little task, I always figured that the only way I'd really understand how it worked and whether it was working properly was to stare at how it handled a small suite of unit tests on really simple/stripped down inputs demonstrating the various edge cases. Something like:

  1. a report that emerges unchanged after filtering (because its values are large enough)
  2. a report that gets a few taxonomic leaves trimmed off (but no higher nodes)
  3. a report that gets a higher node pruned in the first pass (before re-summing)
  4. a report that gets a higher node pruned off after re-summing (due to pruned leaves)

Not sure if that's quite the right set of test cases so feel free to rethink it. But can you add unit tests?
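
For what cases 1 and 4 above might look like, here's a rough pytest-style sketch exercising the hypothetical Node/prune helpers from earlier in the thread (not the actual command or the repo's test layout):

```python
def leaf(name, kmers, reads):
    return Node(name, {"uniq_kmers": kmers, "num_reads": reads})

def test_report_unchanged_when_values_are_large():
    root = Node("root", {"uniq_kmers": 500, "num_reads": 500},
                [leaf("A", 300, 300), leaf("B", 200, 200)])
    prune(root, threshold=100)
    assert root.values["uniq_kmers"] == 500 and len(root.children) == 2

def test_higher_node_pruned_after_resumming():
    # "genus" only clears the threshold because of its sub-threshold leaves
    root = Node("root", {"uniq_kmers": 1000, "num_reads": 1000},
                [Node("genus", {"uniq_kmers": 120, "num_reads": 120},
                      [leaf("sp1", 60, 60), leaf("sp2", 50, 50)])])
    prune(root, threshold=100)
    assert root.children == []                      # genus pruned after re-summing
    assert root.values["uniq_kmers"] == 1000 - 120  # its remaining kmers subtracted
```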

Also: likely for a separate PR, but it'd be nice to add an optional param to provide a specific exclusion list of taxids (which would always remove those nodes, including any lower-ranking nodes beneath them... and then recompute/sum upwards).
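
As a sketch of how such an exclusion list could slot into the same traversal (the exclude_taxids parameter is hypothetical, and taxids are assumed to be stored alongside the counts in each node's values dict):

```python
def prune_excluded(node, exclude_taxids, fields=("uniq_kmers", "num_reads")):
    """Remove any node whose taxid is excluded (along with everything beneath it)
    and return the amounts to subtract from its ancestors on the way back up."""
    if node.values.get("taxid") in exclude_taxids:
        removed = {f: node.values[f] for f in fields}   # cumulative contribution
        for f in fields:
            node.values[f] = 0
        node.children = []
        return removed
    removed = {f: 0 for f in fields}
    for child in node.children:
        for f, amount in prune_excluded(child, exclude_taxids, fields).items():
            node.values[f] -= amount
            removed[f] += amount
    node.children = [c for c in node.children if any(c.values[f] for f in fields)]
    return removed
```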

tomkinsc commented 4 years ago

Yup, I'll add unit tests—just wanted to get this open to avoid duplication of effort in case this was on anyone else's agenda.

yesimon commented 4 years ago

Mind if I pull this into viral-classify?

dpark01 commented 4 years ago

@yesimon please do -- but on a separate branch/pr from the refactor one (which will take longer to vet)

tomkinsc commented 3 years ago

Looks like this never got moved over to viral-classify; shall we?