Open tomkinsc opened 4 years ago
Fun, you did it!
I think the whole time I was thinking of this little task, I always figured that the only way I'd really understand how it worked and whether it was working properly was to stare at how it handled a small suite of unit tests on really simple/stripped down inputs demonstrating the various edge cases. Something like:
Not sure if that's quite the right set of test cases so feel free to rethink it. But can you add unit tests?
Also: likely for a separate PR, but it'd be nice to add an optional param to provide a specific exclusion list of taxids (which would always remove those nodes, including any lower-ranking nodes beneath them, and then recompute/sum upwards).
Yup, I'll add unit tests—just wanted to get this open to avoid duplication of effort in case this was on anyone else's agenda.
Don't mind if I pull this into viral-classify?
@yesimon please do -- but on a separate branch/pr from the refactor one (which will take longer to vet)
Looks like this never got moved over to viral-classify; shall we?
To address https://github.com/broadinstitute/viral-classify/issues/1, this adds a new command, `krakenuniq_report_filter`, to `metagenomics.py`:
The command walks the report using a depth-first traversal. For the lowest rows, the value in the field specified by `--fieldToFilterOn` (default: `uniq_kmers`) is zeroed out if it is below the threshold given by `--keepAboveN` (default: `100`). Under the assumption that higher taxonomic levels have cumulative values including contributions from the zeroed-out rows, the values of the selected field in higher levels are reduced by the amount contained within lower levels that were below the specified threshold (the subtraction is propagated up the tree to the root node). Since the traversal is depth-first, the higher levels are eventually re-evaluated to see whether they still meet the threshold after subtraction of their lower levels.
The hierarchy of rows is read from the indentation of the `taxName` column, since many rows do not have a formal taxonomic rank assigned (i.e. their rank is "`no rank`").
Secondarily, the fields specified by `--fieldToAdjust` (default: `num_reads`) are adjusted using the same conditional threshold established by `--keepAboveN` and `--fieldToFilterOn`: their values are subtracted in the same way, with propagation up the tree.
After these counts have been adjusted across the entire tree, the part-of-whole percentages for the rows are recalculated to reflect the new read counts. The resulting tree is then written out to a new KrakenUniq-format report, filtered to include only those rows meeting the initial threshold criterion.
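For anyone reviewing the logic, here is a minimal sketch of the behavior described above. It is not the code in this PR. The names `Node`, `build_forest`, `prune`, and `filter_report`, the dict-based row layout, and the `pct` key are hypothetical and exist only for illustration. It also assumes that a value exactly equal to the threshold is kept, and that percentages are computed against the summed adjusted `num_reads` of the top-level rows.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    depth: int  # number of leading spaces in taxName
    values: Dict[str, float]
    children: List["Node"] = field(default_factory=list)


def build_forest(rows):
    """Link rows into trees using the leading-space indentation of taxName."""
    sentinel = Node("", -1, {})
    stack = [sentinel]
    for row in rows:
        raw = row["taxName"]
        depth = len(raw) - len(raw.lstrip(" "))
        values = {k: v for k, v in row.items() if k != "taxName"}
        node = Node(raw.strip(), depth, values)
        # Pop back to the nearest row that is indented less than this one.
        while stack[-1].depth >= depth:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return sentinel.children


def prune(node, fields, filter_field, keep_above):
    """Post-order pass; returns the per-field amounts removed from this subtree."""
    removed = {f: 0 for f in fields}
    for child in node.children:
        for f, amount in prune(child, fields, filter_field, keep_above).items():
            removed[f] += amount
    # Subtract what the descendants lost from this (cumulative) row.
    for f in fields:
        node.values[f] -= removed[f]
    # Re-evaluate this row after subtraction; zero it if it now falls short.
    if node.values[filter_field] < keep_above:
        for f in fields:
            removed[f] += node.values[f]
            node.values[f] = 0
    return removed


def filter_report(rows, filter_field="uniq_kmers",
                  adjust_fields=("num_reads",), keep_above=100):
    fields = (filter_field, *adjust_fields)
    forest = build_forest(rows)
    for top in forest:
        prune(top, fields, filter_field, keep_above)
    # Assumption: percentages are relative to the adjusted top-level read total.
    total_reads = sum(top.values["num_reads"] for top in forest)
    out = []

    def emit(node):
        if node.values[filter_field] >= keep_above:
            row = dict(node.values, taxName=" " * node.depth + node.name)
            row["pct"] = 100.0 * row["num_reads"] / total_reads if total_reads else 0.0
            out.append(row)
        for child in node.children:
            emit(child)

    for top in forest:
        emit(top)
    return out


# Made-up input: values are cumulative, hierarchy is encoded by indentation.
rows = [
    {"taxName": "root", "uniq_kmers": 820, "num_reads": 82},
    {"taxName": "  A", "uniq_kmers": 700, "num_reads": 70},
    {"taxName": "    A1", "uniq_kmers": 550, "num_reads": 55},
    {"taxName": "    A2", "uniq_kmers": 50, "num_reads": 5},
    {"taxName": "  B", "uniq_kmers": 120, "num_reads": 12},
    {"taxName": "    B1", "uniq_kmers": 40, "num_reads": 4},
    {"taxName": "    B2", "uniq_kmers": 30, "num_reads": 3},
]
result = filter_report(rows)
for r in result:
    print(r["taxName"], r["uniq_kmers"], r["num_reads"], round(r["pct"], 2))
```

In this made-up input, `A2`, `B1`, and `B2` start below 100 and are zeroed. `B` starts at 120 but drops to 50 once its children are subtracted, so it is removed on re-evaluation. That second case is the one that seems most worth covering in a unit test.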