chanzuckerberg / idseq-workflows

Portable WDL workflows for IDseq production pipelines
https://idseq.net/
MIT License
31 stars 12 forks source link

Filter taxids out of hit summaries #73

Closed morsecodist closed 3 years ago

morsecodist commented 3 years ago

This actually contains two changes, a bit confusing but one won't work without the other.

First, I make taxids all strings. We have a big problem in the codebase of taxids being strings/numbers. For filtering, they must be strings. In some places they had to be ints because we do math with them and compare them as numbers sometimes. I think a cleaner approach overall would actually be to make them always ints but it is much safer to go in the string direction since this is explicitly used in way more places and we very rarely do math on the ids.

Second, I filter out taxids from the hit summary from alignment. This has the effect of filtering them out of the refined hits ummary as well. It took a lot of analyzing the code to come to this conclusion but I am pretty confident about it at this point. I added an explainatory comment. In general no taxids will be in the refined hit summary if they are not in the hit summary.

morsecodist commented 3 years ago

@tfrcarvalho

Are we in any way exposed to bad taxids?

They are strings in the lookup dicts but they shouldn't be bad anywhere else. I agree the strings are a bit of a hack. I'd prefer to go ints but I felt it was a bit riskier. Do you think I should go ints?

Testing wrietup incoming

tfrcarvalho commented 3 years ago

They are strings in the lookup dicts but they shouldn't be bad anywhere else. I agree the strings are a bit of a hack. I'd prefer to go ints but I felt it was a bit riskier. Do you think I should go ints?

You have a better understanding of the immediate effort/risk (if our inputs do not have the chance of having malformed tax ids or the parser check the format , there is not much of a risk)... so, will leave final decision up to you.