chanzuckerberg / idseq-workflows

Portable WDL workflows for IDseq production pipelines
https://idseq.net/
MIT License

fix bug with NA values in ska.distances.tsv and add known user error for divergent samples #131

Closed katrinakalantar closed 3 years ago

katrinakalantar commented 3 years ago

NA values can appear in the ska.distances.tsv file for samples with no overlapping kmers. We were filling NA values with zeroes at a later step in the ComputeClusters script, but a distance of zero gives the false appearance that these samples are extremely similar. This update applies fillna = 1 earlier, when the ska.distances.tsv file is parsed, so that samples with NA values appear maximally divergent in the resulting heatmap.
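As a rough illustration of the fill-at-parse-time idea (not the code in this PR), here is a minimal standard-library sketch. The column names `Sample 1`, `Sample 2`, and `Mash-like distance`, the set of NA tokens, and the function name `parse_distances` are all assumptions made for the example; the real script may select a different column and use pandas instead.

```python
import csv
import io

# Tokens treated as missing. This set is an assumption for the sketch.
NA_TOKENS = {"", "NA", "NaN", "nan"}


def parse_distances(handle, sample_cols=("Sample 1", "Sample 2"),
                    value_col="Mash-like distance", fill=1.0):
    """Return {(sample_a, sample_b): distance}, replacing NA with `fill`.

    Filling with 1.0 while parsing means pairs with no overlapping kmers are
    treated as maximally divergent everywhere downstream, instead of being
    zero-filled later and looking identical.
    """
    distances = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        raw = (row[value_col] or "").strip()
        value = fill if raw in NA_TOKENS else float(raw)
        distances[(row[sample_cols[0]], row[sample_cols[1]])] = value
    return distances


# Hypothetical two-row input; the second pair has no overlapping kmers.
example = (
    "Sample 1\tSample 2\tMash-like distance\n"
    "A\tB\t0.02\n"
    "A\tC\tNA\n"
)
result = parse_distances(io.StringIO(example))
```

With a zero fill, the `A`/`C` pair would be indistinguishable from a pair of identical samples. With a fill of 1 it sits at the far end of the distance scale.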

Additionally, when NA values appear in the ska.distances.tsv file, there may be no variants from which to construct the tree. This causes an error due to invalid input to iqtree (i.e. a malformed .fasta file). For example:

>seq1

>seq2

>seq3

This PR also adds functionality to catch such errors and classify them as TooDivergentError.
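One way such a check could look, as a sketch only: scan the alignment before it is handed to iqtree and raise if no record carries any sequence. The `TooDivergentError` class below is a stand-in defined for the example (the real class lives in the pipeline and is not shown in this thread), and `sequence_lengths` and `check_alignment` are hypothetical helper names.

```python
class TooDivergentError(Exception):
    """Stand-in for the pipeline's error class; defined here for illustration."""


def sequence_lengths(fasta_text):
    """Map each FASTA record name to the total length of its sequence lines."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


def check_alignment(fasta_text):
    """Raise TooDivergentError if the alignment has no sequence content."""
    lengths = sequence_lengths(fasta_text)
    if not lengths or all(n == 0 for n in lengths.values()):
        raise TooDivergentError(
            "no variant sites found; samples may be too divergent"
        )
    return lengths


# The malformed case from the description: headers with empty sequences.
empty_alignment = ">seq1\n\n>seq2\n\n>seq3\n"
# A well-formed case, with one record wrapped across two lines.
good_alignment = ">a\nACGT\n>b\nAC\nGT\n"
```

Checking the file up front lets the failure be reported as a known user error rather than as an opaque iqtree crash.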

This was tested on a sample set that generated NA values in the ska.distances.tsv file.

katrinakalantar commented 3 years ago

@morsecodist FASTA file-type validation could improve our ability to surface debuggable errors to the dev team (it wouldn't make a difference on the user-interpretation side), but if you think it would help with maintainability then we should definitely do that. Do you think that should be wrapped into this PR?