fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes
BSD 3-Clause "New" or "Revised" License
8 stars 1 forks source link

Throw errors if we see degenerate nucleotides #12

Closed fedarko closed 2 years ago

fedarko commented 2 years ago

Either that, or ignore them, or test that they don't mess things up.

fedarko commented 2 years ago

This check is now implemented in the FASTA "parser" function (which really just uses skbio). Once all of the commands use this to process FASTA files of contigs, the main concern becomes dealing with reads that contain degenerate nucleotides -- which shouldn't happen in practice, I think.

call_utils.run() does account for this by raising an error if any position has a N aligned to it; also, the coverage computation(s) ignore degenerate nucleotides by implicitly defining coverage as A + C + G + T quantities.

Edit: So, although this handles N degenerates, pysamstats will still let other degenerate nucleotides (e.g. M, or even nonstandard things like Z) slip through -- in this case, these will not show up in stat_variation()'s output at all, and will instead just count toward mismatches at this position. See pysamstats code here.

We could try to handle this by writing code to scan through all of the input reads and raise an error if we see any degen nucleotides, but that would take a silly amount of time probs. Or I could make a PR to pysamstats that tries to update this, maybe (e.g. creating a degen value in each pileup record)...? Probably the easiest thing to do is just letting the user know about this in the docs. I guess having degenerate nucleotides count towards mismatches isn't a bad idea, it just messes with things a bit.