fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes
BSD 3-Clause "New" or "Revised" License
8 stars 1 forks source link

Output file of "unreasonable position" information during naive calling #48

Open fedarko opened 2 years ago

fedarko commented 2 years ago

Figuring out unreasonable positions in a contig requires going through each position in this contig in the alignment; for edge 6104 (CAMP) in the SheepGut dataset, this takes about 13 minutes on the cluster. Which is reasonable, but slow. This is by far the longest step in FDR estimation (which, aside from this, takes maybe ~2 minutes or so for the ≥ 1 Mbp contigs in SheepGut), so if we could cut this time down that'd be great.

Since fdr estimate -- as of writing -- assumes that the input BCF file comes from call, it would make sense to have call produce a file listing unreasonable positions in all contigs that the user can directly pass into fdr estimate. This way, we avoid needing to look at the alignment at all (so, we can remove the --bam parameter from fdr estimate). Having this file of unreasonable positions around could be useful for other purposes, also (maybe for the mutation matrices)?

I'm not sure what the ideal format of this file would be, since the number of unreasonable positions in a given contig can range from 0 to length(contig). Relevant Stack Overflow thread here, although in that example the max number of columns has a hard limit across the entire file.

Maybe we could use a TSV file with two columns ("contig name" and "unreasonable positions"), where the second column contains a comma-separated list of (1-indexed, probably?) positions. This way, the file can be loaded easily using pandas or other TSV parsers (one "column" for each contig, besides the name), and each contig's list of positions can be treated as just a string and parsed easily.

maaaybe too much work for the short term