fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes
BSD 3-Clause "New" or "Revised" License

Adjusting align to work around indexing limits #21

Open fedarko opened 2 years ago

fedarko commented 2 years ago

When the "reference" contigs file exceeds 4 gigabases, minimap2 splits its index into multiple parts and won't produce a complete SAM header (the @SQ lines are missing). It's possible to work around this using the -I option (which raises the number of bases indexed per batch), or by using the --split-prefix option. The first option is faster but requires more memory; the second is slower but requires less memory and more disk usage.

The easiest way to handle this problem is to allow users to pass a string of parameters to minimap2, so (if they run into this problem) they can pick whichever solution works best for them. This is similar to the --p-java-flags parameter that Qemistree supports for some commands (see here).
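A rough sketch of how the pass-through could work in Python. The function name, the `extra_params` argument, and the bare `-a` base command are all hypothetical here, not what strainFlye's align command actually does; the only real minimap2 flags involved are `-a`, `-I`, and `--split-prefix`.

```python
import shlex


def build_minimap2_command(contigs_fasta, reads, extra_params=""):
    """Build the argument list for a minimap2 call.

    extra_params is a user-supplied string of additional minimap2 options
    (hypothetical parameter), e.g. "-I 8g" or "--split-prefix tmp_idx".
    """
    # "-a" asks minimap2 for SAM output. The real pipeline presumably
    # passes more options than this; this base command is an assumption.
    cmd = ["minimap2", "-a"]
    # shlex.split respects shell-style quoting, so a prefix containing
    # spaces can still be passed as a single argument.
    cmd += shlex.split(extra_params)
    cmd += [contigs_fasta, *reads]
    return cmd


if __name__ == "__main__":
    # Prints the command rather than running it.
    print(build_minimap2_command("contigs.fasta", ["reads.fastq"], "-I 8g"))
```

Returning a list (instead of one big string) means the result can go straight into `subprocess.run` without `shell=True`, which avoids quoting problems with odd file names.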

A second way we may want to handle this (in addition to the above solution) is by detecting when the user's FASTA file of contigs exceeds 4 Gb, and warning them somehow. I mean, minimap2 already warns the user, but maybe we wanna be extra paranoid? If the SAM file doesn't have a header, I thiiiink the downstream conversion to BAM (or at least the indexing) should fail, but I'd prefer to fail as early as possible to avoid wasting people's time. Maybe warn folks in the tutorial and after gfa-to-fasta, if the file is big enough?
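The early check could look something like the sketch below. The helper names and the 4,000,000,000 threshold are assumptions (the threshold just mirrors the "4 gigabases" figure above), and this only handles uncompressed FASTA files.

```python
import warnings

# Assumed threshold, in bases, taken from the "4 gigabases" figure above.
# The actual limit depends on how minimap2's -I option is set.
DEFAULT_LIMIT = 4_000_000_000


def total_fasta_length(path):
    """Sum the lengths of all sequences in an uncompressed FASTA file."""
    total = 0
    with open(path) as f:
        for line in f:
            # Header lines start with ">"; everything else is sequence.
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def warn_if_contigs_too_long(path, limit=DEFAULT_LIMIT):
    """Warn (and return True) if the contigs' total length exceeds limit."""
    total = total_fasta_length(path)
    if total > limit:
        warnings.warn(
            f"{path} contains {total:,} bases, which exceeds {limit:,}. "
            "minimap2 may use a multi-part index and omit @SQ header "
            "lines; consider passing -I or --split-prefix."
        )
        return True
    return False
```

This counts bases rather than checking the file size on disk, since headers and newlines inflate the byte count. It does mean reading the whole file once, which should be cheap compared to the alignment itself.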