fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes
BSD 3-Clause "New" or "Revised" License

Warn / fail when we encounter massive datasets? #37

Open fedarko opened 2 years ago

fedarko commented 2 years ago

Not "massive" in the sense of "a large HiFi dataset", but massive in the sense of "this dataset is unrealistically massive and will start to cause weird overflow problems".

Fast-failing (e.g. "This contig is too long") is fine, IMO -- the main thing I want to avoid is producing silently incorrect results.

I imagine most of the code should either work as expected or fail loudly, even for arbitrarily large datasets. Python is good for this sort of stuff (in my experience, at least): it supports arbitrarily large integers, for example, and it'll raise an OverflowError (or a MemoryError) if you try to make a ridiculously long string.
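To make the point about Python concrete, here is a small sketch of both behaviors. The specific numbers are arbitrary and picked only for illustration, and the string case assumes CPython, where a repeat count that can't fit in an index-sized integer is rejected up front.

```python
# Integers are arbitrary-precision, so arithmetic on huge values stays exact
# instead of silently wrapping around.
big = 2**200 + 1
difference = big - 2**200  # exactly 1

# Building an absurdly long string fails loudly instead of producing garbage.
# 2**100 is an arbitrary, deliberately unrealistic repeat count.
raised_overflow = False
try:
    "A" * (2**100)
except OverflowError:
    raised_overflow = True

print(difference, raised_overflow)
```

Smaller (but still unrealistic) repeat counts can raise a MemoryError instead; either way, the failure is loud rather than silent.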

The main thing worth worrying about, I think, is our use of external libraries: samtools, minimap2, bcftools, prodigal, pysam, pysamstats, LJA.

I guess this issue can track the problems these libraries have with massive datasets; we can then add our own checks into strainFlye that fail fast and warn users if any of these problems come up.
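As a rough sketch of what one of these fail-fast checks might look like: the function name, the exception class, and the limit below are all hypothetical (none of them exist in strainFlye), and the limit is just the largest signed 32-bit integer, used as a placeholder until we know what the real limits of each library are.

```python
# Placeholder limit, chosen only for illustration: the largest value a
# signed 32-bit integer can hold. The real limit(s) would depend on
# whatever we find out about the external libraries.
MAX_CONTIG_LENGTH = 2**31 - 1


class DatasetTooLargeError(ValueError):
    """Raised when an input exceeds a size we know how to handle safely."""


def check_contig_lengths(contig_lengths, max_length=MAX_CONTIG_LENGTH):
    """Fail fast if any contig is longer than max_length.

    contig_lengths: dict mapping contig name -> length (in bp).
    Returns None if every contig is within the limit.
    """
    for name, length in contig_lengths.items():
        if length > max_length:
            raise DatasetTooLargeError(
                f"Contig {name} is too long ({length:,} bp); the maximum "
                f"supported length is {max_length:,} bp."
            )
```

Running something like `check_contig_lengths({"edge_1": 5_000_000})` before calling out to any external tool would pass quietly, while a contig over the limit would stop the pipeline with a clear message instead of letting a downstream tool overflow.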

... There are more issues besides these; this is just the start of the list.