Not "massive" in the sense of "a large HiFi dataset", but massive in the sense of "this dataset is unrealistically massive and will start to cause weird overflow problems".
Fast-failing (e.g. "This contig is too long") is fine, IMO -- the main thing I want to avoid is producing silently incorrect results.
I imagine most of the code should either work as expected or fail loudly, even for arbitrarily large datasets. Python is good for this sort of stuff (in my experience, at least): it supports arbitrarily large integers, for example, and it'll throw an OverflowError (or a MemoryError) if you try to make a ridiculously long string.
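As a quick illustration of that point, here is a small sketch of how plain Python behaves at these sizes. The helper name `try_huge_string` is made up for this example, and the `ctypes` line is only there to contrast with the fixed-width integers that C-based tools may use internally; it is not a claim about how any particular tool stores lengths. Note that a string request that is large but still below `sys.maxsize` may raise MemoryError rather than OverflowError.

```python
import ctypes
import sys

# Python integers are arbitrary precision, so this product does not wrap around.
big = 2**31 * 2**40
assert big == 2**71


def try_huge_string():
    """Return the name of the exception raised by an impossibly long string.

    (Hypothetical helper, just for this demonstration.)
    """
    try:
        "A" * (sys.maxsize + 1)
    except OverflowError as err:
        return type(err).__name__
    return None


# By contrast, a fixed-width 32-bit signed integer wraps silently:
# ctypes does no overflow checking when storing the value.
wrapped = ctypes.c_int32(2**31).value

print(try_huge_string())  # OverflowError
print(wrapped)            # -2147483648
```

The last line is the sort of silent failure worth guarding against when handing data off to external tools.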
The main thing worth worrying about, I think, is our use of external libraries: samtools, minimap2, bcftools, prodigal, pysam, pysamstats, LJA.
I guess this issue can track the problems these libraries have with massive datasets; we can then add our own checks into strainFlye that fail fast and warn users if any of these problems come up.
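A minimal sketch of what such a fail-fast check could look like, assuming we already have a mapping of contig names to lengths. Everything here (`ContigTooLongError`, `check_contig_lengths`, `CSI_DEFAULT_MAX_LEN`) is a hypothetical name for illustration, not existing strainFlye code. The 2^31 default is the CSI limit quoted in this issue for bcftools index; whether a contig of exactly that length is accepted is an assumption.

```python
# Limit quoted in this issue for bcftools index's default CSI format.
# Treating exactly 2**31 as allowed ("up to length 2^31") is an assumption.
CSI_DEFAULT_MAX_LEN = 2**31


class ContigTooLongError(ValueError):
    """Raised when a contig exceeds a length an external tool can handle."""


def check_contig_lengths(name2len, max_len=CSI_DEFAULT_MAX_LEN):
    """Fail loudly if any contig in name2len is longer than max_len.

    name2len: dict mapping contig name (str) to contig length (int).
    """
    too_long = {name: n for name, n in name2len.items() if n > max_len}
    if too_long:
        details = ", ".join(
            f"{name} ({n:,} bp)" for name, n in sorted(too_long.items())
        )
        raise ContigTooLongError(
            f"These contigs are longer than {max_len:,} bp: {details}"
        )


# Normal-sized contigs pass without complaint...
check_contig_lengths({"edge_1": 5_000_000, "edge_2": 1_200_000})

# ...but an unrealistically long one stops the run before any tool is called.
try:
    check_contig_lengths({"edge_3": 2**31 + 1})
except ContigTooLongError as err:
    print(err)
```

Running a check like this before calling out to an external tool means we don't have to rely on that tool failing loudly on its own.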
bcftools index, using the default CSI format, "...supports indexing of chromosomes up to length 2^31."
For reference, 2^31 = 2,147,483,648 (about 2.15 billion). It's unlikely we'd see prokaryotic genomes this long, I think, but I could imagine this happening eventually.
It isn't clear to me what bcftools index's behavior is when it encounters a chromosome longer than this -- does it fail silently or loudly?
BCF files: see sections 1.3 and 6.3.3 of the spec for info on supported datatypes for each field.
... There are more issues besides this; this is just the start of the list.