Closed alephreish closed 4 months ago
Thanks for the reproducible example. Can you check if it still happens using --gap-size 1?
While I appreciate your rationale for adding --gap-size, the assembler was written and tested using sequences without Ns (since Ns got removed by read_fasta), so now they may be breaking things in unexpected ways.
Anyways I'll try to download the data and check myself ASAP.
That's right, --gap-size 1 runs without an issue. Interestingly, CALHIT010000021.1 does not even have Ns.
Interesting, so it's GCA_002728275.1 that might be the real trouble-maker here and not GCA_938030875.1. GCA_002728275.1 has an exceptionally high number of ambiguities (0.45%), such that some kmers have more than one N. GCA_938030875.1 on the other hand has no Ns at all. Why the issue emerges when I add GCA_938030875.1 to the mix is not clear.
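For anyone wanting to check their own assemblies for this, here is a minimal sketch that reports the fraction of N bases in a sequence and how many kmers contain more than one N. The function name `ambiguity_stats` is made up for illustration and is not part of SuperPang.

```python
def ambiguity_stats(seq: str, k: int) -> tuple[float, int]:
    """Return (fraction of N bases, number of k-mers holding more than one N).

    Hypothetical helper for illustration only; not part of SuperPang.
    """
    seq = seq.upper()
    n_frac = seq.count("N") / len(seq) if seq else 0.0
    multi_n = 0
    if len(seq) >= k:
        # Count Ns in the first window, then slide one base at a time.
        n_in_window = seq[:k].count("N")
        multi_n += int(n_in_window > 1)
        for i in range(k, len(seq)):
            n_in_window += int(seq[i] == "N") - int(seq[i - k] == "N")
            multi_n += int(n_in_window > 1)
    return n_frac, multi_n
```

Running it per scaffold with the kmer length used for the assembly should show which scaffolds have kmers with several Ns.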
I managed to reproduce the error from your dataset. Thanks for the super easy commands to reproduce it.
SuperPang compresses nucleotides into 2-bit words in order to save some memory during graph creation, but of course this only works if your alphabet has only 4 letters (ACTG). The presence of Ns was leading to undefined behaviour instead of throwing an error right away. This manifested later, when checking that adjacent vertices in the DBG were actually associated with consecutive kmers.
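To illustrate why a fifth letter cannot fit, here is a minimal sketch of 2-bit packing. It is not SuperPang's actual compress()/decompress() code; the names `pack` and `unpack`, the base-to-code mapping and the `strict` switch are assumptions made for the example. In lenient mode an N is silently stored as A, which is one way a wrong kmer could come back out without any error at encoding time.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, 2 bits per base
BASES = "ACGT"


def pack(kmer: str, strict: bool = True) -> int:
    """Pack a k-mer into an integer, 2 bits per base (illustrative only)."""
    value = 0
    for base in kmer:
        if base in CODE:
            code = CODE[base]
        elif strict:
            # Fail right away instead of storing something wrong.
            raise ValueError(f"cannot 2-bit encode {base!r}")
        else:
            code = 0  # lenient mode: unknown base silently becomes A
        value = (value << 2) | code
    return value


def unpack(value: int, k: int) -> str:
    """Recover a k-mer of length k from its packed integer."""
    bases = []
    for _ in range(k):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))
```

With this sketch, `unpack(pack("ACN", strict=False), 3)` gives back `"ACA"` rather than `"ACN"`, so a later consistency check between neighbouring kmers could fail far from where the N was first seen.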
Anyways I made it so that compression will only happen for --gap-size 1, since no Ns should remain in that case. Otherwise SuperPang will use the uncompressed kmers as hash keys to check whether a kmer was already added to the DBG or not. This will consume more memory but otherwise give similar results. I checked the fix on your dataset and it now runs ok. The fix is already committed to the main branch.
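Roughly, the idea can be sketched like this. It is only an illustration of choosing the hash key based on the gap size, not the code that was committed; `make_key_fn` and `add_kmers` are hypothetical names, and all kmers are assumed to have the same length.

```python
def make_key_fn(gap_size: int):
    """Choose how k-mers are keyed (hypothetical helper, not SuperPang code).

    Packed integers are used only when gap_size == 1, where no Ns should
    remain; otherwise the plain string is the key.
    """
    if gap_size == 1:
        table = {"A": 0, "C": 1, "G": 2, "T": 3}

        def key(kmer: str):
            value = 0
            for base in kmer:
                value = (value << 2) | table[base]  # KeyError outside ACGT
            return value
    else:
        def key(kmer: str):
            return kmer  # uncompressed: more memory, but Ns are safe

    return key


def add_kmers(kmers, gap_size: int):
    """Return the k-mers in input order, skipping ones already seen."""
    key = make_key_fn(gap_size)
    seen = set()
    added = []
    for kmer in kmers:
        k = key(kmer)
        if k not in seen:
            seen.add(k)
            added.append(kmer)
    return added
```

Both branches answer the same membership question; the string keys just cost more memory per kmer.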
Wow, thanks for the quick fix! From a very cursory look, I also started suspecting compress()/decompress(), although the fact that it used to run smoothly with all those Ns before was puzzling... Thanks again!
What can be the trigger of the following AssertionError in edge2kmer()? It is caused by specific scaffolds in one out of 14 assemblies (I identified one such scaffold by binary search, but there are more, probably one or two). Debugging shows that len(kmers) == 0. I've tried different values for identities and kmer length. A minimal example:
One of the offending scaffolds is CALHIT010000021.1.