fpusan / SuperPang

Non-redundant pangenome assemblies from multiple genomes or bins
BSD 3-Clause "New" or "Revised" License
13 stars 1 forks source link

assert len(kmers) == 2 AssertionError #15

Closed alephreish closed 4 months ago

alephreish commented 4 months ago

What can be the trigger of the following AssertionError in edge2kmer()? It is caused by specific scaffolds in one out of 14 assemblies (I identified one such scaffold by binary search but there are more - probably one or two). Debugging shows that len(kmers) == 0. I've tried different values for identities and kmer length.

25/06/2024 17:52:03 109MB   Correcting input sequences with minimap2
25/06/2024 17:52:03 109MB   round   correctedFull   correctedPartial    mismatches  indels  errors
25/06/2024 17:52:04 109MB   1       2797    29      286774  13523   0
25/06/2024 17:52:08 109MB   2       1053    18      77760   19861   1
25/06/2024 17:52:14 109MB   3       475     3       24802   9993    0
25/06/2024 17:52:19 109MB   4       182     2       6123    13336   1
25/06/2024 17:52:26 109MB   5       147     4       2111    13808   1
25/06/2024 17:52:32 109MB   6       119     2       2347    13273   1
25/06/2024 17:52:38 109MB   7       101     3       2045    13339   1
25/06/2024 17:52:44 109MB   8       45      1       359     5470    0
25/06/2024 17:52:50 109MB   9       47      2       289     6790    0
25/06/2024 17:52:57 109MB   10      21      1       164     2128    0
25/06/2024 17:53:03 109MB   11      13      1       169     1419    0
25/06/2024 17:53:09 109MB   12      15      1       281     489     0
25/06/2024 17:53:15 109MB   13      5       2       155     6       0
25/06/2024 17:53:22 109MB   14      5       2       285     69      0
25/06/2024 17:53:28 109MB   15      5       1       176     -18     0
25/06/2024 17:53:34 109MB   16      5       1       143     8       0
25/06/2024 17:53:40 109MB   17      5       1       144     305     0
25/06/2024 17:53:47 109MB   18      1       1       0       304     0
25/06/2024 17:53:53 109MB   19      0       1       0       2       0
25/06/2024 17:53:59 109MB   20      0       1       0       2       0

This is SuperPang version 1.3.0

If using in publications or products please cite:

Puente-Sánchez F, Hoetzinger M, Buck M and Bertilsson S. Exploring intra-species diversity through non-redundant pangenome assemblies. Molecular Ecology Resources (2023) DOI: 10.1111/1755-0998.13826

25/06/2024 17:53:59 243MB   Reading sequences
25/06/2024 17:54:00 278MB   Allocating resources
25/06/2024 17:54:00 409MB   Creating DBG from 3708 sequences (3620 originals)
path_to_conda_env/lib/python3.8/site-packages/graph_tool/draw/cairo_draw.py:1544: RuntimeWarning: Error importing Gtk module: Typelib file for namespace 'GObject', version '2.0' not found; GTK+ drawing will not work.
  warnings.warn(msg, RuntimeWarning)
25/06/2024 17:54:55 986MB       100% bases processed, 3707 sequences, 3514099 vertices, 3513583 edges          
25/06/2024 17:54:55 986MB   Identifying connected components
25/06/2024 17:54:55 1044MB  Working on comp 1/683, 1302145 vertices
25/06/2024 17:54:56 1044MB      Collecting non-branching paths
25/06/2024 17:55:04 1044MB      Building sequence graph out of 2588 non-branching paths
25/06/2024 17:55:05 1044MB      Extracting forward component
25/06/2024 17:56:41 1053MB      Scaffold 0, 1294 NBP vertices, 1391 NBP edges, 1329109 bases, removed 1 bubbles, processing 1290 non-branching paths, done
25/06/2024 17:56:41 1054MB  Working on comp 2/683, 160922 vertices
25/06/2024 17:56:41 1054MB      Collecting non-branching paths
25/06/2024 17:56:42 1055MB      Building sequence graph out of 404 non-branching paths
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "path_to_conda_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "path_to_conda_env/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "path_to_conda_env/lib/python3.8/site-packages/superpang/lib/Assembler.py", line 1040, in reconstruct_sequence
    kmers = cls.edge2kmer(path[0], path[1],  vertex2coords, ref2name, seqDict, ksize)
  File "path_to_conda_env/lib/python3.8/site-packages/superpang/lib/Assembler.py", line 1078, in edge2kmer
    assert len(kmers) == 2
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "path_to_conda_env/bin/SuperPang.py", line 8, in <module>
    sys.exit(cli())
  File "path_to_conda_env/lib/python3.8/site-packages/superpang/scripts/SuperPang.py", line 387, in cli
    main(args, uuid)
  File "path_to_conda_env/lib/python3.8/site-packages/superpang/scripts/SuperPang.py", line 200, in main
    contigs = Assembler(input_minimap2, args.ksize, args.threads, diskdb, args.debug, gap_size = args.gap_size).run(args.minlen, args.mincov, args.bubble_identity_threshold, args.genome_assignment_threshold, args.threads, args.debug)
  File "path_to_conda_env/lib/python3.8/site-packages/superpang/lib/Assembler.py", line 347, in run
    NBP2seq = dict(zip(NBPs, self.multimap(self.reconstruct_sequence, threads, NBPs, self.vertex2coords, self.ref2name, self.seqDict, self.ksize)))
  File "path_to_conda_env/lib/python3.8/site-packages/superpang/lib/Assembler.py", line 72, in multimap
    res = pool.map(fun, range(len(iterable)), chunksize = chunksize)
  File "path_to_conda_env/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "path_to_conda_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
AssertionError

A minimal example:

mkdir -p genomes
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/728/275/GCA_002728275.1_ASM272827v1/GCA_002728275.1_ASM272827v1_genomic.fna.gz | gzip -cd > genomes/GCA_002728275.1.fna
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/938/030/875/GCA_938030875.1_S1_Bin_MAXBIN_164_sub_1/GCA_938030875.1_S1_Bin_MAXBIN_164_sub_1_genomic.fna.gz | gzip -cd > genomes/GCA_938030875.1.fna
SuperPang.py -f genomes/*.fna -o superpang/ -t 20 -k 301 -i 0.95 -b 0.95 --gap-size 10 --force-overwrite

One of the offending scaffolds is CALHIT010000021.1.

fpusan commented 4 months ago

Thanks for the reproducible example. Can you check if it still happens using --gap-size 1? While I appreciate your rationale for adding --gap-size, the assembler was written and tested using sequences without Ns (since Ns got removed by read_fasta) so now they may be breaking things in unexpected ways. Anyways I'll try to download the data and check myself ASAP.

alephreish commented 4 months ago

That's right, --gap-size 1 runs without an issue. Interestingly, CALHIT010000021.1 does not even have Ns.

alephreish commented 4 months ago

Interesting, so it's GCA_002728275.1 that might be the real trouble-maker here and not GCA_938030875.1. GCA_002728275.1 has an exceptionally high number of ambiguities (0.45%), such that some kmers have more than one N. GCA_938030875.1 on the other hand has no Ns at all. Why the issue emerges when I add GCA_938030875.1 to the mix is not clear.

fpusan commented 4 months ago

I managed to reproduce the error from your dataset. Thanks for the super easy commands to reproduce it.

SuperPang compresses nucleotides into 2 bit words in order to save some memory during graph creation, but of course this only works if your alphabet has only 4 letters (ACTG). The presence of Ns was leading to some undefined behaviour instead of throwing an error right away. This was manifesting later when checking that adjacent vertices in the DBG were actually associated to consecutive kmers.

Anyways I made it so that compression will only happen for --gap-size 1, since no Ns should remain in that case. Otherwise SuperPang will use the uncompressed kmers as hash keys to check whether a kmer was already added to the DBG or not, this will consume more memory but otherwise give similar results. I checked the fix on your dataset and it now runs ok. The fix is already commited to the main branch

alephreish commented 4 months ago

Wow, thanks for the quick fix! From a very cursory look, I also started suspecting compress()/decompress() although the fact that it used to run smoothly with all those Ns before was puzzling... Thanks again!