fpusan / SuperPang

Non-redundant pangenome assemblies from multiple genomes or bins
BSD 3-Clause "New" or "Revised" License
13 stars 1 forks source link

Error after identifying connected components #5

Closed cifuj closed 2 years ago

cifuj commented 2 years ago

Hi! I have an error after the step "identifying connected components". I installed SuperPang using conda and I am working with 4 bins (>90% compl <5% contamination) from the same species (ANI > 97%). test-SuperPang.py runs with out problems. The error occurs after running $SuperPang.py -f genomes/*fa -t 20 -o SuperPang --force-overwrite --checkm checkm_sox.txt

Traceback (most recent call last): File "/home/jcifuentes/miniconda3/envs/SuperPang/bin/SuperPang.py", line 291, in main(parse_args()) File "/home/jcifuentes/miniconda3/envs/SuperPang/bin/SuperPang.py", line 144, in main contigs = Assembler(input_minimap2, args.ksize, args.threads).run(args.minlen, args.mincov, args.bubble_identity_threshold, args.genome_assignment_threshold, args.threads) File "/home/jcifuentes/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py", line 322, in run psets = get_psets(comp2nvs[nc1] | comp2nvs[nc2]) File "/home/jcifuentes/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py", line 292, in getpsets assert len(nvs) == 2 AssertionError

The checkM file has the format Bin_Id Completeness Contamination, but I have the same error after running SuperPang with the --assume-complete flag.

fpusan commented 2 years ago

Could you share the input bins with me here in a zip file?

cifuj commented 2 years ago

I'm sorry, but I can't share them. Is there any other information I can provide? I tried running it with another set of bins (5 bins medium- to high-quality >95% ANI) and I have the same error.

fpusan commented 2 years ago

Hm, I wonder what the problem may be. Does test-SuperPang.py work for you? Are these "standard" bacterial/archaeal genomes? something "strange" about them? I have an idea of a couple of things that might be throwing that error, but it's hard to know what's really happening without being able to reproduce it myself (of course it is understandable if you can't share your data).

fpusan commented 2 years ago

Let's test a thing Edit the /home/jcifuentes/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py In line 292 change assert len(nvs_) == 2 for if len(nvs_) != 2: print(nvs_); assert False This will still die but will print some info about what's going wrong. You can share the output (a horrible bunch of numbers) with me.

EDIT: Fixed a typo in the code

cifuj commented 2 years ago

I only obtained this output after the "Extracting forward component" [0, 1, 2, 3]

I changed the order of the bins and SuperPang run until comp 246/718 and then it fails also after: Extracting forward component [0, 1, 2, 3] Traceback (most recent call last): File "/home/jcifuentes/miniconda3/envs/SuperPang/bin/SuperPang.py", line 291, in main(parse_args()) File "/home/jcifuentes/miniconda3/envs/SuperPang/bin/SuperPang.py", line 144, in main contigs = Assembler(input_minimap2, args.ksize, args.threads).run(args.minlen, args.mincov, args.bubble_identity_threshold, args.genome_assignment_threshold, args.threads) File "/home/jcifuentes/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py", line 322, in run psets = get_psets(comp2nvs[nc1] | comp2nvs[nc2]) File "/home/jcifuentes/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py", line 292, in getpsets if len(nvs) != 2: print(nvs_); assert False AssertionError

This is how it look lines 291-293

 for nvs_ in psets.values():
    if len(nvs_) != 2: print(nvs_); assert False
 return psets
fpusan commented 2 years ago

Thanks for trying this. It seems like a different case of a bug I'd fixed before, but I would need more info Can you now try changing the whole get_psets function to this one?

            ### Extract forward component
            def get_psets(nvs):
                """Get a dict with a frozenset of vertices as keys, and the two complementary non-branching paths sharing those vertices as values"""
                psets = defaultdict(list)
                for nv in nvs:
                    pset = frozenset(vertex2NBP[nv])
                    psets[pset].append(nv)
                for fs, nvs_ in psets.items():
                    if len(nvs_) != 2:
                        print('\nERROR!', len(nvs_), fs)
                        print()
                        for nv_ in nvs_:
                            print(nv_, vertex2NBP[nv_])
                        print('\nDONE\n')
                    assert len(nvs_) == 2
                return psets
cifuj commented 2 years ago

This is the output after modifiying Assembler.py SuperPang_error_cifuj_20220407.txt

fpusan commented 2 years ago

Thanks! This gives me something to work with... I'll come back to you if I need some extra stuff

fpusan commented 2 years ago

Ok, try replacing the function with this

            ### Extract forward component
            def get_psets(nvs):
                """Get a dict with a frozenset of vertices as keys, and the two complementary non-branching paths sharing those vertices as values"""
                psets = defaultdict(list)
                for nv in nvs:
                    pset = frozenset(vertex2NBP[nv])
                    psets[pset].append(nv)
                for ps, nvs_ in psets.items():
                    if len(nvs_) > 2: # If more than two NBPs share the same pset, select the first two that are RC of each other
                        for nv1, nv2 in combinations(nvs_): # this is a temporary patch, I should look into why this happens when I get a proper example
                            NBP1, NBP2 = vertex2NBP[nv1], vertex2NBP[nv2]
                            if NBP1 == NBP2[::-1]:
                                psets[ps] = [nv1, nv2]
                                break
                        else:
                            assert False # if there is no RC pair, die anyways
                for nvs_ in psets.values():         
                    assert len(nvs_) == 2
                return psets

Hopefully this'll work

cifuj commented 2 years ago

I obtained the following error.

Traceback (most recent call last): File "/home/jcifuentes/miniconda3/envs/SuperPang/bin/SuperPang.py", line 291, in main(parse_args()) File "/home/jcifuentes/miniconda3/envs/SuperPang/bin/SuperPang.py", line 144, in main contigs = Assembler(input_minimap2, args.ksize, args.threads).run(args.minlen, args.mincov, args.bubble_identity_threshold, args.genome_assignment_threshold, args.threads) File "/home/jcifuentes/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py", line 331, in run psets = get_psets(comp2nvs[nc1] | comp2nvs[nc2]) File "/home/jcifuentes/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py", line 293, in getpsets for nv1, nv2 in combinations(nvs): # this is a temporary patch, I should look into why this happens when I get a proper example TypeError: combinations() missing required argument 'r' (pos 2)

I tried with 2 different set of bins from 2 different species and I obtained the same error after a different number of "Working on comp XX/YY"

fpusan commented 2 years ago

My fault, was missing an argument

            ### Extract forward component
            def get_psets(nvs):
                """Get a dict with a frozenset of vertices as keys, and the two complementary non-branching paths sharing those vertices as values"""
                psets = defaultdict(list)
                for nv in nvs:
                    pset = frozenset(vertex2NBP[nv])
                    psets[pset].append(nv)
                for ps, nvs_ in psets.items():
                    if len(nvs_) > 2: # If more than two NBPs share the same pset, select the first two that are RC of each other
                        for nv1, nv2 in combinations(nvs_, 2): # this is a temporary patch, I should look into why this happens when I get a proper example
                            NBP1, NBP2 = vertex2NBP[nv1], vertex2NBP[nv2]
                            if NBP1 == NBP2[::-1]:
                                psets[ps] = [nv1, nv2]
                                break
                        else:
                            assert False # if there is no RC pair, die anyways
                for nvs_ in psets.values():         
                    assert len(nvs_) == 2
                return psets
cifuj commented 2 years ago

Now I have a different error in the function #Find reverse complement Non-Branching Paths

Traceback (most recent call last): File "/home/jcifuentes/miniconda3/envs/SuperPang/bin/SuperPang.py", line 291, in main(parse_args()) File "/home/jcifuentes/miniconda3/envs/SuperPang/bin/SuperPang.py", line 144, in main contigs = Assembler(input_minimap2, args.ksize, args.threads).run(args.minlen, args.mincov, args.bubble_identity_threshold, args.genome_assignment_threshold, args.threads) File "/home/jcifuentes/miniconda3/envs/SuperPang/lib/python3.8/site-packages/SuperPang/lib/Assembler.py", line 332, in run assert len(psets) == len(comp2nvs[nc1]) == len(comp2nvs[nc1]) # each sequence path should have a reverse complement equivalent (same vertices in the kmer graph, reverse order) AssertionError

fpusan commented 2 years ago

Hi again! I managed to reproduce the issue with data from another user. Version 0.9.4beta1 (available in conda) has a temptative fix for this problem. Let me know if it works for you!

cifuj commented 2 years ago

Thank you very much! It works with the 2 datasets I had problems before.

fpusan commented 2 years ago

Glad to hear. Thanks for testing it!