compbiol / CAMSA

CAMSA: a tool for Comparative Analysis and Merging of Scaffold Assemblies
MIT License

Error computing assembly points conflicts with run_camsa.py #13

Closed arivers closed 5 years ago

arivers commented 6 years ago

I processed two assemblies and a scaffold with fasta2camsa_points.py to create points files. When I run run_camsa.py on the two points files, I get the error below.

Run command:

run_camsa.py  scaffolds.all.camsa.points -o cout2

Error message:

2018-05-03 09:22:06,462 - CAMSA.main      - INFO    - Starting the analysis
2018-05-03 09:22:06,462 - CAMSA.main      - INFO    - Processing input
2018-05-03 09:22:06,505 - CAMSA.main      - INFO    - Merging assembly points from different sources into a set of unique ones.
2018-05-03 09:22:06,536 - CAMSA.main      - INFO    - Processing assemblies' subgroups
2018-05-03 09:22:06,539 - CAMSA.main      - INFO    - Computing assembly points conflicts
Traceback (most recent call last):
  File "/home/adam.rivers/anaconda3/envs/camsaenv/bin/run_camsa.py", line 135, in <module>
    compute_and_update_assembly_points_conflicts(assembly_points_by_ids=merged_assembly_points_by_ids)
  File "/home/adam.rivers/anaconda3/envs/camsaenv/lib/python3.6/site-packages/camsa/core/comparative_analysis.py", line 101, in compute_and_update_assembly_points_conflicts
    sag = construct_sag(assembly_points=assembly_points_by_ids.values())
  File "/home/adam.rivers/anaconda3/envs/camsaenv/lib/python3.6/site-packages/camsa/core/comparative_analysis.py", line 18, in construct_sag
    result.add_edge(u=u, v=v, weight=weight, ap_id=ap.self_id)
TypeError: add_edge() missing 2 required positional arguments: 'u_of_edge' and 'v_of_edge'

The first few lines of the two points files are:

origin  seq1    seq1_or seq2    seq2_or gap_size        cw
scaffolds.all   k141_20552      -       k141_15348      +       -105    ? 
scaffolds.all   k141_35239      -       k141_88199      +       -200    ?
scaffolds.all   k141_54810      +       k141_60870      -       -181    ?
scaffolds.all   k141_60870      -       k141_50925      +       -271    ?
origin  seq1    seq1_or seq2    seq2_or gap_size        cw
Contigs_for_extension_clean     k141_79431      +       k141_39819      -       -202    ?
Contigs_for_extension_clean     k141_39819      -       k141_2749       +       -140    ?
Contigs_for_extension_clean     k141_2749       +       k141_6255       -       -128    ?
Contigs_for_extension_clean     k141_6255       -       k141_57986      -       -140    ?

I installed camsa using pip in an isolated conda environment using Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)

aganezov commented 6 years ago

Hi, thank you for reaching out, for using CAMSA, and for catching the bug.

This error seems to be the result of an API-breaking change in the networkx library (as it transitioned from version 2.0 to 2.1), which CAMSA uses for graph-related tasks.

A temporary workaround may be to ensure that the networkx library in your isolated Anaconda Python environment is version 2.0, not 2.1. Please try that and let me know if it works as a temporary fix.

Meanwhile, I'll see what I can do to ensure that CAMSA works with both older and newer versions of the networkx library.
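
For context, the TypeError in the traceback is what a keyword-argument call produces when a library renames its parameters: the call passes u= and v=, while the newer signature expects u_of_edge and v_of_edge. The sketch below illustrates this with two stand-in functions (hypothetical, not the real networkx methods) that mimic the two signatures. Passing the endpoints positionally works with both, which is one possible way to stay compatible across versions; it is not necessarily the fix CAMSA adopted.

```python
# Stand-ins that mimic the two signatures implied by the traceback.
# They are illustrative only and are NOT the real networkx methods.
def add_edge_old(u, v, **attr):
    return (u, v, attr)


def add_edge_new(u_of_edge, v_of_edge, **attr):
    return (u_of_edge, v_of_edge, attr)


def add_edge_portable(add_edge, u, v, **attr):
    """Call add_edge with positional endpoints so parameter names do not matter."""
    return add_edge(u, v, **attr)


# Positional endpoints work with either signature.
for fn in (add_edge_old, add_edge_new):
    print(add_edge_portable(fn, "k141_20552", "k141_15348", weight=1, ap_id="ap1"))

# Keyword endpoints using the old names fail against the new signature:
# u and v are swallowed by **attr and the required parameters stay unfilled.
try:
    add_edge_new(u="k141_20552", v="k141_15348", weight=1)
except TypeError as exc:
    print("keyword call fails:", exc)
```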

arivers commented 6 years ago

Downgrading to networkx 2.0 solved it. Thanks!

As a quick workaround you can add a max dependency version to the setup.py file:

install_requires=[ 'networkx>=1.11,<=2', ...]
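
To check whether an existing environment already falls inside that pinned range, something like the following sketch could be used. The helper names are hypothetical, the comparison is a simplified stand-in for real version-specifier matching (it ignores pre-release suffixes such as "rc1"), and it assumes Python 3.8+ for importlib.metadata.

```python
import re


def version_tuple(version, width=3):
    """Turn '2.0' or '2.1.1' into a zero-padded tuple of ints, e.g. (2, 0, 0)."""
    parts = []
    for piece in version.split(".")[:width]:
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    parts += [0] * (width - len(parts))
    return tuple(parts)


def is_supported_networkx(version, low="1.11", high="2"):
    """Simplified check mirroring the 'networkx>=1.11,<=2' pin."""
    return version_tuple(low) <= version_tuple(version) <= version_tuple(high)


def installed_networkx_version():
    """Return the installed networkx version string, or None if unavailable."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:  # Python older than 3.8
        return None
    try:
        return version("networkx")
    except PackageNotFoundError:
        return None


found = installed_networkx_version()
if found is None:
    print("networkx does not appear to be installed")
elif is_supported_networkx(found):
    print("networkx", found, "is inside the pinned range")
else:
    print("networkx", found, "is outside the pinned range")
```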

aganezov commented 6 years ago

I've updated the CAMSA distribution so that this problem should be fixed (networkx 2.0 is now the newest supported version). The new 1.1.0b15 version is online and available for install through pip, conda, etc.

Please try it out and let me know how it works. Also, the camsa_points2fasta.py script is corrected in the new 1.1.0b15 version.

arivers commented 6 years ago

I upgraded to 1.1.0b15 and reran camsa_points2fasta.py but I'm still having trouble generating a fasta. The error I get is this:

CRITICAL:root:Fragment k141_25601 or k141_65179 which is present assembly points is not present in supplied fasta file. Exiting.

I think the problem is likely with how I'm using the script. I'm not sure which fastas and points files to feed into the system. I'm also not sure that I fed the correct scaffolds and contigs into fasta2camsa_points.py. I'll explain what I used and maybe you can tell me if it matches with what the program expects.

I had a small subset of scaffolds (35) created from a metagenomic assembly using spades and extended by lab work doing 5' and 3' RACE sequencing. I treated this file as my scaffolds file and added contigs from the original assembly and an alternate assembly created with Megahit. My goal was to extend and connect the 35 core contigs as much as possible.

Here is what I ran:

fasta2camsa_points.py megahit_contigs.fasta spades_contigs.fasta lab_confirmed_scaffolds.fasta -o test
run_camsa.py  spades_contigs.camsa.points -o testout1
camsa_points2fasta.py --fasta lab_confirmed_scaffolds.fasta --points ab_confirmed_scaffolds.points -o testout2

aganezov commented 6 years ago

When using the camsa_points2fasta.py script, the --fasta flag should refer to the fasta file with the contig sequences (i.e., the contigs that make up each of the scaffold assemblies merged by CAMSA).

In other words, let's say you have two scaffold assemblies, s1.fasta and s2.fasta, built on the same set of contigs, c.fasta. The following CAMSA pipeline can be used:

fasta2camsa_points.py c.fasta s1.fasta s2.fasta -o wd
run_camsa.py wd/s1.camsa.points wd/s2.camsa.points -o wd/camsa_results
camsa_points2fasta.py --fasta c.fasta --points wd/camsa_results/merged/merged.camsa.points -o merged.fasta

Please also note that CAMSA works on order and orientation when producing the merged scaffold assembly. Gap merging is somewhat tangential to CAMSA's purpose, and I would suggest performing gap merging after producing merged.fasta.
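
Before running camsa_points2fasta.py, it may help to confirm that every fragment named in the points file actually appears in the fasta file passed via --fasta, since that mismatch is what the "not present in supplied fasta file" error reports. The sketch below is a hypothetical diagnostic, not part of CAMSA. It assumes the points file has the whitespace-separated header shown earlier in this thread (with seq1 and seq2 columns and no whitespace inside fields), and that a fasta record is identified by the first token after ">"; CAMSA's own matching rule may differ.

```python
def fasta_ids(lines):
    """Collect sequence names: the first whitespace-delimited token after '>'."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def points_fragments(lines):
    """Collect fragment names from the seq1 and seq2 columns of a points file."""
    fragments = set()
    seq1 = seq2 = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if seq1 is None:
            # First non-blank line is assumed to be the header.
            seq1, seq2 = fields.index("seq1"), fields.index("seq2")
            continue
        if len(fields) > max(seq1, seq2):
            fragments.update((fields[seq1], fields[seq2]))
    return fragments


def missing_fragments(fasta_lines, points_lines):
    """Return fragment names used in the points data but absent from the fasta data."""
    return sorted(points_fragments(points_lines) - fasta_ids(fasta_lines))


# Small in-memory demo using one row from the points excerpt above and a
# made-up fasta record (the sequence and description are placeholders).
demo_points = [
    "origin\tseq1\tseq1_or\tseq2\tseq2_or\tgap_size\tcw",
    "scaffolds.all\tk141_20552\t-\tk141_15348\t+\t-105\t?",
]
demo_fasta = [">k141_20552 placeholder description", "ACGT"]
print(missing_fragments(demo_fasta, demo_points))
```

For real data, the same functions accept open file handles, for example missing_fragments(open("c.fasta"), open("merged.camsa.points")). An empty result suggests the fasta file covers all fragments referenced by the points file.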

aganezov commented 5 years ago

Closing the issue as the problem seems to be resolved.