Shamir-Lab / SCAPP

SCAPP is a plasmid assembly tool. This tool is described in our paper: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01068-z
MIT License
29 stars 6 forks source link

SCAPP error on Flye assembly #40

Closed brambloemen closed 4 months ago

brambloemen commented 4 months ago

I'm using SCAPP as part of a snakemake workflow, on Flye --meta assemblies of ONT reads, after converting the assembly_graph.gfa to fastg format. For some datasets, it runs perfectly fine, but for others it seems to run into issues and I can't seem to figure out what is the cause.

The snakemake log:

Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=250000, mem_mib=238419, disk_mb=20526, disk_mib=19576, cpus_per_task=20
Select jobs to execute...
Execute 1 jobs...

[Tue Jul  9 17:47:34 2024]
localrule scapp:
    input: results/F4D4/SCAPP/assembly_graph.fastg, results/F4D4/SCAPP/F4D4_graphmap.bam
    output: results/F4D4/SCAPP/assembly_graph.confident_cycs.fasta, results/F4D4/SCAPP/intermediate_files/assembly_graph.cycs.paths.txt
    log: logs/SCAPP/F4D4_scapp.log
    jobid: 0
    reason: Forced execution
    wildcards: sample=F4D4
    resources: mem_mb=250000, mem_mib=238419, disk_mb=20526, disk_mib=19576, tmpdir=/tmp, cpus_per_task=20

Activating conda environment: .snakemake/conda/c5088303afbc54515d56809f4848cbbb_
Getting scores of graph nodes
Getting scores of graph nodes
Getting scores of graph nodes
Getting scores of graph nodes
Getting scores of graph nodes
Finding plasmid-specific genes with BLAST
No blast db provided, creating one
Running command: makeblastdb  -in results/F4D4/SCAPP/assembly_graph.fastg -blastdb_version 4 -dbtype nucl -out results/F4D4/SCAPP/intermediate_files/assembly_graph.fastg.blastdb
Running blast search for gene (nt) sequences in blast db
Running command: blastn -task megablast -db results/F4D4/SCAPP/intermediate_files/assembly_graph.fastg.blastdb -query /scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/data/nt/nt1 -out results/F4D4/SCAPP/intermediate_files/nt1_blastdb.out -num_threads 20 -outfmt "6 qseqid sseqid length pident qlen slen evalue"
Running command: blastn -task megablast -db results/F4D4/SCAPP/intermediate_files/assembly_graph.fastg.blastdb -query /scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/data/nt/nt2 -out results/F4D4/SCAPP/intermediate_files/nt2_blastdb.out -num_threads 20 -outfmt "6 qseqid sseqid length pident qlen slen evalue"
Running command: blastn -task megablast -db results/F4D4/SCAPP/intermediate_files/assembly_graph.fastg.blastdb -query /scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/data/nt/nt3 -out results/F4D4/SCAPP/intermediate_files/nt3_blastdb.out -num_threads 20 -outfmt "6 qseqid sseqid length pident qlen slen evalue"
Running blast search for protein (aa) sequences in blast db
Running command: tblastn -db results/F4D4/SCAPP/intermediate_files/assembly_graph.fastg.blastdb -db_gencode 11 -query /scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/data/aa/aa1 -out results/F4D4/SCAPP/intermediate_files/aa1_blastdb.out -num_threads 4 -outfmt "6 qseqid sseqid length pident qlen slen evalue"
Running command: tblastn -db results/F4D4/SCAPP/intermediate_files/assembly_graph.fastg.blastdb -db_gencode 11 -query /scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/data/aa/aa2 -out results/F4D4/SCAPP/intermediate_files/aa2_blastdb.out -num_threads 4 -outfmt "6 qseqid sseqid length pident qlen slen evalue"
56 contigs hit
Writing list of hit contigs in: results/F4D4/SCAPP/intermediate_files/hit_seqs.out
Removing intermediate files...
Starting SCAPP plasmid finding
('edge_512_len_29701_cov_44.0',)

('edge_997_len_13204_cov_28.0',)

('edge_65_len_1138_cov_6419.0',)

('edge_1011_len_11132_cov_13.0',)

('edge_791_len_14325_cov_51.0',)

('edge_1031_len_13315_cov_19.0',)

('edge_1018_len_40902_cov_8.0',)

('edge_1033_len_1354_cov_13.0',)

('edge_1012_len_8473_cov_39.0',)

('edge_1029_len_106881_cov_64.0',)

('edge_663_len_63757_cov_51.0',)

('edge_973_len_26280_cov_16.0',)

('edge_488_len_8702_cov_246.0',)

('edge_847_len_81768_cov_43.0',)

("edge_1009_len_68255_cov_19.0'",)

('edge_822_len_6899_cov_2.0',)

('edge_484_len_7533_cov_236.0',)

('edge_1022_len_53209_cov_12.0',)

('edge_548_len_37281_cov_30.0',)

('edge_1038_len_20632_cov_222.0',)

('edge_317_len_1203_cov_1655.0',)

('edge_986_len_37388_cov_3.0',)

('edge_336_len_35997_cov_314.0',)

('edge_392_len_1134_cov_3356.0',)

('edge_640_len_326266_cov_34.0',)

('edge_1039_len_13143_cov_58.0',)

('edge_899_len_3266_cov_0.0',)

('edge_632_len_1187_cov_1.0',)

('edge_545_len_41701_cov_123.0',)

("edge_333_len_2201_cov_388.0'",)

('edge_675_len_5943_cov_11.0',)

('edge_835_len_10322_cov_41.0',)

('edge_993_len_8163_cov_31.0',)

('edge_620_len_1816_cov_133.0',)

('edge_191_len_1196_cov_6827.0',)

('edge_1035_len_16969_cov_5.0',)

('edge_471_len_39374_cov_584.0',)

('edge_533_len_1128_cov_3970.0',)

('edge_1024_len_12784_cov_50.0',)

('edge_987_len_6838_cov_1.0',)

('edge_222_len_1095_cov_1.0',)

('edge_686_len_23431_cov_13.0',)

('edge_1003_len_22382_cov_7.0',)

('edge_858_len_35413_cov_4.0',)

('edge_941_len_12856_cov_437.0',)

('edge_493_len_8174_cov_13.0',)

('edge_343_len_8431_cov_246.0',)

('edge_566_len_36819_cov_9.0',)

('edge_1030_len_82399_cov_14.0',)

("edge_179_len_2051_cov_2.0'",)

('edge_991_len_25873_cov_68.0',)

('edge_1034_len_14944_cov_20.0',)

('edge_842_len_14524_cov_25.0',)

('edge_1037_len_12736_cov_19.0',)

('edge_672_len_1024_cov_261.0',)

('edge_515_len_1130_cov_13727.0',)

('edge_715_len_2875_cov_4.0',)

('edge_1023_len_37372_cov_6.0',)

('edge_3_len_2042752_cov_4700.0',)

('edge_852_len_6946_cov_6.0',)

('edge_998_len_100319_cov_38.0',)

('edge_193_len_1465_cov_6226.0',)

================== Added paths ====================
2 nodes remain in component
[Tue Jul  9 17:50:36 2024]
Error in rule scapp:
    jobid: 0
    input: results/F4D4/SCAPP/assembly_graph.fastg, results/F4D4/SCAPP/F4D4_graphmap.bam
    output: results/F4D4/SCAPP/assembly_graph.confident_cycs.fasta, results/F4D4/SCAPP/intermediate_files/assembly_graph.cycs.paths.txt
    log: logs/SCAPP/F4D4_scapp.log (check log file(s) for error details)
    conda-env: /scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_
    shell:

        scapp -g results/F4D4/SCAPP/assembly_graph.fastg -o results/F4D4/SCAPP -k 0 -b results/F4D4/SCAPP/F4D4_graphmap.bam -p 20 2>logs/SCAPP/F4D4_scapp.log

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Storing output in storage.
WorkflowError:
At least one job did not complete successfully.
srun: error: bioit-rd: task 0: Exited with exit code 1

The error given by scapp:

Traceback (most recent call last):
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/bin/scapp", line 10, in <module>
    sys.exit(main())
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/scapp.py", line 347, in main
    PARAMS.MAX_CV, PARAMS.MIN_LENGTH)
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/recycle.py", line 193, in run_scapp
    path_set = process_component(COMP, G, max_k, min_length, max_CV, SEQS, bamfile, pool, use_scores, use_genes, num_procs)
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/scapp_utils.py", line 674, in process_component
    path_tuples.append((get_wgtd_path_coverage_CV(p,G,SEQS,max_k_val=max_k), p))
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/scapp_utils.py", line 184, in get_wgtd_path_coverage_CV
    mean, std = get_path_mean_std(path, G, seqs, max_k_val,discount=True)
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/scapp_utils.py", line 229, in get_path_mean_std
    covs = np.array(get_path_covs(path,G,discount))
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/scapp_utils.py", line 215, in get_path_covs
    covs = [get_discounted_node_cov(n,path,G) for n in path]
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/scapp_utils.py", line 215, in <listcomp>
    covs = [get_discounted_node_cov(n,path,G) for n in path]
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/c5088303afbc54515d56809f4848cbbb_/lib/python3.7/site-packages/scapp/scapp_utils.py", line 207, in get_discounted_node_cov
    node_cov *= in_path_cov/(non_path_cov + in_path_cov)
ZeroDivisionError: float division by zero
dpellow commented 4 months ago

SCAPP is not really designed to work with metaflye on long reads but it may be possible so I'd like to see if we can figure out what is wrong. Do you have the intermediate SCAPP files and log files from this run and can you share them? Is it possible to share the fastg that is causing this error as well?

brambloemen commented 4 months ago

Here are the intermediate files: intermediate_files.tar.gz

I used it quite a lot before on flye assemblies, and until now it has worked quite well on almost all datasets. I'm trying to look into what changed in my pipeline that caused this issue.

dpellow commented 4 months ago

OK, good to hear that it worked on assemblies before! The error message shows that there is a divide by zero error where it tries to divide one coverage by the other. Looking in your fastg file (just do a text search for cov_0) I see there are sequences that are said to have 0 coverage. I don't know exactly what this means or why it might be the case in this sample. If you can solve the issue of why there are sequences with 0 coverage in the graph, then SCAPP should work as expected. (Note you could just replace all places it says cov_0.0 with cov_0.0001 and it should work, but it's still worth trying to figure out what is going on in this sample)

brambloemen commented 4 months ago

Thank you, I was indeed able to fix it by substituting the cov0.0 with cov0.0001 when depth of coverage was 0 in the metaflye_gfa2fastg script. The issue is caused by the way flye polishing works, with no reads being mapped against some short edges after polishing: https://github.com/mikolmogorov/Flye/issues/473