PacificBiosciences / pb-metagenomics-tools

Tools and pipelines tailored to using PacBio HiFi Reads for metagenomics
BSD 3-Clause Clear License

Job execution failed due to DAStool step #13

Closed CaroleBelliardo closed 2 years ago

CaroleBelliardo commented 2 years ago

Hello,

I ran your workflow on a machine with 60 CPUs and 500 GB of RAM, using the Slurm manager and a Singularity install, but my job crashes at 50% completion. The slurm.err file displays the following error message:

"output: 4-DAStool/poivron_I_oct__hifi_reads_DASTool_bins, 4-DAStool/poivron_I_oct__hifi_reads.complete.txt
log: logs/poivron_I_oct__hifi_reads.RunDAStool.log (check log file(s) for error message)
conda-env: /lerins/hub/projects/25_IPN_Metag/HiFi-MAG-Pipeline/.snakemake/conda/09d31837be12d68e9ae482cebc99e205
shell:
DAS_Tool -i 4-DAStool/poivron_I_oct__hifi_reads.linear-circ.tsv,4-DAStool/poivron_I_oct__hifi_reads.full.tsv -c inputs/poivron_I_oct__hifi_reads.contigs.fasta -l lincirc,full -o 4-DAStool/poivron_I_oct__hifi_reads --search_engine diamond --write_bins 1 -t 60 &> logs/poivron_I_oct__hifi_reads.RunDAStool.log && touch 4-DAStool/poivron_I_oct__hifi_reads.complete.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message"

And the log file /lerins/hub/projects/25_IPN_Metag/HiFi-MAG-Pipeline/logs/poivron_I_oct__hifi_reads.RunDAStool.log says:

"Running DAS Tool using 60 threads.
predicting genes using Prodigal V2.6.3: February, 2016
identifying single copy genes using
mv: cannot stat '4-DAStool/poivron_I_oct__hifi_reads_proteins.faa.scg': No such file or directory
mv: cannot stat '4-DAStool/poivron_I_oct__hifi_reads_proteins.faa.scg': No such file or directory
rm: cannot remove '4-DAStool/poivron_I_oct__hifi_reads_proteins.faa.findSCG.b6': No such file or directory
rm: cannot remove '4-DAStool/poivron_I_oct__hifi_reads_proteins.faa.scg.candidates.faa': No such file or directory
rm: cannot remove '4-DAStool/poivron_I_oct__hifi_reads_proteins.faa.all.b6': No such file or directory
single copy gene prediction using diamond failed. Aborting"

Could you please help me understand what happened and what I could do to fix it?
Thanks for your help, I can't wait to see the results of your amazing pipeline!

dportik commented 2 years ago

Hello, It looks like diamond may have failed to find any matches between the input bins and the reference set of single copy genes included with DAS_Tool (see error reported here). Can you verify that 4-DAStool/poivron_I_oct__hifi_reads.linear-circ.tsv and 4-DAStool/poivron_I_oct__hifi_reads.full.tsv contain contigs that are assigned to bins?
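
One way to run that check is sketched below. This is a minimal sketch, not part of the pipeline: it assumes each row of the table is a contig name followed by a bin name separated by whitespace (as in the example row quoted later in this thread), and the function name `count_contigs_per_bin` is made up for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_contigs_per_bin(tsv_path):
    """Return a Counter mapping bin name -> number of contigs assigned to it.

    Assumes each non-empty row is "<contig> <bin>" separated by whitespace.
    """
    counts = Counter()
    for line in Path(tsv_path).read_text().splitlines():
        fields = line.split()
        # Skip blank or malformed rows instead of failing on them.
        if len(fields) < 2:
            continue
        counts[fields[1]] += 1
    return counts


if __name__ == "__main__":
    # Small demonstration on a throwaway file containing one row.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.linear-circ.tsv"
        demo.write_text("contig_19801\tpoivron_I_oct__hifi_reads_bin.circ10\n")
        for bin_name, n_contigs in count_contigs_per_bin(demo).items():
            print(bin_name, n_contigs)
```

An empty result means no contigs were assigned to any bin in that file; a result with a single bin holding a single contig is also worth a closer look.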

CaroleBelliardo commented 2 years ago

Thanks for your fast answer. Yes, both files contain contigs assigned to bins, for example: "contig_19801 poivron_I_oct__hifi_reads_bin.circ10 "

dportik commented 2 years ago

If there are many contigs assigned to bins (and not just one bin), it is not clear what is causing the error. It may be specific to diamond. You could try running the search using blast or usearch instead, and see if the error persists.

To change this, you would need to alter the config.yaml file:

dastool:
  # The number of threads to use for DAS Tool.
  threads: 24
  # The engine for single copy gene searching, choices include:
  # blast, diamond, usearch.
  search: "diamond"

If you continue to encounter errors, the issue may be due to the sequences contained in the bins. Is it possible that the bin(s) do not contain valid bacteria/archaea sequences?

CaroleBelliardo commented 2 years ago

I tried running the workflow with BLAST and it seems to be continuing past that step... I hope it finishes properly.

nvpatin commented 2 years ago

I would like to add to this thread because my HiFi-MAG-Pipeline is also failing due to DASTool, although I think it is a different issue. I believe for some samples DASTool fails to find any high-quality bins, and in these cases it aborts the job and the entire workflow fails. Here is the Snakemake log file:

[Thu Nov 18 12:58:59 2021]
rule RunDAStool:
    input: 4-DAStool/Las19c135_5m-3.full.tsv, 4-DAStool/Las19c135_5m-3.linear-circ.tsv, inputs/Las19c135_5m-3.contigs.fasta
    output: 4-DAStool/Las19c135_5m-3_DASTool_bins, 4-DAStool/Las19c135_5m-3.complete.txt
    log: logs/Las19c135_5m-3.RunDAStool.log
    jobid: 161
    benchmark: benchmarks/Las19c135_5m-3.RunDAStool.tsv
    wildcards: sample=Las19c135_5m-3
    threads: 24
    resources: tmpdir=/tmp

Activating conda environment: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c
Waiting at most 5 seconds for missing files.
MissingOutputException in line 328 of /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/Snakefile-hifimags:
Job Missing files after 5 seconds:
4-DAStool/Las19c135_5m-3_DASTool_bins
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 161 completed successfully, but some output files are missing. 161
Removing output files of failed job RunDAStool since they might be corrupted:
4-DAStool/Las19c135_5m-3.complete.txt
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

And here is the DASTool log file for the sample in question (however, I have seen other reasons for DASTool aborting, it is not always just because of low quality bins):

Running DAS Tool using 24 threads.
predicting genes using Prodigal V2.6.3: February, 2016
identifying single copy genes using blastp: 2.12.0+ Package: blast 2.12.0, build Jul 13 2021 09:03:00
calculating contig lengths.
evaluating bin-sets
No bins with bin-score >0.5 found. Adjust score_threshold to report bins with lower quality.
Aborting.

dportik commented 2 years ago

@nvpatin the short answer is that there is either unexpected behavior happening with DAS_Tool or you do not have any high quality bins in your assembly. The default score for DAS_Tool to keep a bin is >0.5, which is roughly the equivalent of 50% completeness in CheckM.

I've made two changes to the DAS rule, please download the new Snakefile-hifimags file and use it on these datasets. You do not need to download anything else for the workflow, just the snakefile, and can simply replace the one you currently have with the new version. It will run correctly with the environments you installed previously.

I turned on debug mode for DAS_Tool, so if anything crashes we will have a much better idea of why. I've also lowered the bin score to 0.05 to see if the issue is bin quality or something else. If you have another error, please copy/paste the contents of the log file.

dportik commented 2 years ago

@CaroleBelliardo I updated the Snakefile-hifimags file to allow debugging of DAS_Tool. If you could, please use the most recent version of this file to execute the workflow.

I suspect your initial issue has something to do with how diamond was installed. If you are able, please run the workflow using the diamond option and if the error occurs again, copy and paste the contents of the logs/SAMPLE.RunDAStool.log here.

nvpatin commented 2 years ago

Thanks Daniel. The workflow got farther than it did before, but still ended with an error. Here is the Snakemake error:

1 of 112 steps (1%) done
Select jobs to execute...

[Thu Nov 18 15:45:17 2021]
rule RunDAStool:
    input: 4-DAStool/1903c118_23m-3.full.tsv, 4-DAStool/1903c118_23m-3.linear-circ.tsv, inputs/1903c118_23m-3.contigs.fasta
    output: 4-DAStool/1903c118_23m-3_DASTool_bins, 4-DAStool/1903c118_23m-3.complete.txt
    log: logs/1903c118_23m-3.RunDAStool.log
    jobid: 31
    benchmark: benchmarks/1903c118_23m-3.RunDAStool.tsv
    wildcards: sample=1903c118_23m-3
    threads: 24
    resources: tmpdir=/tmp

Activating conda environment: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c
[Thu Nov 18 16:00:46 2021]
Error in rule RunDAStool:
    jobid: 31
    output: 4-DAStool/1903c118_23m-3_DASTool_bins, 4-DAStool/1903c118_23m-3.complete.txt
    log: logs/1903c118_23m-3.RunDAStool.log (check log file(s) for error message)
    conda-env: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c
    shell:
        DAS_Tool -i 4-DAStool/1903c118_23m-3.linear-circ.tsv,4-DAStool/1903c118_23m-3.full.tsv -c inputs/1903c118_23m-3.contigs.fasta -l lincir$
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/log/2021-11-18T152727.088103.snakemake.log

And here is the RunDasTool.log file for the sample that the workflow failed on:

Building a new DB, current time: 11/18/2021 15:56:02
New DB name: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/1903c118_23m-3_proteins.faa
New DB title: 4-DAStool/1903c118_23m-3_proteins.faa
Sequence type: Protein
Deleted existing Protein BLAST database named /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/1903c118_23m-3_proteins.f$
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 545 sequences in 2.11301 seconds.

verifying selected SCGs...
Warning: [blastp] Examining 5 or more matches is recommended
starting annotations of single copy cogs...
successfully finished
no single copy genes found. Aborting

dportik commented 2 years ago

@nvpatin can you please copy/paste the complete contents of /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/log/2021-11-18T152727.088103.snakemake.log or attach the file?

Can you also copy/paste the complete contents of logs/SAMPLE.RunDAStool.log or attach the file? The contents you pasted do not look correct. The debug mode will produce a much more complete log file, and this information is key. It should start out looking like this:

DAS Tool run on  Thu Nov 18 13:01:19 PST 2021

User environment details:
Rscript path:  /dept/appslab/projects/old/2021/dp_snake-HiFi-MAG-Pipeline/.snakemake/conda/dc630e7d/bin/Rscript
Rscript version:  R scripting front-end version 4.1.1 (2021-08-10)
pullseq path:  /dept/appslab/projects/old/2021/dp_snake-HiFi-MAG-Pipeline/.snakemake/conda/dc630e7d/bin/pullseq
pullseq version:  Version: 1.0.2 Name lookup method: UTHASH
prodigal path:  /dept/appslab/projects/old/2021/dp_snake-HiFi-MAG-Pipeline/.snakemake/conda/dc630e7d/bin/prodigal
prodigal version:  Prodigal V2.6.3: February, 2016
ruby path:  /dept/appslab/projects/old/2021/dp_snake-HiFi-MAG-Pipeline/.snakemake/conda/dc630e7d/bin/ruby
ruby version:  ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
diamond path:  /dept/appslab/projects/old/2021/dp_snake-HiFi-MAG-Pipeline/.snakemake/conda/dc630e7d/bin/diamond
diamond version:  diamond version 2.0.13

Running DAS Tool using 24 threads.
predicting genes using Prodigal V2.6.3: February, 2016
identifying single copy genes using diamond version 2.0.13

Without seeing more, there is little I can do to troubleshoot.

CaroleBelliardo commented 2 years ago

Thank you for the update on the workflow. It has now been running for a long time, maybe because blast is slower than diamond. I think it is working, but for the next job I will use the new version.

dportik commented 2 years ago

@CaroleBelliardo Yes I would strongly recommend using the diamond option. Please keep me updated as to whether or not DAS_Tool finished successfully.

nvpatin commented 2 years ago

Sure, the complete content of 2021-11-18T152727.088103.snakemake.log is:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 24
Rules claiming more threads will be scaled down.
Job stats:
job                    count    min threads    max threads
-------------------  -------  -------------  -------------
PlotCheckM                14              1              1
PrepBatchFile             14              1              1
RunCheckM                 14             24             24
RunDAStool                13             24             24
RunGTDBTkIndividual       14             24             24
SummarizeCheckM           14              1              1
SummarizeResults          14              1              1
SummaryPlots              14              1              1
all                        1              1              1
total                    112              1             24

Select jobs to execute...

[Thu Nov 18 15:27:29 2021]
rule RunDAStool:
    input: 4-DAStool/1903c119_11m-3.full.tsv, 4-DAStool/1903c119_11m-3.linear-circ.tsv, inputs/1903c119_11m-3.contigs.fasta
    output: 4-DAStool/1903c119_11m-3_DASTool_bins, 4-DAStool/1903c119_11m-3.complete.txt
    log: logs/1903c119_11m-3.RunDAStool.log
    jobid: 44
    benchmark: benchmarks/1903c119_11m-3.RunDAStool.tsv
    wildcards: sample=1903c119_11m-3
    threads: 24
    resources: tmpdir=/tmp

Activating conda environment: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c
[Thu Nov 18 15:45:17 2021]
Finished job 44.
1 of 112 steps (1%) done
Select jobs to execute...

[Thu Nov 18 15:45:17 2021]
rule RunDAStool:
    input: 4-DAStool/1903c118_23m-3.full.tsv, 4-DAStool/1903c118_23m-3.linear-circ.tsv, inputs/1903c118_23m-3.contigs.fasta
    output: 4-DAStool/1903c118_23m-3_DASTool_bins, 4-DAStool/1903c118_23m-3.complete.txt
    log: logs/1903c118_23m-3.RunDAStool.log
    jobid: 31
    benchmark: benchmarks/1903c118_23m-3.RunDAStool.tsv
    wildcards: sample=1903c118_23m-3
    threads: 24
    resources: tmpdir=/tmp

Activating conda environment: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c
[Thu Nov 18 16:00:46 2021]
Error in rule RunDAStool:
    jobid: 31
    output: 4-DAStool/1903c118_23m-3_DASTool_bins, 4-DAStool/1903c118_23m-3.complete.txt
    log: logs/1903c118_23m-3.RunDAStool.log (check log file(s) for error message)
    conda-env: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c
    shell:
        DAS_Tool -i 4-DAStool/1903c118_23m-3.linear-circ.tsv,4-DAStool/1903c118_23m-3.full.tsv -c inputs/1903c118_23m-3.contigs.fasta -l lincirc,full -o 4-DAStool/1903c118_23m-3 --search_e$
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/log/2021-11-18T152727.088103.snakemake.log

And the complete contents of 1903c118_23m-3_DasTool.log (in the "4-DAStool" directory) are:

DAS Tool run on Thu Nov 18 15:45:19 CST 2021

User environment details:
Rscript path: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/bin/Rscript
Rscript version: R scripting front-end version 4.1.1 (2021-08-10)
pullseq path: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/bin/pullseq
pullseq version: Version: 1.0.2 Name lookup method: UTHASH
prodigal path: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/bin/prodigal
prodigal version: Prodigal V2.6.3: February, 2016
ruby path: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/bin/ruby
ruby version: ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
blastp path: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/bin/blastp
blastp version: blastp: 2.12.0+ Package: blast 2.12.0, build Jul 13 2021 09:03:00

Warning: scaffolds2bin file is empty: 4-DAStool/1903c118_23m-3.full.tsv
Running DAS Tool using 24 threads.
predicting genes using Prodigal V2.6.3: February, 2016
identifying single copy genes using blastp: 2.12.0+ Package: blast 2.12.0, build Jul 13 2021 09:03:00

database name of all proteins is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.all.faa
database name of SCGs is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.scg.faa
database lookup is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.scg.lookup

Building a new DB, current time: 11/18/2021 15:48:20
New DB name: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/bac.all.faa
New DB title: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.all.faa
Sequence type: Protein
Deleted existing Protein BLAST database named /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/bac.all.faa
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 48878 sequences in 1.34969 seconds.

finding SCG candidates...

Building a new DB, current time: 11/18/2021 15:49:52
New DB name: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/1903c118_23m-3_proteins.faa
New DB title: 4-DAStool/1903c118_23m-3_proteins.faa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 545 sequences in 0.022619 seconds.

verifying selected SCGs...
Warning: [blastp] Examining 5 or more matches is recommended
starting annotations of single copy cogs...
successfully finished
database name of all proteins is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.all.faa
database name of SCGs is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.scg.faa
database lookup is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.scg.lookup

Building a new DB, current time: 11/18/2021 15:54:25
New DB name: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/arc.all.faa
New DB title: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.all.faa
Sequence type: Protein
Deleted existing Protein BLAST database named /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/arc.all.faa
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 198238 sequences in 5.23223 seconds.

finding SCG candidates...

Building a new DB, current time: 11/18/2021 15:56:02
New DB name: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/1903c118_23m-3_proteins.faa
New DB title: 4-DAStool/1903c118_23m-3_proteins.faa
Sequence type: Protein
Deleted existing Protein BLAST database named /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/1903c118_23m-3_proteins.faa
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 545 sequences in 2.11301 seconds.

verifying selected SCGs...
Warning: [blastp] Examining 5 or more matches is recommended
starting annotations of single copy cogs...
successfully finished
no single copy genes found. Aborting

And finally, the contents of 1903c118_23m-3.RunDASTool.log (in the "logs" directory) are:

Warning: scaffolds2bin file is empty: 4-DAStool/1903c118_23m-3.full.tsv
Running DAS Tool using 24 threads.
predicting genes using Prodigal V2.6.3: February, 2016
identifying single copy genes using blastp: 2.12.0+ Package: blast 2.12.0, build Jul 13 2021 09:03:00
database name of all proteins is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.all.faa
database name of SCGs is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.scg.faa
database lookup is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.scg.lookup

Building a new DB, current time: 11/18/2021 15:48:20
New DB name: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/bac.all.faa
New DB title: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.all.faa
Sequence type: Protein
Deleted existing Protein BLAST database named /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/bac.all.faa
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 48878 sequences in 1.34969 seconds.

finding SCG candidates...

Building a new DB, current time: 11/18/2021 15:49:52
New DB name: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/1903c118_23m-3_proteins.faa
New DB title: 4-DAStool/1903c118_23m-3_proteins.faa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 545 sequences in 0.022619 seconds.

verifying selected SCGs...
Warning: [blastp] Examining 5 or more matches is recommended
starting annotations of single copy cogs...
successfully finished
database name of all proteins is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.all.faa
database name of SCGs is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.scg.faa
database lookup is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.scg.lookup

Building a new DB, current time: 11/18/2021 15:54:25
New DB name: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/arc.all.faa
New DB title: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.all.faa
Sequence type: Protein
Deleted existing Protein BLAST database named /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/arc.all.faa
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 198238 sequences in 5.23223 seconds.

finding SCG candidates...

Building a new DB, current time: 11/18/2021 15:56:02
New DB name: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/1903c118_23m-3_proteins.faa
New DB title: 4-DAStool/1903c118_23m-3_proteins.faa
Sequence type: Protein
Deleted existing Protein BLAST database named /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/4-DAStool/1903c118_23m-3_proteins.faa
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 545 sequences in 2.11301 seconds.

verifying selected SCGs...
Warning: [blastp] Examining 5 or more matches is recommended
starting annotations of single copy cogs...
successfully finished
no single copy genes found. Aborting

dportik commented 2 years ago

@nvpatin Thanks for sending these, it is very helpful!

One thing that jumped out is:

Warning: scaffolds2bin file is empty: 4-DAStool/1903c118_23m-3.full.tsv

This file should be full of bins if 1) there was a sufficiently good assembly, and 2) metabat2 ran correctly.

I have a few follow up questions to pin down what might be happening.

If there are no bins that have been created, the problem may be with the sample and/or the assembly.

nvpatin commented 2 years ago

  • Can you verify that 4-DAStool/1903c118_23m-3.full.tsv is in fact empty?

Yes, it is.

  • Are there any bins (*.fa files) in 3-metabat-bins-full/1903c118_23m-3/?

There is one fasta file in that directory: 1903c118_23m-3.full.tsv

  • Can you verify if 4-DAStool/1903c118_23m-3.linear-circ.tsv is also an empty file?

That file has one line, as follows: s7.ctg000008c 1903c118_23m-3_bin.circ1

  • Are there any bins (*.fa files) in 3-metabat-bins-linear-circ/1903c118_23m-3/? If so, do they say circ or lin in the name?

There is one fasta file in that directory: 1903c118_23m-3_bin.circ1.fa

If there are no bins that have been created, the problem may be with the sample and/or the assembly.

  • Is this a metagenomic assembly? That is, does the sample come from gut microbiome, environmental sample (soil, seawater), etc?
  • Have you looked at the assembly graph for this sample? I am wondering if there are very few contigs, or if assembly failed.

This is a metagenomic assembly from seawater, and indeed, the assembly graph is very small. Compared to other samples this one did not sequence well and there are very few contigs/unitigs. I guess I could exclude the lowest-quality assemblies from the analysis, but it would be better if the workflow could keep going despite one or more samples that don't produce good bins (and include this information in the final results).

I hope that helps! Really appreciate the attention to this problem and the fast responses.

dportik commented 2 years ago

@nvpatin Thanks for the details - this makes sense. My guess is that there aren't enough high quality contigs to create meaningful bins, and the one circular contig found in your assembly (1903c118_23m-3_bin.circ1.fa) may be a plasmid or viral sequence.

My recommendation would be to exclude this sample from the analysis and see if the workflow completes. For the DAS_Tool step I would also strongly recommend switching to diamond, rather than using blast. Diamond was not causing the problem, and it will be much faster than blast.

It is unfortunate that one sample can interfere with the workflow. I have not been able to find a suitable solution for this type of situation in snakemake yet!
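
Until the workflow handles this itself, one workaround is to screen samples before launching the DAS_Tool step. The sketch below is only a heuristic under stated assumptions, not part of the pipeline: it relies on the `4-DAStool/SAMPLE.full.tsv` and `4-DAStool/SAMPLE.linear-circ.tsv` naming seen in the logs above, it sets aside any sample where either table is missing or empty (as happened for 1903c118_23m-3), and the function names `has_rows` and `partition_samples` are made up for illustration.

```python
from pathlib import Path


def has_rows(path):
    """Return True if the file exists and contains at least one non-blank line."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open() as handle:
        return any(line.strip() for line in handle)


def partition_samples(samples, dastool_dir="4-DAStool"):
    """Split sample names into (keep, skip).

    A sample is skipped when either contig-to-bin table is missing or empty.
    This is a heuristic: a non-empty table does not guarantee DAS_Tool succeeds.
    """
    keep, skip = [], []
    for sample in samples:
        tables = [
            Path(dastool_dir) / f"{sample}.{kind}.tsv"
            for kind in ("full", "linear-circ")
        ]
        if all(has_rows(table) for table in tables):
            keep.append(sample)
        else:
            skip.append(sample)
    return keep, skip
```

The `skip` list can then be removed from the sample list before rerunning, and reported separately so that failed samples are still recorded.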

nvpatin commented 2 years ago

Sounds good, thanks! This was helpful. I may have to exclude more than one sample from this workflow, we'll see.

CaroleBelliardo commented 2 years ago

I think DAS_Tool now finishes, because I get a different crash with this error message:

GTDBTK_DATA_PATH=/lerins/hub/DB/GTDB/release20211115 gtdbtk classify_wf --batchfile 6-checkm-summary/aubergine_R_oct__hifi_reads.batchfile.txt --out_dir 7-gtdb-individual/aubergine_R_oct__hifi_reads/ -x fa --prefix aubergine_R_oct__hifi_reads --cpus 60 --pplacer_cpus 60 --scratch_dir /lerins/hub/tmp &> logs/aubergine_R_oct__hifi_reads.RunGTDBTkIndividual.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job RunGTDBTkIndividual since they might be corrupted:
7-gtdb-individual/aubergine_R_oct__hifi_reads/align, 7-gtdb-individual/aubergine_R_oct__hifi_reads/classify, 7-gtdb-individual/aubergine_R_oct__hifi_reads/identify
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /lerins/hub/projects/25_IPN_Metag/HiFi-MAG-Pipeline/.snakemake/log/2021-11-18T161836.792367.snakemake.log

dportik commented 2 years ago

@CaroleBelliardo This appears to be an error related to GTDB-Tk, not DAS_Tool.

In general, to expedite troubleshooting it is most helpful to attach or paste the contents of the log file for the rule that failed. Showing the snakemake error is useful for seeing the rule that failed, but will not show the source of the problem within the rule.

You will need to copy and paste the entire contents of logs/aubergine_R_oct__hifi_reads.RunGTDBTkIndividual.log so we can identify what the problem is. I suspect that you may need to provide the trailing '/' on the input parameter /lerins/hub/DB/GTDB/release20211115 you specified in the config file (e.g., /lerins/hub/DB/GTDB/release20211115/), but let's look at the log file in detail first.
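The per-rule logs mentioned in this thread follow the pattern logs/SAMPLE.RULE.log, so a minimal Python sketch for pulling one up, and for normalizing the database path, might look like the following. The helper names are hypothetical and not part of the workflow, and whether the trailing slash actually matters here is still only a suspicion.

```python
from pathlib import Path


def rule_log_path(sample, rule, log_dir="logs"):
    """Build the path logs/SAMPLE.RULE.log (pattern taken from the logs quoted in this thread)."""
    return Path(log_dir) / f"{sample}.{rule}.log"


def read_rule_log(sample, rule, log_dir="logs"):
    """Return the full text of a rule log, or a short note if the file is missing."""
    path = rule_log_path(sample, rule, log_dir)
    if not path.is_file():
        return f"(no log found at {path})"
    return path.read_text(errors="replace")


def with_trailing_slash(directory):
    """Append '/' to a directory string unless it already ends with one."""
    return directory if directory.endswith("/") else directory + "/"


if __name__ == "__main__":
    # Sample and rule names are the ones from the failed job above.
    print(read_rule_log("aubergine_R_oct__hifi_reads", "RunGTDBTkIndividual"))
    print(with_trailing_slash("/lerins/hub/DB/GTDB/release20211115"))
```

Run from the HiFi-MAG-Pipeline directory, this prints the whole rule log so it can be pasted into the issue.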

nvpatin commented 2 years ago

Hi Daniel, I have been excluding, one by one, the samples that cause the workflow to fail, but now I have a new error. It seems to happen when the DASTool SAMPLE.linear-circ.tsv file has only one line, with the scaffold name 'dummyseq'. Obviously this does not match any real contig from the assembly. I assume this is another case where no high-quality bins were generated; will I need to exclude all samples that only include a 'dummyseq' contig in the linear-circ.tsv file? Full error log is below:

Warning: scaffolds2bin file is empty: 4-DAStool/1903c122_28m-3.full.tsv Running DAS Tool using 24 threads. predicting genes using Prodigal V2.6.3: February, 2016 identifying single copy genes using diamond version 2.0.13 database name of all proteins is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.all.faa database name of SCGs is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.scg.faa database lookup is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.scg.lookup diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 80

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Database input file: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/bac.all.faa Opening the database file... [0s] Loading sequences... [0.061s] Masking sequences... [0.062s] Writing sequences... [0.016s] Hashing sequences... [0.003s] Loading sequences... [0s] Writing trailer... [0.003s] Closing the input file... [0s] Closing the database file... [0.387s]

Database sequences 48878 Database letters 15503709 Database hash fcfdb67532286a4d19198f47da75a9ec Total time 0.536000s finding SCG candidates... diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 80

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Database input file: 4-DAStool/1903c122_28m-3_proteins.faa Opening the database file... [0s] Loading sequences... [0.003s] Masking sequences... [0.026s] Writing sequences... [0s] Hashing sequences... [0s] Loading sequences... [0s] Writing trailer... [0s] Closing the input file... [0s] Closing the database file... [0.001s]

Database sequences 2223 Database letters 585815 Database hash d66781aefbe1849aed98bd53751c2451 Total time 0.033000s diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 24

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Temporary directory: 4-DAStool

Target sequences to report alignments for: unlimited

Opening the database... [0.001s] Database: 4-DAStool/1903c122_28m-3_proteins.faa.dmnd (type: Diamond database, sequences: 2223, letters: 585815) Block size = 2000000000 Opening the input file... [0s] Opening the output file... [0s] Loading query sequences... [0.001s] Masking queries... [0.019s] Algorithm: Double-indexed Building query histograms... [0.007s] Allocating buffers... [0s] Loading reference sequences... [0.001s] Masking reference... [0.012s] Initializing temporary storage... [0.005s] Building reference histograms... [0.008s] Allocating buffers... [0s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 1/4. Building reference seed array... [0.012s] Building query seed array... [0.009s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.009s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 2/4. Building reference seed array... [0.011s] Building query seed array... [0.01s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 3/4. Building reference seed array... [0.011s] Building query seed array... [0.01s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.004s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 4/4. Building reference seed array... [0.011s] Building query seed array... [0.011s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 1/4. Building reference seed array... [0.011s] Building query seed array... [0.01s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.004s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 2/4. Building reference seed array... 
[0.012s] Building query seed array... [0.011s] Computing hash join... [0.003s] Masking low complexity seeds... [0.005s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 3/4. Building reference seed array... [0.011s] Building query seed array... [0.011s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.004s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 4/4. Building reference seed array... [0.012s] Building query seed array... [0.011s] Computing hash join... [0.004s] Masking low complexity seeds... [0.004s] Searching alignments... [0.003s] Deallocating buffers... [0s] Clearing query masking... [0s] Computing alignments... [0.012s] Deallocating reference... [0s] Loading reference sequences... [0s] Deallocating buffers... [0s] Deallocating queries... [0s] Loading query sequences... [0s] Closing the input file... [0s] Closing the output file... [0s] Cleaning up... [0s] Total time = 0.463s Reported 0 pairwise alignments, 0 HSPs. 0 queries aligned. verifying selected SCGs... diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 24

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Temporary directory: 4-DAStool

Target sequences to report alignments for: 1

Opening the database... [0.011s] Database: 4-DAStool/all_prot.dmnd (type: Diamond database, sequences: 48878, letters: 15503709) Block size = 2000000000 Opening the input file... [0s] Error: Error detecting input file format. First line seems to be blank. verifying blast did not work mv: cannot stat ‘4-DAStool/1903c122_28m-3_proteins.faa.scg’: No such file or directory database name of all proteins is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.all.faa database name of SCGs is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.scg.faa database lookup is /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.scg.lookup diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 80

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Database input file: /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/share/das_tool-1.1.3-0/db/arc.all.faa Opening the database file... [0.005s] Loading sequences... [0.245s] Masking sequences... [0.127s] Writing sequences... [0.066s] Hashing sequences... [0.016s] Loading sequences... [0s] Writing trailer... [0.004s] Closing the input file... [0s] Closing the database file... [0.002s]

Database sequences 198238 Database letters 62668793 Database hash 8f291b6761ce1a4a14fd7b172c40fe15 Total time 0.468000s finding SCG candidates... diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 80

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Database input file: 4-DAStool/1903c122_28m-3_proteins.faa Opening the database file... [0s] Loading sequences... [0.004s] Masking sequences... [0.027s] Writing sequences... [0s] Hashing sequences... [0s] Loading sequences... [0s] Writing trailer... [0s] Closing the input file... [0s] Closing the database file... [0.001s]

Database sequences 2223 Database letters 585815 Database hash d66781aefbe1849aed98bd53751c2451 Total time 0.035000s diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 24

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Temporary directory: 4-DAStool

Target sequences to report alignments for: unlimited

Opening the database... [0.001s] Database: 4-DAStool/1903c122_28m-3_proteins.faa.dmnd (type: Diamond database, sequences: 2223, letters: 585815) Block size = 2000000000 Opening the input file... [0.001s] Opening the output file... [0s] Loading query sequences... [0.005s] Masking queries... [0.014s] Algorithm: Double-indexed Building query histograms... [0.009s] Allocating buffers... [0s] Loading reference sequences... [0.001s] Masking reference... [0.004s] Initializing temporary storage... [0.029s] Building reference histograms... [0.009s] Allocating buffers... [0s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 1/4. Building reference seed array... [0.012s] Building query seed array... [0.006s] Computing hash join... [0.004s] Masking low complexity seeds... [0.004s] Searching alignments... [0.004s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 2/4. Building reference seed array... [0.011s] Building query seed array... [0.007s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 3/4. Building reference seed array... [0.011s] Building query seed array... [0.006s] Computing hash join... [0.002s] Masking low complexity seeds... [0.004s] Searching alignments... [0.004s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 4/4. Building reference seed array... [0.011s] Building query seed array... [0.006s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 1/4. Building reference seed array... [0.011s] Building query seed array... [0.006s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.004s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 2/4. Building reference seed array... 
[0.011s] Building query seed array... [0.007s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 3/4. Building reference seed array... [0.011s] Building query seed array... [0.006s] Computing hash join... [0.003s] Masking low complexity seeds... [0.005s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 4/4. Building reference seed array... [0.01s] Building query seed array... [0.006s] Computing hash join... [0.003s] Masking low complexity seeds... [0.004s] Searching alignments... [0.003s] Deallocating buffers... [0s] Clearing query masking... [0s] Computing alignments... [0.03s] Deallocating reference... [0s] Loading reference sequences... [0s] Deallocating buffers... [0s] Deallocating queries... [0s] Loading query sequences... [0s] Closing the input file... [0s] Closing the output file... [0s] Cleaning up... [0s] Total time = 0.46s Reported 196 pairwise alignments, 196 HSPs. 129 queries aligned. verifying selected SCGs... diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 24

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Temporary directory: 4-DAStool

Target sequences to report alignments for: 1

Opening the database... [0.054s] Database: 4-DAStool/all_prot.dmnd (type: Diamond database, sequences: 198238, letters: 62668793) Block size = 2000000000 Opening the input file... [0s] Opening the output file... [0s] Loading query sequences... [0s] Masking queries... [0.005s] Algorithm: Double-indexed Building query histograms... [0.006s] Allocating buffers... [0s] Loading reference sequences... [0.109s] Masking reference... [0.175s] Initializing temporary storage... [0.006s] Building reference histograms... [0.155s] Allocating buffers... [0s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 1/4. Building reference seed array... [0.047s] Building query seed array... [0.008s] Computing hash join... [0.004s] Masking low complexity seeds... [0.002s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 2/4. Building reference seed array... [0.053s] Building query seed array... [0.008s] Computing hash join... [0.004s] Masking low complexity seeds... [0.002s] Searching alignments... [0.002s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 3/4. Building reference seed array... [0.057s] Building query seed array... [0.008s] Computing hash join... [0.004s] Masking low complexity seeds... [0.002s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 1/2, index chunk 4/4. Building reference seed array... [0.042s] Building query seed array... [0.008s] Computing hash join... [0.004s] Masking low complexity seeds... [0.002s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 1/4. Building reference seed array... [0.042s] Building query seed array... [0.008s] Computing hash join... [0.004s] Masking low complexity seeds... [0.002s] Searching alignments... [0.003s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 2/4. Building reference seed array... 
[0.053s] Building query seed array... [0.008s] Computing hash join... [0.004s] Masking low complexity seeds... [0.002s] Searching alignments... [0.002s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 3/4. Building reference seed array... [0.057s] Building query seed array... [0.008s] Computing hash join... [0.004s] Masking low complexity seeds... [0.002s] Searching alignments... [0.002s] Processing query block 1, reference block 1/1, shape 2/2, index chunk 4/4. Building reference seed array... [0.041s] Building query seed array... [0.008s] Computing hash join... [0.004s] Masking low complexity seeds... [0.002s] Searching alignments... [0.002s] Deallocating buffers... [0s] Clearing query masking... [0s] Computing alignments... [0.013s] Deallocating reference... [0.001s] Loading reference sequences... [0s] Deallocating buffers... [0s] Deallocating queries... [0s] Loading query sequences... [0s] Closing the input file... [0s] Closing the output file... [0.001s] Cleaning up... [0s] Total time = 1.181s Reported 2 pairwise alignments, 2 HSPs. 2 queries aligned. starting annotations of single copy cogs... successfully finished /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/bin/DAS_Tool: line 428: 4-DAStool/1903c122_28m-3_proteins.faa.bacteria.scg: No such file or directory /work/hpc/users/nvp29/pb-metagenomics-tools/HiFi-MAG-Pipeline/.snakemake/conda/80640037491ed72c53fdf0fc4d22b33c/bin/DAS_Tool: line 428: [: -eq: unary operator expected calculating contig lengths. ERROR: Scaffold names of4-DAStool/1903c122_28m-3.linear-circ.tsvdo not match assembly headers: Format of 1903c122_28m-3.contigs.fasta: s0.ctg000001l s1.ctg000002l s10.ctg000011l s100.ctg000101l s101.ctg000102l s102.ctg000103l Format of 1903c122_28m-3.linear-circ.tsv: dummyseq

dportik commented 2 years ago

@nvpatin Yes, it looks like that sample also does not have any high-quality contigs.

I just updated the workflow again, this time with a step (rule CheckForBins) that will check whether or not there are bins present for the sample before attempting to run DAS_Tool. It will cause the workflow to stop with an error, but the log file called logs/SAMPLE.CheckForBins.log will have details on why the sample failed. If there are no bins created but there are circular contigs present, the sample could move on to DAS_Tool and fail in the same way as your first sample if the contigs are not bacteria/archaea. So, low quality assemblies are typically expected to fail at step CheckForBins, but sometimes they can fail at step RunDAStool.

I think this new step will be helpful in distinguishing between low quality assemblies and errors that are specific to DAS_Tool. To run it, you'll need to download the new snakemake file as well as the scripts folder (several of them have been updated). Hopefully this can help you determine which samples to exclude faster.
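To find candidate samples to exclude before launching a run, something like the sketch below could scan the DAS_Tool input tables directly. It is not the CheckForBins rule and does not reproduce its logic; it only assumes what was reported above: the tables live in 4-DAStool/ as SAMPLE.linear-circ.tsv and SAMPLE.full.tsv, they are tab-separated with the scaffold name in the first column, and a sample with no usable bins shows an empty table or a single 'dummyseq' entry. The function names are hypothetical.

```python
from pathlib import Path

PLACEHOLDER = "dummyseq"  # placeholder scaffold name reported in this thread


def real_scaffolds(tsv_path):
    """Return first-column scaffold names from a table, skipping blanks and the placeholder."""
    path = Path(tsv_path)
    if not path.is_file():
        return []
    names = []
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        name = line.split("\t")[0].strip()
        if name != PLACEHOLDER:
            names.append(name)
    return names


def samples_without_bins(dastool_dir="4-DAStool"):
    """List samples whose linear-circ and full tables both contain no real scaffolds."""
    suffix = ".linear-circ.tsv"
    flagged = []
    for lincirc in sorted(Path(dastool_dir).glob("*" + suffix)):
        sample = lincirc.name[: -len(suffix)]
        full = lincirc.with_name(sample + ".full.tsv")
        if not real_scaffolds(lincirc) and not real_scaffolds(full):
            flagged.append(sample)
    return flagged


if __name__ == "__main__":
    for sample in samples_without_bins():
        print(sample)
```

A sample that has a circular contig but no bacterial or archaeal single-copy genes would not be flagged by this check and can still fail inside DAS_Tool, as in the first case discussed above.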

dportik commented 2 years ago

@CaroleBelliardo is there any update on your issue?