Hydro3639 / NanoPhase

Reference-quality genome reconstruction from complex metagenomes (or bacterial isolates) using only Nanopore long reads or both long and short reads (hybrid strategy)
MIT License

Initial Binning fails #7

Open winterlich opened 1 year ago

winterlich commented 1 year ago

Hi there, I just tried NanoPhase, both with one of my datasets and with the example dataset. The assembly using flye --meta works fine, but the pipeline keeps terminating at the initial binning step. The MetaBat2 log file shows only this:

MetaBAT 2 (v2.12.1) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.
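Since the MetaBAT 2 log line reports minContig 2500, one quick sanity check is to count how many contigs in the assembly reach that length. The following is only a sketch: the function name `count_long_contigs` is made up for illustration, and it assumes the assembly is an uncompressed FASTA file such as the assembly.fasta that Flye writes.

```shell
# Count contigs in a FASTA file whose length is at least a minimum number of bases.
# Usage: count_long_contigs <assembly.fasta> [min_len]
# Hypothetical helper; the default of 2500 mirrors the minContig value in the log.
count_long_contigs() {
    awk -v min="${2:-2500}" '
        # A header line closes the previous record and starts a new one.
        /^>/ { if (seen && len >= min) n++; len = 0; seen = 1; next }
        # Sequence lines may be wrapped, so lengths are summed per record.
             { len += length($0) }
        END  { if (seen && len >= min) n++; print n + 0 }
    ' "$1"
}
```

For example, `count_long_contigs ont-nanophase-out/01-LongAssemblies/assembly.fasta` prints a single number. A very small count would suggest the binner has little to work with, although it would not by itself explain a crash.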

I tried versions 0.2.2 and 0.2.3, but neither worked with my datasets or the example dataset.

The NanoPhase check shows this information:

Check software availability and locations
The following packages have been found

package      location
flye         /home/xxx/anaconda3/envs/nanophase0.2.2/bin/flye
metabat2     /home/xxx/anaconda3/envs/nanophase0.2.2/bin/metabat2
maxbin2      /home/xxx/anaconda3/envs/nanophase0.2.2/bin/run_MaxBin.pl
metawrap     /home/xxx/anaconda3/envs/nanophase0.2.2/bin/metawrap
checkm       /home/xxx/anaconda3/envs/nanophase0.2.2/bin/checkm
racon        /home/xxx/anaconda3/envs/nanophase0.2.2/bin/racon
medaka       /home/xxx/anaconda3/envs/nanophase0.2.2/bin/medaka
polypolish   /home/xxx/anaconda3/envs/nanophase0.2.2/bin/polypolish
POLCA        /home/xxx/anaconda3/envs/nanophase0.2.2/bin/polca.sh
bwa          /home/xxx/anaconda3/envs/nanophase0.2.2/bin/bwa
seqtk        /home/xxx/anaconda3/envs/nanophase0.2.2/bin/seqtk
minimap2     /home/xxx/software/ont-guppy/bin/minimap2
BBMap        /home/xxx/anaconda3/envs/nanophase0.2.2/bin/BBMap
parallel     /home/xxx/anaconda3/envs/nanophase0.2.2/bin/parallel
perl         /home/xxx/anaconda3/envs/nanophase0.2.2/bin/perl
samtools     /home/xxx/anaconda3/envs/nanophase0.2.2/bin/samtools
gtdbtk       /home/xxx/anaconda3/envs/nanophase0.2.2/bin/gtdbtk
fastANI      /home/xxx/anaconda3/envs/nanophase0.2.2/bin/fastANI
blastp       /home/xxx/anaconda3/envs/nanophase0.2.2/bin/blastp

All required packages have been found in the environment. If the above certain packages integrated into nanophase were used in your investigation, please give them credit as well :)
grep: warning: stray \ before /
Warning: [flye metabat2 maxbin2 metawrap checkm racon medaka polypolish POLCA bwa seqtk BBMap parallel perl samtools gtdbtk fastANI blastp minimap2] has not been installed in the [nanophase] env. Strongly recommend intalling all packages in the nanophase env, or it may result in a failure

This message is confusing, since the required packages are installed and found, but the pipeline keeps warning about missing software.

Anyway, I would love to test your pipeline. Please let me know if I can provide any additional information for this issue.

Hydro3639 commented 1 year ago

Hi, could you provide the command that you used?

winterlich commented 1 year ago

Sure. For the example dataset, I used this command:

nanophase meta -l lr.fa.gz -t 24 -o ont-nanophase-out

For my own datasets, I used the same command with the input files and output folder changed.

Hydro3639 commented 1 year ago

I guess the confusing message you mentioned is due to an installation issue. The conda env appears to be named nanophase0.2.2, but, as far as I can see from the log file, you activated an env called nanophase (with a command like conda activate nanophase) while the nanophase command that was invoked lives in the nanophase0.2.2 env. These are only warning messages, so there is no need to worry about them.
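One way to see such a mismatch directly is to compare where each tool resolves on PATH with the prefix of the currently active env. This is a minimal sketch, not how nanophase performs its own check: the function name `tool_in_env` is hypothetical, and it relies on conda exporting CONDA_PREFIX when an env is activated.

```shell
# Report whether a tool found on PATH lives inside a given environment prefix.
# Usage: tool_in_env <tool> [prefix]   (prefix defaults to $CONDA_PREFIX)
# Hypothetical helper. Returns 0 if inside, 1 if outside, 2 if undecidable.
tool_in_env() {
    tool=$1
    prefix=${2:-${CONDA_PREFIX:-}}
    if [ -z "$prefix" ]; then
        echo "no prefix given and CONDA_PREFIX is unset"
        return 2
    fi
    # command -v prints the path the shell would execute for this name.
    path=$(command -v "$tool") || { echo "$tool: not found on PATH"; return 2; }
    case $path in
        "$prefix"/*) echo "$tool: inside env ($path)"; return 0 ;;
        *)           echo "$tool: OUTSIDE env ($path)"; return 1 ;;
    esac
}
```

Running `for t in flye metabat2 minimap2; do tool_in_env "$t"; done` after activating the env lists any tool picked up from elsewhere. In the check output quoted earlier, minimap2 resolves to an ont-guppy directory rather than the conda env, which is the kind of case this would flag.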

Before I can identify the potential issues, could you run the following command (after activating the nanophase env) to see exactly what happens during metabat binning:

metabat2 -t 16 -i ont-nanophase-out/01-LongAssemblies/assembly.fasta -o ont-nanophase-out/02-LongBins/INITIAL_BINNING/metabat2/metabat2-bins/bin -a ont-nanophase-out/02-LongBins/INITIAL_BINNING/metabat2/metabat2_abun.txt --cvExt

winterlich commented 1 year ago

Thanks for your answer. I ran the analysis again using the correct environments for nanophase 0.2.2 and nanophase 0.2.3, but still got the same error. The metabat2 command you suggested produces no additional output, only a "segmentation fault".
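When a command dies from a signal, the shell reports an exit status of 128 plus the signal number, so a segmentation fault (SIGSEGV, signal 11 on Linux) shows up as status 139. The sketch below decodes that status; `describe_exit` is a hypothetical name chosen for illustration.

```shell
# Translate a shell exit status into a short description.
# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
# Hypothetical helper for illustration.
describe_exit() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}
```

It could be called right after the failing command, for example `metabat2 ... ; describe_exit $?`. Redirecting stderr as well as stdout into the log (`> bin.log 2>&1`) may also capture messages that a stdout-only redirect would miss.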

Hydro3639 commented 1 year ago

This is strange; I can't reproduce the error using the example dataset. I would suggest removing the whole nanophase 0.2.2 package and re-installing it to see whether that resolves the problem.

aljazdzy commented 1 year ago

I am having this exact same issue, with the exact same results as in this thread. Winterlich, did you ever solve the problem?

winterlich commented 1 year ago

Okay, that is interesting. I reinstalled the package as suggested, but this did not resolve the problem. I haven't been able to dive deeper into this so far, but I am happy for any suggestions.

aljazdzy commented 1 year ago

Hmm, what is the general size of your reads? Mine are admittedly kind of small for nanopore, and it's possible that flye is filtering out so many that metabat2 does not have enough information to work with.
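To put numbers on "kind of small", a rough read-length summary can be computed with standard tools. This is a minimal sketch under stated assumptions: `fastq_length_stats` is a hypothetical name, the input is uncompressed FASTQ on stdin, and every record is exactly four lines (no wrapped sequences).

```shell
# Print read count, total bases, mean length and N50 for FASTQ on stdin.
# Hypothetical helper; assumes strict 4-line FASTQ records.
fastq_length_stats() {
    # Line 2 of every 4-line record is the sequence.
    awk 'NR % 4 == 2 { print length($0) }' |
    sort -rn |
    awk '
        { len[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "reads=0 bases=0 mean=0 n50=0"; exit }
            # N50: length of the read at which the running sum of lengths
            # (longest first) reaches half of all bases.
            half = total / 2
            for (i = 1; i <= NR; i++) {
                acc += len[i]
                if (acc >= half) { n50 = len[i]; break }
            }
            printf "reads=%d bases=%.0f mean=%d n50=%d\n", NR, total, total / NR, n50
        }'
}
```

Usage would look like `fastq_length_stats < reads.fastq`, or `zcat reads.fastq.gz | fastq_length_stats` for gzipped input, where reads.fastq stands in for your own file.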

winterlich commented 1 year ago

That's a good idea; my read sets are also rather small. I will try another, larger dataset in the next few days and report back.

Hydro3639 commented 1 year ago

Thank you both for your contributions!

If only a small long-read dataset is provided, genome binning will be quite challenging. If you want to try nanophase with a long-read dataset, we have sequenced a mock community using nanopore sequencing (you can find more details about the mock community in our paper) and uploaded it to NCBI. The dataset can be downloaded with the following command (you may need to install sra-tools):

fastq-dump SRR17913199

Please don't hesitate to let me know if I can help.

Best

aljazdzy commented 1 year ago

So I ran it using the provided practice data set from your setup page and this was my result:

All required packages have been found in the environment. If the above certain packages integrated into nanophase were used in your investigation, please give them credit as well :)
[2023-06-14 12:08:04] TASK: Long-read assembly starts (be patient)
[2023-06-14 12:16:40] DONE: long-read assembly finished sucessfully: detailed log file is miniconda3envsNanophasedir/01-LongAssemblies/flye.log
[2023-06-14 12:16:40] TASK: Initial binning::metabat2 binning starts
/root/miniconda3/envs/nanophase-v0.2.2/bin/nanophase.meta: line 245: 3997 Segmentation fault metabat2 -t $N_threads -i $OutDIR/01-LongAssemblies/assembly.fasta -o $OutDIR/02-LongBins/INITIAL_BINNING/metabat2/metabat2-bins/bin -a $OutDIR/02-LongBins/INITIAL_BINNING/metabat2/metabat2_abun.txt --cvExt > $OutDIR/02-LongBins/INITIAL_BINNING/metabat2/bin.log
[2023-06-14 12:16:40] ERROR: Something wrong with metabat2 binning, please also check miniconda3envsNanophasedir/02-LongBins/INITIAL_BINNING/metabat2/bin.log, terminating...

So I would guess that the issue goes beyond just the data we are providing, though I am not yet sure what it is. I'm not running this on the world's most powerful computer either; is it possible I'm hitting a CPU bottleneck? I'm running it on a laptop with an i7 1360p (12 cores, 5 GHz) and 32 GB of RAM. The RAM is definitely not the bottleneck, but I'm noticing my CPU hits 100% utilization during this run.

Hydro3639 commented 1 year ago

Did you mean the lr.fa.gz in the Example dataset?

aljazdzy commented 1 year ago

Yes! Is there a better one I should run?

Hydro3639 commented 1 year ago

I am still unsure what happened. If you use the provided lr.fa.gz, I would expect the command to exit at the semibin stage rather than at metabat2. If you want to try v0.2.3, you can download the long-read dataset SRR17913199, as I mentioned earlier. Is it possible for you to run it on a Linux workstation?

aljazdzy commented 1 year ago

I actually ran this on Ubuntu under the Windows Subsystem for Linux; I don't have a workstation. I originally did this on v0.2.3 and got the same output with my data, but I didn't try it with the practice data. I can try it with the specific long-read dataset as well.

aljazdzy commented 1 year ago

OK, so I ran the specified dataset on v0.2.3 and this was my output:

(nanophase) root@Andrew:~/miniconda3/envs/nanophase# nanophase meta -l SRR17913199.fastq -t 16 -o Practice
[2023-06-21 13:26:03] INFO: nanophase (meta) starts
[2023-06-21 13:26:03] INFO: Command line: /root/miniconda3/envs/nanophase/bin/nanophase meta -l SRR17913199.fastq -t 16 -o Practice
[2023-06-21 13:26:03] INFO: long_read_only model was selected, only Nanopore long reads will be used
[2023-06-21 13:26:03] CHECK: Nanopore long-read (fastq) file has been found
[2023-06-21 13:26:03] CHECK: Check software availability and locations
[2023-06-21 13:26:03] INFO: The following packages have been found

package      location
nanophase    /root/miniconda3/envs/nanophase/bin/nanophase
flye         /root/miniconda3/envs/nanophase/bin/flye
metabat2     /root/miniconda3/envs/nanophase/bin/metabat2
maxbin2      /root/miniconda3/envs/nanophase/bin/run_MaxBin.pl
SemiBin      /root/miniconda3/envs/nanophase/bin/SemiBin
metawrap     /root/miniconda3/envs/nanophase/bin/metawrap
checkm       /root/miniconda3/envs/nanophase/bin/checkm
racon        /root/miniconda3/envs/nanophase/bin/racon
medaka       /root/miniconda3/envs/nanophase/bin/medaka
polypolish   /root/miniconda3/envs/nanophase/bin/polypolish
POLCA        /root/miniconda3/envs/nanophase/bin/polca.sh
bwa          /root/miniconda3/envs/nanophase/bin/bwa
seqtk        /root/miniconda3/envs/nanophase/bin/seqtk
minimap2     /root/miniconda3/envs/nanophase/bin/minimap2
BBMap        /root/miniconda3/envs/nanophase/bin/BBMap
parallel     /root/miniconda3/envs/nanophase/bin/parallel
perl         /root/miniconda3/envs/nanophase/bin/perl
samtools     /root/miniconda3/envs/nanophase/bin/samtools
gtdbtk       /root/miniconda3/envs/nanophase/bin/gtdbtk
fastANI      /root/miniconda3/envs/nanophase/bin/fastANI

All required packages have been found in the environment. If the above certain packages integrated into nanophase were used in your investigation, please give them credit as well :)
[2023-06-21 13:26:03] TASK: Long-read assembly starts (be patient)
[2023-06-21 13:28:49] ERROR: Something wrong with long-read (metaflye) assembly, please also check Practice/01-LongAssemblies/tmp/flye.log.debug for more information, terminating...

This is different from our previous outputs. I looked into flye.log.debug and it looks like my system just ran out of memory (oops), so I didn't get much data out of that attempt. Mayhaps I shall try again. The bin log shows contigs being created in my previous attempts with my own data, so I think flye might just be set to too high an overlap.
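Before retrying, it may help to check how much memory the Linux environment actually has available, since that can be less than the machine's installed RAM. The sketch below reads the MemAvailable field from /proc/meminfo, which reports values in kB; `mem_available_gib` is a hypothetical helper name, and the optional file argument exists only to make it easy to test.

```shell
# Print MemAvailable in GiB (integer, rounded down) from a meminfo-style file.
# Usage: mem_available_gib [file]   (file defaults to /proc/meminfo)
# Hypothetical helper; exits non-zero if the field is missing.
mem_available_gib() {
    awk '$1 == "MemAvailable:" { print int($2 / 1024 / 1024); found = 1 }
         END { if (!found) exit 1 }' "${1:-/proc/meminfo}"
}
```

Running `mem_available_gib` just before starting the pipeline gives a quick figure to compare against what the assembly step turns out to need. No particular threshold is implied here, since the memory Flye requires depends on the dataset.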