Error at read-correction step #1

Closed fahadkhokhar closed 4 years ago

fahadkhokhar commented 4 years ago

Running the script with the provided test data fails at the read-correction stage:

Starting command on Wed May 20 12:05:03 2020 with 39.037 GB free disk space

  cd /home/ubuntu/bin/NanoCLUST/work/0c/cdad9cbbc33ab4512480bf701f33cb
  sbatch \
    --cpus-per-task=1 \
    --mem-per-cpu=4g   \
    -D `pwd` \
    -J 'canu_corrected_reads' \
    -o canu-scripts/canu.01.out  canu-scripts/

Finished on Wed May 20 12:05:03 2020 (like a bat out of hell) with 39.037 GB free disk space

gzip: corrected_reads.correctedReads.fasta.gz: No such file or directory

genomicsITER commented 4 years ago

Hi! Thank you for contacting us.

We've updated some Docker images on Docker Hub and also the repository Dockerfiles under conda_envs/ (including the read_correction module) due to problems with the environment path.

The pipeline has now been tested on Ubuntu 18.04 using both conda (v4.8.3) and docker (v19.03.9) with Nextflow v20.01.0 and the test profile. If the problem persists, feel free to contact us again and include the command you executed and any information about the configuration used.

nextflow run -profile test,conda

genomicsITER commented 4 years ago

A new push has been made with the latest updates.

fahadkhokhar commented 4 years ago

Thanks for your reply.

Still having the same issue though - I ran the command for the test data:

nextflow run -profile test,conda

On the first occasion it generated the .fasta.gz file but still gave the same error. I have just run it again and got the same error; this time no file was generated.

genomicsITER commented 4 years ago

If the nextflow, python/pip and conda configuration is OK, it seems that the read_clustering conda environment is not working properly.

Try removing this env (under the work/conda directory) and running the pipeline again to reinstall the environment and retry the process. If that doesn't work, running the pipeline with '-profile test,docker' will automatically use Docker images pulled from Docker Hub that are also tested. Please let us know if the problem persists with the conda and docker profiles.

DavidFY-Hub commented 4 years ago

Can this pipeline run on macOS?

genomicsITER commented 4 years ago

Nextflow, conda and docker are all compatible with macOS. We haven't tested on a Mac, but it may be possible to run the pipeline with the docker profile to avoid compatibility errors.

We've updated the pipeline and now include exact version tags in the environment.yml files used for the conda envs. This should fix some errors with conda environments that arise on some machines. The docker images also include the correct versions of the environments.

fahadkhokhar commented 4 years ago

Having more luck with the docker option, which ran fine with the test data. However, there is still a problem at read correction with my own dataset, even with the docker option:

Error executing process > 'read_correction (15)'

Caused by: Process read_correction (15) terminated with an error exit status (1)

Command executed:

  head -n$(( null*4 )) 20.fastq > subset.fastq
  canu -correct -p corrected_reads -nanopore-raw subset.fastq genomeSize=1.5k stopOnLowCoverage=1 minInputCoverage=2
  gunzip corrected_reads.correctedReads.fasta.gz
  READ_COUNT=$(( $(awk '{print $1/2}' <(wc -l corrected_reads.correctedReads.fasta)) ))
  cat 20.log > 20_racon.log
  echo -n ";null;$READ_COUNT;" >> 20_racon.log && cp 20_racon.log 20racon.log

Command exit status: 1

Command output: (empty)

Command error: line 2: null: unbound variable

Work dir: /home/ubuntu/bin/NanoCLUST/work/92/dbe622ba2c7fd61f3835b2b7a93174

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named

Apologies if this is an obvious error on my part!

fahadkhokhar commented 4 years ago

Also, the test data run fails at the classification step when specifying --db and --tax, as I had originally downloaded these to a separate volume. I haven't got to this step with my own data yet.

Is there an option to change the working directory?

genomicsITER commented 4 years ago

The first problem you report is due to a typo in the assignment of the default value of --polishing_reads when it is not set in the command. You can check in conf/test.conf that this value is set to 20 when using the test profile. We've updated the pipeline to fix the typo and to set a default value of 100 for --polishing_reads when no profile confs are used at all. If you are running the pipeline with your own data, we strongly recommend manually setting the --polishing_reads and --min_cluster_size parameters in order to compare pipeline outputs, especially at low taxonomic levels such as species.
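The error reported above can be reproduced outside the pipeline. This is a minimal sketch, assuming the process script runs with bash's nounset option (which the "unbound variable" message suggests) and that an unset parameter is rendered as the literal text null. The function name subset_line_count and the way the fallback is applied are illustrative only, not the pipeline's actual code.

```bash
#!/usr/bin/env bash
# Sketch only: shows how a default value avoids the failure seen with
# "$(( null*4 ))" under `set -u`, where bash treats `null` as the name of
# an unset variable. Names here are hypothetical, not NanoCLUST code.
set -u

# Number of FASTQ lines to keep for N reads (4 lines per read).
# Falls back to 100 reads when no value, or the literal "null", is given.
subset_line_count() {
  local reads="${1:-100}"
  if [ "$reads" = "null" ]; then
    reads=100
  fi
  echo $(( reads * 4 ))
}

subset_line_count 20     # prints 80
subset_line_count null   # prints 400 (default of 100 reads)
```

Passing --polishing_reads explicitly on the command line sidesteps the problem entirely, which is why setting it by hand is the safer option for your own data.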

For the db and taxdb parameters, you should write the full path in double quotes: --db "/home/nanoclust_vm/NanoCLUST/db/16S_ribosomal_RNA" --tax "/home/nanoclust_vm/NanoCLUST/db/taxdb/"

According to the Nextflow documentation, you can use -w to set the working directory:

nextflow run <script> -w /some/scratch/dir

Thank you for your time and feedback! We've also modified the documentation to make those issues with paths and parameters clearer to users. Hope you can run NanoCLUST with no issues using your own data.

DavidFY-Hub commented 4 years ago

Yes, docker works fine when I run the test.

When I use my own data, I get the same problem as fahadkhokhar.

I will try it right now.

DavidFY-Hub commented 4 years ago


thank you

genomicsITER commented 4 years ago


Are you running the pipeline on macOS? Try running it using the docker profile and the test data:

nextflow run -profile test,docker

According to the Nextflow documentation:

If you are running Docker on Mac OSX make sure you are mounting your local /Users directory into the Docker VM as explained in this excellent tutorial: How to use Docker on OSX.

PS: We updated the pipeline to avoid the read_correction problem that fahadkhokhar had when using their own data.

DavidFY-Hub commented 4 years ago

Using

--polishing_reads 60 --min_cluster_size 50

the problem is solved, thank you.

fahadkhokhar commented 4 years ago

Many thanks for the reply. I can now proceed to the classification step, but there is an error in the classification with both the test data and my own data set, even without specifying the --db or --tax paths with the test data:

Error executing process > 'consensus_classification (1)'

Caused by: Process consensus_classification (1) terminated with an error exit status (2)

Command error: BLAST Database error: No alias or index file found for nucleotide database

genomicsITER commented 4 years ago

Hi. I'm not sure what's happening with the classification. The pipeline is working for me on a clean Ubuntu 18 VM with the minimum dependencies, after just downloading the db using these exact commands inside the NanoCLUST dir:

mkdir db db/taxdb
wget && tar -xzvf 16S_ribosomal_RNA.tar.gz -C db
wget && tar -xzvf taxdb.tar.gz -C db/taxdb

After that you should have the right directory tree with the db and the taxonomy. Then I manually set those in the command specifying:

--db "/home/nanoclust_vm/NanoCLUST/db/16S_ribosomal_RNA" --tax "/home/nanoclust_vm/NanoCLUST/db/taxdb/"

It seems that you may have downloaded the db in a different way (resulting in a different dir structure) or to a location other than the NanoCLUST dir?
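One way to check the directory structure before running the pipeline is to look for the files BLAST needs next to the database prefix. This is only a sketch: the "No alias or index file found" error refers to the alias (.nal) and index (.nin) files of a nucleotide database, and the function name has_blast_nucl_db and the default path are hypothetical, so adjust the prefix to wherever you extracted the archive.

```bash
#!/usr/bin/env bash
# Sketch only: checks whether a BLAST nucleotide database prefix has the
# alias (.nal) or index (.nin) file that BLAST looks for.
# The function name and the default path below are hypothetical.

# Usage: has_blast_nucl_db /path/to/db/16S_ribosomal_RNA
# Returns 0 if an alias or index file exists for the prefix, 1 otherwise.
has_blast_nucl_db() {
  local prefix="$1"
  local f
  for f in "$prefix.nal" "$prefix.nin"; do
    if [ -f "$f" ]; then
      return 0
    fi
  done
  return 1
}

# Check the prefix given as first argument, or a default relative path.
if has_blast_nucl_db "${1:-db/16S_ribosomal_RNA}"; then
  echo "database files found"
else
  echo "no .nal or .nin file found for this prefix"
fi
```

Note that --db expects the database prefix (the file name without extension), not the directory that contains the files.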

I will try using BLAST databases in different systems and paths to make it more flexible.

Thanks again

DavidFY-Hub commented 4 years ago

Hello, I want to compare samples to identify the different species and to run alpha and beta analysis.

Where can I get an abundance table like "otutab.txt" (not the rel_abundance one)?

genomicsITER commented 4 years ago

Hi, HaiyangDu. At this time, we do not yet have an option to produce an OTU table like the one the otutab command generates. However, the nanoclust_out.txt file includes the number of reads assigned to each taxonomic ID, so it may not be hard to build an otutab.txt file from it for alpha and beta analysis.
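As a rough starting point, the read counts per taxonomic ID can be summed with awk. This is a sketch under assumptions: the delimiter and the column positions of the taxonomic ID and read count are NOT taken from the NanoCLUST documentation, so check the header of your own nanoclust_out.txt and pass the right values. The function name taxid_counts is made up for this example.

```bash
#!/usr/bin/env bash
# Sketch only: sums read counts per taxonomic ID from a delimited table
# with one header line. The delimiter and column numbers are ASSUMPTIONS
# supplied by the caller; inspect your own file before using this.

# Usage: taxid_counts FILE DELIM TAXID_COL READS_COL
# Prints "taxid<TAB>total_reads", one line per taxonomic ID, sorted.
taxid_counts() {
  local file="$1" delim="$2" tax_col="$3" reads_col="$4"
  awk -F "$delim" -v t="$tax_col" -v r="$reads_col" '
    NR > 1 { counts[$t] += $r }    # skip the header, accumulate per ID
    END    { for (id in counts) printf "%s\t%d\n", id, counts[id] }
  ' "$file" | sort
}
```

For example, `taxid_counts sample.nanoclust_out.txt ';' 7 2` would sum column 2 per value of column 7 in a semicolon-separated file. Running it once per sample and joining the results on the taxonomic ID would give a count table across samples.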

We will work on an option to get the exact otutab file to make it easier for users to use NanoCLUST output in downstream analyses that require that file structure.

Thank you for your time and feedback.

DavidFY-Hub commented 4 years ago

OK, thanks for your reply.

DavidFY-Hub commented 4 years ago

Hi, the problem is that running the pipeline with 1 sample works perfectly, but my data has 50 samples, and it always fails with an error when I run the 50 samples with the parameter "--reads 'my path/*.fastq'".