RVanDamme / MUFFIN

hybrid assembly and differential binning workflow for metagenomics, transcriptomics and pathway analysis
https://rvandamme.github.io/MUFFIN_Documentation/#introduction
GNU General Public License v3.0

Unicycler and minor change #7

Closed RemiMaglione closed 4 years ago

RemiMaglione commented 4 years ago

Hi RVanDamme team,

1- Question: does the Unicycler pipeline launch automatically with the --assembler metaspades option, or does it have its own optional parameter (the hybrid reassembly is marked as optional in the workflow)?

2- Minor change: in the installation guide, at the create env step, the command lacks the 'create' in conda create -y -p /path/to/install/metawrap-env python=2.7

Thanks for that pipeline. Best

RemiMaglione commented 4 years ago

Other minor changes:

1- Lots of errors from step 2. I got through this by deactivating some lines (with //) in MAFIN/main.nf, because the values were already declared:

line 432: // include './modules/sourmashgetdatabase'
line 433: // sourmash_download_db()
line 453: // include './modules/checkmsetupDB'
line 454: // include './modules/checkmgetdatabases'
line 456: // checkm_setup_db(checkm_download_db(), untar)

2- The --nanopore option suggested in the Installation section for the nanopore file(s) path doesn't work. Only --ont has worked for me so far.

[More going on]

replikation commented 4 years ago

Hello, thanks for writing an issue. can you tell us also the full command that you used? we will look into that :)

RemiMaglione commented 4 years ago

'eggnog_download_db' failed because download_eggnog_data.py was not found. I don't know if it's an issue on my side, but I downloaded and installed it in the metawrap-env with: conda install -c bioconda eggnog-mapper. I'm kind of new to conda env dynamics: even though the eggnog install went well and download_eggnog_data.py is now on my env path, main.nf keeps failing at that step. I had to download the eggnog db manually by running: nextflow run /path/to/my/MAFIN/modules/eggnog_get_databases.nf [which worked, proving that download_eggnog_data.py is accessible to this module but not to main.nf (???)]

RemiMaglione commented 4 years ago

Hello, thanks for writing an issue. can you tell us also the full command that you used? we will look into that :)

So far nextflow run /path/to/my/MAFIN/main.nf --output /path/to/my/output --assembler metaspades --illumina /path/to/my/illumina.fastq --ont /path/to/my/ont.fastq --core 60 --memory 500g -profile conda

[I'll continue to post every issue/solution I run into]

replikation commented 4 years ago

Yes, thank you very much. Sadly, conda is always tricky to set up and get working perfectly.

RemiMaglione commented 4 years ago

The checkm_setup_db step appears unusually long (I checked the code of both checkmgetdatabases.nf and checkmsetupDB.nf and it looks like a download and installation of the checkm database: it should not take more than 2 hours, or should it?). So I killed the main command and manually launched checkmgetdatabases with nextflow run checkmgetdatabases.nf, and now the pipeline moves on to the fastp step. [Question]: in the checkm.nf code I found the parameter ${task.cpus}, and compared with the main installation guide this is confusing. How do we pass "threads" in the main command:

  1. with --core (as suggested in the Usage section) or
  2. --cpus (as suggested in the Complete help and options section)?

RemiMaglione commented 4 years ago

The pipeline failed during the spades step, but it looks like the issue comes from checkm_setup_db, which failed to create a conda env:

executor > local (28)
[f3/f00282] process > sourmash_download_db [100%] 1 of 1 ✔
[23/8a24f1] process > checkm_download_db [100%] 1 of 1 ✔
[- ] process > checkm_setup_db -
[09/567fe2] process > discard_short (22) [100%] 22 of 22 ✔
[e4/f0da69] process > merge (1) [100%] 1 of 1 ✔
[0b/fef8ba] process > fastp (1) [100%] 1 of 1 ✔
[bf/42661e] process > spades (1) [100%] 1 of 1, failed: 1
[- ] process > minimap2 -
[- ] process > bwa -
[- ] process > metabat2 -
[- ] process > maxbin2 -
[- ] process > concoct -
[- ] process > refine3 -
[- ] process > checkm -
[- ] process > sourmash_bins -
[- ] process > sourmash_checkm_parser -
[24/db07e2] process > eggnog_download_db [100%] 1 of 1 ✔
[- ] process > eggnog_bin -
[- ] process > parser_bin -
Oops .. something went wrong
WARN: Killing pending tasks (1)
Error executing process > 'checkm_setup_db'
Caused by:
  Failed to create Conda environment
  command: conda create --mkdir --yes --quiet --prefix /path/to/my/output/nextflow-autodownload-databases/checkm/db/work/conda/env-f158ef0f26abfac27f08a061ab129d86 bioconda::checkm-genome
  status : 120
  message:

This is as far as I have got.

RVanDamme commented 4 years ago

Hello, first of all, could you tell me which version you are using: the master branch or the legacy 0.1?

Second, for the first 2 questions:

  1. the parameter to activate the re-assembly using unicycler is " --reassembly " (highly unstable for now)
  2. Thanks for pointing it out.

Third, for the other minor changes:

  1. Indeed, there is a redundancy that was left in because you can run only the second step without the first (which already loads the DBs).

  2. Fixed the readme.

  3. For the Eggnog download database issue, please open a new issue. it seems indeed that the main script can't access the file and download the database. I will work on that as soon as possible.

  4. in the command you use (nextflow run /path/to/my/MAFIN/main.nf --output /path/to/my/output --assembler metaspades --illumina /path/to/my/illumina.fastq --ont /path/to/my/ont.fastq --core 60 --memory 500g -profile conda) you should not specify the files themselves (illumina.fastq and ont.fastq) but the directories containing them. The files should also share the same "basename" (e.g. SR002_R1.fastq, SR002_R2.fastq for illumina and SR002.fastq for nanopore). This avoids further issues in the analysis (and a mismatch is probably one source of the spades error).

  5. Checkm indeed shouldn't take too long, I invite you to open another issue for checkm only to keep everything clear. Do you know which of the 2 processes (download and setup) was the one taking so long?

  6. The correct option is --cpus; --cores is, like --nanopore, a leftover from a development phase.
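
To make point 4 concrete, here is a minimal sketch of a check that could be run before launching the pipeline. It is not part of MUFFIN: `check_basenames` is a hypothetical helper, and it assumes uncompressed files named exactly as in the example above (`<name>_R1.fastq`, `<name>_R2.fastq` for illumina and `<name>.fastq` for nanopore).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MUFFIN): checks that every Illumina pair
# <name>_R1.fastq / <name>_R2.fastq has a matching <name>.fastq in the ONT directory.
# Assumes the naming used in the example above; compressed files are not handled.
check_basenames() {
    local illumina_dir="$1"
    local ont_dir="$2"
    local status=0
    local r1 name
    for r1 in "$illumina_dir"/*_R1.fastq; do
        # If the glob matched nothing, the shell leaves the pattern unexpanded.
        if [ ! -e "$r1" ]; then
            echo "no *_R1.fastq files in $illumina_dir"
            return 1
        fi
        name=$(basename "$r1" _R1.fastq)
        if [ ! -e "$illumina_dir/${name}_R2.fastq" ]; then
            echo "missing ${name}_R2.fastq in $illumina_dir"
            status=1
        fi
        if [ ! -e "$ont_dir/${name}.fastq" ]; then
            echo "missing ${name}.fastq in $ont_dir"
            status=1
        fi
    done
    return $status
}

# Example call (placeholder directories):
# check_basenames /path/to/my/illumina_dir /path/to/my/ont_dir
```

If both directories pass, the command from earlier in this thread would, with the corrections from points 4 and 6 applied, look something like: nextflow run /path/to/my/MAFIN/main.nf --output /path/to/my/output --assembler metaspades --illumina /path/to/my/illumina_dir --ont /path/to/my/ont_dir --cpus 60 --memory 500g -profile conda (illumina_dir and ont_dir are placeholder directory names).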

For the checkm/spades error, it is possible that both checkm setup and spades failed, in which case nextflow reports only one of them in its display. Could you upload the .nextflow.log file (it's in the directory where you executed your nextflow command) as well as the .command.err .command.sh .command.log of the checkm and spades process? The .command files are in the respective working directories of setupcheckm and spades, /work/??/??????!!!!!!!!!/.command.err, where the ? stand for the process ID (here spades is bf/42661e) and the ! for the rest of the directory name (pressing tab should save you from typing it).
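
As a rough illustration of that lookup, here is a minimal sketch. `show_task_files` is a hypothetical helper, not something shipped with MUFFIN or nextflow; it only relies on the directory layout described above and on the fact that the .command.* files are hidden (their names start with a dot).

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the hidden .command.* files of a nextflow task,
# given the work directory and the short ID shown in the progress output
# (e.g. bf/42661e for the spades task in this thread).
show_task_files() {
    local work_dir="$1"
    local short_id="$2"
    local task_dir
    # The short ID is only a prefix of the real directory name, so a glob completes it.
    for task_dir in "$work_dir"/"$short_id"*/; do
        if [ ! -d "$task_dir" ]; then
            echo "no task directory matching $short_id in $work_dir"
            return 1
        fi
        echo "$task_dir"
        # -a is needed: a plain ls does not show names that start with a dot.
        ls -a "$task_dir" | grep '^\.command\.'
    done
}

# Example call (placeholder path):
# show_task_files /path/to/launch/dir/work bf/42661e
```

The printed .command.err, .command.sh and .command.log are the files asked for above.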

Besides the 2 bigger issues (eggnog and checkm/spades) all the others are either answered here or pushed on the latest version of master. If you need more detailed answers or find new issues feel free to post and we'll answer as soon as possible.

RemiMaglione commented 4 years ago

Hello,

First of all, could you tell me which version you are using: the master branch or the legacy 0.1?

The master branch, I think (I downloaded MAFIN with git clone https://github.com/RVanDamme/MAFIN.git last week)

For the Eggnog download database issue, please open a new issue.

Done

in the command you use (nextflow run /path/to/my/MAFIN...

Sorry, my mistake: I actually provide only the path, not the path + the file as written in this issue.

Checkm indeed shouldn't take too long, I invite you to open another issue for checkm only to keep everything clear

Done

Do you know which of the 2 processes (download and setup) was the one taking so long?

Unfortunately not; I got past this problem quickly because I ran it manually right away.

Could you upload the .nextflow.log

.nextflow.log

Could you upload the .nextflow.log file (it's in the directory where you executed your nextflow command) as well as the .command.err .command.sh .command.log of the checkm and spades process?

  1. A strange thing is that the 'work' folder I had on my output folder was from a previous attempt. After digging a bit in the subfolder, I found that the actual work folder we are interested in fall under /path/to/my/output/nextflow-autodownload-databases/checkm/db/work/
  2. On that folder I did find the Spades "process" folder but not the .command.err .command.sh .command.log files. I have 3 symbolic link pointing clean.fastq files (probably yielded by the fastp step) and a spades_output folder (containing: configs corrected dataset.info input_dataset.yaml K21 K33 K55 misc params.txt pipeline_state run_spades.sh run_spades.yaml spades.log tmp)

  3. On that work folder, the checkm '23' folder process was empty and I go over all the other work subfolder and nothing look like .command.* checkm log... Sorry

Thank you for all your answers

RVanDamme commented 4 years ago

1. A strange thing is that the 'work' folder I had on my output folder was from a previous attempt. After digging a bit in the subfolder, I found that the actual work folder we are interested in fall under `/path/to/my/output/nextflow-autodownload-databases/checkm/db/work/`

Nextflow creates the 'work' directory where you run the command; in this case, you probably ran your command with /path/to/my/output/nextflow-autodownload-databases/checkm/db/work/ as the current directory.

2. On that folder I did find the Spades "process" folder but not the .command.err .command.sh .command.log files. I have 3 symbolic link pointing clean.fastq files (probably yielded by the fastp step) and a spades_output folder (containing:` configs  corrected  dataset.info  input_dataset.yaml  K21  K33  K55  misc  params.txt  pipeline_state  run_spades.sh  run_spades.yaml  spades.log  tmp`)

Did you run ls -a in the folder? If not, you should run it to find the files. If you did, it probably means that nextflow passed the files needed to run spades but did not start the process, or that the process was killed before starting.

3. On that work folder, the checkm '23' folder process was empty and I go over all the other work subfolder and nothing look like .command.* checkm log... Sorry

The .command.* files are not in the checkm subfolder but in the folder that contains the whole process (e.g. work/??/????????/.command.err). If you are in that directory (or you put the path in the command), just run ls -a or ls -ltra and you should see the files. If the files are still not there, it is a weird issue linked to nextflow behavior.

RemiMaglione commented 4 years ago

I finally found the files: spades_step_commandX.zip

Another thing: I tried to run Spades by myself. It crashes at the error-correction step. Now I wonder if the Spades issue comes from my side: I work with fairly large files from a NovaSeq sequencing run (100M reads per sample minimum). I'm running everything on a server that has 64 cores and 500 GB of RAM. I'll continue to debug that step on my side, but the crash may have occurred due to lack of RAM.

replikation commented 4 years ago

@RemiMaglione can you give us the .nextflow.log file too? It's directly in the working dir where you executed nextflow (it saves the last 10 runs in 10 files, e.g. .nextflow.log.1); the newest is .nextflow.log, the oldest .nextflow.log.9.
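
Before uploading, the relevant lines can be pulled out of all of those files at once. The following is only a sketch: `scan_nextflow_logs` is a hypothetical helper, and it assumes that each log line carries its level (ERROR, WARN, ...) as an upper-case word surrounded by spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ERROR and WARN lines from the current and
# rotated nextflow logs in a launch directory, prefixed with the file name.
# Assumes the log level appears as an upper-case word on each log line.
scan_nextflow_logs() {
    local launch_dir="$1"
    # -H prints the file name even when only one log file exists.
    grep -H -E ' (ERROR|WARN) ' "$launch_dir"/.nextflow.log*
}

# Example call (placeholder path):
# scan_nextflow_logs /path/to/launch/dir
```

The file-name prefix shows which of the runs each message belongs to.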

replikation commented 4 years ago

Regarding the RAM error: if it's a RAM issue, it's usually exit code 146 or so, at least if you are using nextflow.
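
One way to look at the exit codes directly is sketched below. `list_failed_tasks` is a hypothetical helper; it assumes nextflow's usual layout, in which each task directory holds a hidden .exitcode file once the task script has finished (a task killed before that point will have none). The exact number depends on the signal: a process killed with SIGKILL, which is what the Linux out-of-memory killer sends, exits with 137 (128 + 9).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<exit code> <task directory>" for every task
# under a nextflow work directory whose recorded exit status is not 0.
# Assumes each finished task directory contains a hidden .exitcode file.
list_failed_tasks() {
    local work_dir="$1"
    local f code
    find "$work_dir" -name .exitcode | while IFS= read -r f; do
        code=$(cat "$f")
        if [ "$code" != "0" ]; then
            echo "$code $(dirname "$f")"
        fi
    done
}

# Example call (placeholder path):
# list_failed_tasks /path/to/launch/dir/work
```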

RemiMaglione commented 4 years ago

can you give us the .nextflow.log file too?

Sure: .nextflow.log

replikation commented 4 years ago

@RVanDamme i think i found the error:

thx for the log, helped to figure out the error

RVanDamme commented 4 years ago

Hello @RemiMaglione, version 1.0.0 of MUFFIN just got released. This should solve most of the issues you faced while trying the pre-release. If you face issues identical to the ones reported here, or new ones, please feel free to open new issues and report them to us. I will close this issue for now.