PacificBiosciences / HiFi-16S-workflow

Nextflow pipeline to analyze PacBio HiFi full-length 16S data
BSD 3-Clause Clear License

error running 5 Sequel-IIe samples with docker #13

Closed splaisan closed 1 year ago

splaisan commented 1 year ago

Hi again,

I ran my first 5 real samples using your nice tool in docker mode (the example run succeeded before this one). The samples are 5 identical Zymo DNA community standard D6305 aliquots, run with different barcodes in 5 different 16S experiments.

I ran this command in the pb-16S-nf folder, with more CPUs allocated to the jobs since I have 88 threads and 512GB RAM available:

nextflow run main.nf \
  --input /data/analyses/Zymo-SequelIIe-Hifi/run_samples.tsv \
  --metadata /data/analyses/Zymo-SequelIIe-Hifi/run_metadata.tsv \
  --outdir /data/analyses/Zymo-SequelIIe-Hifi/results \
  --dada2_cpu 24 \
  --vsearch_cpu 24 \
  --cutadapt_cpu 48 \
  -profile docker

My input files are attached and are modeled on the example files.

inputs.tgz

An archive of the work folder with logs is also attached

7545b2ea952a3f15532c70d815c2bc.tgz

Can you please help me correct this issue and run my first real analysis?

Thanks in advance

nextflow run main.nf --input /data/analyses/Zymo-SequelIIe-Hifi/run_samples.tsv --metadata /data/analyses/Zymo-SequelIIe-Hifi/run_metadata.tsv --outdir /data/analyses/Zymo-SequelIIe-Hifi/results -profile docker --dada2_cpu 24 --vsearch_cpu 24 --cutadapt_cpu 48
N E X T F L O W  ~  version 22.10.0
Launching `main.nf` [awesome_chandrasekhar] DSL2 - revision: 6c347af324

  Parameters set for pb-16S-nf pipeline for PacBio HiFi 16S
  =========================================================
  Number of samples in samples TSV: 5
  Filter input reads above Q: 20
  Trim primers with cutadapt: Yes
  Forward primer: AGRGTTYGATYMTGGCTCAG
  Reverse primer: AAGTCGTAACAAGGTARCY
  Minimum amplicon length filtered in DADA2: 1000
  Maximum amplicon length filtered in DADA2: 1600
  maxEE parameter for DADA2 filterAndTrim: 2
  minQ parameter for DADA2 filterAndTrim: 0
  Pooling method for DADA2 denoise process: pseudo
  Minimum number of samples required to keep any ASV: 1
  Minimum number of reads required to keep any ASV: 5 
  Taxonomy sequence database for VSEARCH: /opt/biotools/pb-16S-nf/databases/GTDB_ssu_all_r207.qza
  Taxonomy annotation database for VSEARCH: /opt/biotools/pb-16S-nf/databases/GTDB_ssu_all_r207.taxonomy.qza
  Skip Naive Bayes classification: false
  SILVA database for Naive Bayes classifier: /opt/biotools/pb-16S-nf/databases/silva_nr99_v138.1_wSpecies_train_set.fa.gz
  GTDB database for Naive Bayes classifier: /opt/biotools/pb-16S-nf/databases/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz
  RefSeq + RDP database for Naive Bayes classifier: /opt/biotools/pb-16S-nf/databases/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz
  VSEARCH maxreject: 100
  VSEARCH maxaccept: 100
  VSEARCH perc-identity: 0.97
  QIIME 2 rarefaction curve sampling depth: null
  Number of threads specified for cutadapt: 48
  Number of threads specified for DADA2: 24
  Number of threads specified for VSEARCH: 24
  Script location for HTML report generation: /opt/biotools/pb-16S-nf/scripts/visualize_biom.Rmd
  Container enabled via docker/singularity: true
  Version of Nextflow pipeline: 0.4

executor >  Local (27)
[52/8a912a] process > pb16S:write_log                      [100%] 1 of 1 ✔
[0d/17c679] process > pb16S:QC_fastq (3)                   [100%] 5 of 5 ✔
[f6/6875c2] process > pb16S:cutadapt (5)                   [100%] 5 of 5 ✔
[55/756ad4] process > pb16S:QC_fastq_post_trim (5)         [100%] 5 of 5 ✔
[11/735784] process > pb16S:collect_QC                     [100%] 1 of 1 ✔
[14/9c8073] process > pb16S:prepare_qiime2_manifest        [100%] 1 of 1 ✔
[fa/9d21ec] process > pb16S:import_qiime2                  [100%] 1 of 1 ✔
[ea/244ecc] process > pb16S:demux_summarize                [100%] 1 of 1 ✔
[4f/2ddff8] process > pb16S:dada2_denoise                  [100%] 1 of 1 ✔
[c5/ed6aaa] process > pb16S:filter_dada2                   [100%] 1 of 1 ✔
[a1/058f87] process > pb16S:dada2_qc (1)                   [100%] 1 of 1 ✔
[-        ] process > pb16S:qiime2_phylogeny_diversity (1) -
[46/7545b2] process > pb16S:dada2_rarefaction (1)          [100%] 1 of 1, failed: 1 ✘
[-        ] process > pb16S:class_tax                      -
[-        ] process > pb16S:dada2_assignTax                -
[-        ] process > pb16S:export_biom                    -
[-        ] process > pb16S:barplot_nb                     -
[-        ] process > pb16S:barplot                        -
[-        ] process > pb16S:html_rep                       -
[-        ] process > pb16S:krona_plot                     -
Error executing process > 'pb16S:dada2_rarefaction (1)'

Caused by:
  Process `pb16S:dada2_rarefaction (1)` terminated with an error exit status (1)

Command executed:

  qiime diversity alpha-rarefaction --i-table dada2-ccs_table_filtered.qza     --m-metadata-file run_metadata.tsv     --o-visualization alpha-rarefaction-curves.qzv     --p-min-depth 10 --p-max-depth 36322

Command exit status:
  1

Command output:
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
  Plugin error from diversity:

    All metadata filtered after dropping columns that contained non-categorical data.

  Debug info has been saved to /tmp/qiime2-q2cli-err-nq73n8tw.log

Work dir:
  /opt/biotools/pb-16S-nf/work/46/7545b2ea952a3f15532c70d815c2bc

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
proteinosome commented 1 year ago

Hi Stephane, condition should not be a continuous variable, as the error suggests. Can you change the condition values into categorical ones such as "control"? There might be a parameter in QIIME I can make use of to fix this; I will take a look. Thanks.
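A minimal sketch of a metadata file whose condition column QIIME 2 will read as categorical. The `#q2:types` directive row is a standard QIIME 2 metadata feature that forces the column type; the sample IDs here are made up for illustration:

```shell
# Write a metadata TSV whose "condition" column is explicitly categorical,
# so QIIME 2 will not drop it as non-categorical (sample IDs are hypothetical).
printf 'sample-id\tcondition\n'    >  run_metadata.tsv
printf '#q2:types\tcategorical\n'  >> run_metadata.tsv
printf 'zymo_rep1\tcontrol\n'      >> run_metadata.tsv
printf 'zymo_rep2\tcontrol\n'      >> run_metadata.tsv
cat run_metadata.tsv
```

Simply using a non-numeric value like "control" (as suggested above) also works, since QIIME 2 then infers the column as categorical on its own.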

splaisan commented 1 year ago

Indeed, I had used the experiment numbers in the condition column, not realizing they would be read as numbers rather than strings. I replaced them with "control" and ran it again, leading to a new issue detailed below.

In my command I have my inputs and outputs on a partition /data/analyses/Zymo-SequelIIe-Hifi distinct from the nextflow partition /opt/biotools/pb-16S-nf.

The new run with condition set to "control" went well until the end, then failed while generating the HTML report from the Rmd template and the final results.

The failure was

cp: cannot stat '/opt/biotools/pb-16S-nf/scripts/visualize_biom.Rmd': No such file or directory

while copying the visualize_biom.Rmd template from the pipeline's scripts folder to the results folder.

nextflow run main.nf --input /data/analyses/Zymo-SequelIIe-Hifi/run_samples.tsv --metadata /data/analyses/Zymo-SequelIIe-Hifi/run_metadata.tsv --outdir /data/analyses/Zymo-SequelIIe-Hifi/results4 -profile docker --dada2_cpu 64 --vsearch_cpu 64 --cutadapt_cpu 64
N E X T F L O W  ~  version 22.10.0
Launching `main.nf` [sleepy_hugle] DSL2 - revision: 6c347af324

  Parameters set for pb-16S-nf pipeline for PacBio HiFi 16S
  =========================================================
  Number of samples in samples TSV: 5
  Filter input reads above Q: 20
  Trim primers with cutadapt: Yes
  Forward primer: AGRGTTYGATYMTGGCTCAG
  Reverse primer: AAGTCGTAACAAGGTARCY
  Minimum amplicon length filtered in DADA2: 1000
  Maximum amplicon length filtered in DADA2: 1600
  maxEE parameter for DADA2 filterAndTrim: 2
  minQ parameter for DADA2 filterAndTrim: 0
  Pooling method for DADA2 denoise process: pseudo
  Minimum number of samples required to keep any ASV: 1
  Minimum number of reads required to keep any ASV: 5 
  Taxonomy sequence database for VSEARCH: /opt/biotools/pb-16S-nf/databases/GTDB_ssu_all_r207.qza
  Taxonomy annotation database for VSEARCH: /opt/biotools/pb-16S-nf/databases/GTDB_ssu_all_r207.taxonomy.qza
  Skip Naive Bayes classification: false
  SILVA database for Naive Bayes classifier: /opt/biotools/pb-16S-nf/databases/silva_nr99_v138.1_wSpecies_train_set.fa.gz
  GTDB database for Naive Bayes classifier: /opt/biotools/pb-16S-nf/databases/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz
  RefSeq + RDP database for Naive Bayes classifier: /opt/biotools/pb-16S-nf/databases/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz
  VSEARCH maxreject: 100
  VSEARCH maxaccept: 100
  VSEARCH perc-identity: 0.97
  QIIME 2 rarefaction curve sampling depth: null
  Number of threads specified for cutadapt: 64
  Number of threads specified for DADA2: 64
  Number of threads specified for VSEARCH: 64
  Script location for HTML report generation: /opt/biotools/pb-16S-nf/scripts/visualize_biom.Rmd
  Container enabled via docker/singularity: true
  Version of Nextflow pipeline: 0.4

executor >  Local (32)
[b4/955cb3] process > pb16S:write_log                      [100%] 1 of 1 ✔
[a9/80429e] process > pb16S:QC_fastq (3)                   [100%] 5 of 5 ✔
[4f/645346] process > pb16S:cutadapt (5)                   [100%] 5 of 5 ✔
[ac/6fb92a] process > pb16S:QC_fastq_post_trim (5)         [100%] 5 of 5 ✔
[e5/e9de98] process > pb16S:collect_QC                     [100%] 1 of 1 ✔
[28/aa66c5] process > pb16S:prepare_qiime2_manifest        [100%] 1 of 1 ✔
[cf/81dca4] process > pb16S:import_qiime2                  [100%] 1 of 1 ✔
[d5/bdff12] process > pb16S:demux_summarize                [100%] 1 of 1 ✔
[d6/e9dd53] process > pb16S:dada2_denoise                  [100%] 1 of 1 ✔
[77/59bcbb] process > pb16S:filter_dada2                   [100%] 1 of 1 ✔
[45/e4ea1e] process > pb16S:dada2_qc (1)                   [100%] 1 of 1 ✔
[fa/cf7381] process > pb16S:qiime2_phylogeny_diversity (1) [100%] 1 of 1 ✔
[77/cf0097] process > pb16S:dada2_rarefaction (1)          [100%] 1 of 1 ✔
[49/a20b2d] process > pb16S:class_tax                      [100%] 1 of 1 ✔
[fc/bfbfc2] process > pb16S:dada2_assignTax                [100%] 1 of 1 ✔
[-        ] process > pb16S:export_biom                    -
[ba/7a7dbd] process > pb16S:barplot_nb (1)                 [100%] 1 of 1 ✔
[-        ] process > pb16S:barplot (1)                    -
[39/434c7f] process > pb16S:html_rep (1)                   [100%] 1 of 1, failed: 1 ✘
[-        ] process > pb16S:krona_plot                     -
Error executing process > 'pb16S:html_rep (1)'

Caused by:
  Process `pb16S:html_rep (1)` terminated with an error exit status (1)

Command executed:

  export R_LIBS_USER="/opt/conda/envs/pb-16S-vis/lib/R/library"
  cp /opt/biotools/pb-16S-nf/scripts/visualize_biom.Rmd visualize_biom.Rmd
  cp /opt/biotools/pb-16S-nf/scripts/import_biom.R import_biom.R
  Rscript -e 'rmarkdown::render("visualize_biom.Rmd", params=list(merged_tax_tab_file="/opt/biotools/pb-16S-nf/work/fc/bfbfc230509df228884409566d85e6/best_tax_merged_freq_tax.tsv", metadata="run_metadata.tsv", sample_file="samplefile.txt", dada2_qc="dada2_qc.tsv", reads_qc="all_samples_seqkit.readstats.tsv", summarised_reads_qc="seqkit.summarised_stats.group_by_samples.tsv", cutadapt_qc="all_samples_cutadapt_stats.tsv", vsearch_tax_tab_file="vsearch_merged_freq_tax.tsv", colorby="condition", bray_mat="bray_curtis_distance_matrix.tsv", unifrac_mat="unweighted_unifrac_distance_matrix.tsv", wunifrac_mat="weighted_unifrac_distance_matrix.tsv", post_trim_readstats="all_samples_seqkit.readstats.post_trim.tsv"), output_dir="./")'

Command exit status:
  1

Command output:
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  cp: cannot stat '/opt/biotools/pb-16S-nf/scripts/visualize_biom.Rmd': No such file or directory

Work dir:
  /opt/biotools/pb-16S-nf/work/39/434c7fc89fccb5c012de7fa3ad1410

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

On advice from my colleague Kobe I added the following to the docker block of nextflow.config, but this did not help:

    BPATH = "/opt/biotools/pb-16S-nf"
    runOptions = "-v $BPATH/scripts:$BPATH/scripts"

I still get the same error after re-running
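For reference, a syntactically valid form of that mount override would sit in the docker scope of nextflow.config; `runOptions` is a standard Nextflow docker-scope setting. This is only a sketch using the path from this thread, and as noted it did not resolve the error here:

```groovy
// nextflow.config — docker scope, binding the pipeline's scripts folder
// into the container at the same path (path taken from this thread)
docker {
    enabled    = true
    runOptions = "-v /opt/biotools/pb-16S-nf/scripts:/opt/biotools/pb-16S-nf/scripts"
}
```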

Back to you :-)

434c7fc89fccb5c012de7fa3ad1410.tgz

proteinosome commented 1 year ago

Hi Stephane, you mentioned that the test run finished successfully? My suspicion is that what you just saw has to do with how Docker and Nextflow mount the directory, and I'm working on reproducing this to confirm. But I am curious that the test run completed without any issue, because both runs should be using the exact same command.

proteinosome commented 1 year ago

I've made some changes to how the container works in the develop branch. Can you check out and pull the develop branch:

git checkout develop
git pull

And rerun the samples? You can use the -resume flag so that Nextflow won't rerun everything.

Thanks.

splaisan commented 1 year ago

I did it as clean as I could:

nextflow run main.nf --download_db
N E X T F L O W  ~  version 22.10.0
Launching `main.nf` [magical_brattain] DSL2 - revision: 6c347af324
No input file given to --input!

  Parameters set for pb-16S-nf pipeline for PacBio HiFi 16S
  =========================================================
  Number of samples in samples TSV: 0
  Filter input reads above Q: 20
  Trim primers with cutadapt: Yes
  Forward primer: AGRGTTYGATYMTGGCTCAG
  Reverse primer: AAGTCGTAACAAGGTARCY
  Minimum amplicon length filtered in DADA2: 1000
  Maximum amplicon length filtered in DADA2: 1600
  maxEE parameter for DADA2 filterAndTrim: 2
  minQ parameter for DADA2 filterAndTrim: 0
  Pooling method for DADA2 denoise process: pseudo
  Minimum number of samples required to keep any ASV: 0
  Minimum number of reads required to keep any ASV: 0 
  Taxonomy sequence database for VSEARCH: /opt/biotools/pb-16S-nf_develop/databases/GTDB_ssu_all_r207.qza
  Taxonomy annotation database for VSEARCH: /opt/biotools/pb-16S-nf_develop/databases/GTDB_ssu_all_r207.taxonomy.qza
  Skip Naive Bayes classification: false
  SILVA database for Naive Bayes classifier: /opt/biotools/pb-16S-nf_develop/databases/silva_nr99_v138.1_wSpecies_train_set.fa.gz
  GTDB database for Naive Bayes classifier: /opt/biotools/pb-16S-nf_develop/databases/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz
  RefSeq + RDP database for Naive Bayes classifier: /opt/biotools/pb-16S-nf_develop/databases/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz
  VSEARCH maxreject: 100
  VSEARCH maxaccept: 100
  VSEARCH perc-identity: 0.97
  QIIME 2 rarefaction curve sampling depth: null
  Number of threads specified for cutadapt: 16
  Number of threads specified for DADA2: 8
  Number of threads specified for VSEARCH: 8
  Script location for HTML report generation: /opt/biotools/pb-16S-nf_develop/scripts/visualize_biom.Rmd
  Container enabled via docker/singularity: false
  Version of Nextflow pipeline: 0.4

[-        ] process > pb16S:download_db -
Creating env using mamba: /opt/biotools/pb-16S-nf_develop/env/qiime2-2022.2-py38-linux-conda.yml [cache /home/luna.kuleuven.be/u0002316/nf_conda/qiime2-2022.2-py38-linux-conda-0f9d5d3ab8a678e45d02e511ce6c0160]

Now it hangs forever creating the env, leaving a lock file .qiime2-2022.2-py38-linux-conda-0f9d5d3ab8a678e45d02e511ce6c0160.lock in $HOME/nf_conda, instead of downloading the databases as requested.

I have seen this every time I started over installing your tool, and each time I had to install the qiime env manually using the exact same command the Nextflow pipeline runs (I looked it up in ps aux and replicated it).

It does not seem to be related to my home folder, as the failure is the same whether or not I edit nextflow.config as I did before.

Is there a way to install the 3 conda envs manually before doing anything else, maybe with a separate command or script like the one for --download_db? That way we could make sure all 3 conda envs are built correctly before proceeding.

S

proteinosome commented 1 year ago

Hi Stephane, please add "-profile docker" to the run. The docker profile does not need to install any Conda environment.

splaisan commented 1 year ago

OK, I see. Maybe you should edit your doc, which proposes to do this using conda (also, your run commands end with a trailing backslash that should be removed, and there are spaces after the other backslashes that interfere with copy-pasting them).

BTW, I have now added mamba and nextflow to my conda base env to see if that was causing all these troubles, and I get exactly the same problem.

It seems that the command that creates the qiime env runs quite fast when run manually but hangs when run inside the pipeline (whatever that may mean):

/opt/miniconda3/bin/python /opt/miniconda3/bin/mamba env create --prefix /home/luna.kuleuven.be/u0002316/nf_conda/qiime2-2022.2-py38-linux-conda-0f9d5d3ab8a678e45d02e511ce6c0160 --file /opt/biotools/pb-16S-nf_develop/env/qiime2-2022.2-py38-linux-conda.yml

cheers S

proteinosome commented 1 year ago

I've noticed that, too. It happens on some systems and not others, and unfortunately I have not been able to figure out why. Setting useMamba=false will usually resolve it, but building the environment will take longer.
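For reference, that toggle lives in the conda scope of nextflow.config (`conda.useMamba` is a standard Nextflow setting); a sketch of the relevant fragment:

```groovy
// nextflow.config — fall back from mamba to plain conda for env creation
conda {
    useMamba = false
}
```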

Sure, I will improve the documentation. Thanks for the suggestion.

FYI, the databases are the same; you don't have to keep rerunning the --download_db step. Just copy the "databases" folder from the other pipeline folder where you've already downloaded them, and jump straight into running your samples in docker mode; there should not be any need to fiddle with Conda besides installing Nextflow. In fact, if you install Nextflow without Conda you can run everything without touching Conda at all :) Docker runs everything in a container, so it will not be affected by your Conda environment.
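Reusing the already-downloaded databases amounts to a single copy; a sketch, with both clone paths taken from this thread (adjust to your layout):

```shell
# Reuse the databases from a previous clone instead of re-running --download_db.
# OLD_CLONE and NEW_CLONE are assumptions based on the paths in this thread.
OLD_CLONE=/opt/biotools/pb-16S-nf
NEW_CLONE=/opt/biotools/pb-16S-nf_develop
cp -r "$OLD_CLONE/databases" "$NEW_CLONE/databases"
```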

splaisan commented 1 year ago

GOOD! (I will install nextflow system-wide with apt)

this worked for me from clean start:

echo -e "sample-id\tabsolute-filepath\ntest_data\t$(readlink -f test_data/test_1000_reads.fastq.gz)" \
  > test_data/test_sample.tsv
nextflow run main.nf \
  --download_db \
  -profile docker
nextflow run main.nf \
  --input test_data/test_sample.tsv \
  --metadata test_data/test_metadata.tsv \
  --outdir test_results \
  -profile docker 

the run completed after ~6min and all expected files are present in the test_results folder

now testing my own 5-sample dataset; I'll report back below ;-)

splaisan commented 1 year ago

Not good, but different from before; maybe we're getting close.

nextflow run main.nf \
  --input /data/analyses/Zymo-SequelIIe-Hifi/run_samples.tsv \
  --metadata /data/analyses/Zymo-SequelIIe-Hifi/run_metadata.tsv \
  --outdir /data/analyses/Zymo-SequelIIe-Hifi/results \
  --dada2_cpu 80 \
  --vsearch_cpu 80 \
  --cutadapt_cpu 80 \
  -profile docker

BTW, I have 88 threads & 512GB RAM, can you advise ideal cpu values above?

resulting in the stdout below. It looks like the R Markdown render command does not see the input files (missing path?).

(base) u0002316@gbw-s-pacbio01:/opt/biotools/pb-16S-nf_develop $ nextflow run main.nf \
  --input /data/analyses/Zymo-SequelIIe-Hifi/run_samples.tsv \
  --metadata /data/analyses/Zymo-SequelIIe-Hifi/run_metadata.tsv \
  --outdir /data/analyses/Zymo-SequelIIe-Hifi/results \
  --dada2_cpu 80 \
  --vsearch_cpu 80 \
  --cutadapt_cpu 80 \
  -profile docker
N E X T F L O W  ~  version 22.10.1
Launching `main.nf` [romantic_agnesi] DSL2 - revision: 6c347af324

  Parameters set for pb-16S-nf pipeline for PacBio HiFi 16S
  =========================================================
  Number of samples in samples TSV: 5
  Filter input reads above Q: 20
  Trim primers with cutadapt: Yes
  Forward primer: AGRGTTYGATYMTGGCTCAG
  Reverse primer: AAGTCGTAACAAGGTARCY
  Minimum amplicon length filtered in DADA2: 1000
  Maximum amplicon length filtered in DADA2: 1600
  maxEE parameter for DADA2 filterAndTrim: 2
  minQ parameter for DADA2 filterAndTrim: 0
  Pooling method for DADA2 denoise process: pseudo
  Minimum number of samples required to keep any ASV: 1
  Minimum number of reads required to keep any ASV: 5 
  Taxonomy sequence database for VSEARCH: /opt/biotools/pb-16S-nf_develop/databases/GTDB_ssu_all_r207.qza
  Taxonomy annotation database for VSEARCH: /opt/biotools/pb-16S-nf_develop/databases/GTDB_ssu_all_r207.taxonomy.qza
  Skip Naive Bayes classification: false
  SILVA database for Naive Bayes classifier: /opt/biotools/pb-16S-nf_develop/databases/silva_nr99_v138.1_wSpecies_train_set.fa.gz
  GTDB database for Naive Bayes classifier: /opt/biotools/pb-16S-nf_develop/databases/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz
  RefSeq + RDP database for Naive Bayes classifier: /opt/biotools/pb-16S-nf_develop/databases/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz
  VSEARCH maxreject: 100
  VSEARCH maxaccept: 100
  VSEARCH perc-identity: 0.97
  QIIME 2 rarefaction curve sampling depth: null
  Number of threads specified for cutadapt: 80
  Number of threads specified for DADA2: 80
  Number of threads specified for VSEARCH: 80
  Script location for HTML report generation: /opt/biotools/pb-16S-nf_develop/scripts/visualize_biom.Rmd
  Container enabled via docker/singularity: true
  Version of Nextflow pipeline: 0.4

executor >  Local (32)
[ed/e2e039] process > pb16S:write_log                      [100%] 1 of 1 ✔
[29/9f8dbc] process > pb16S:QC_fastq (3)                   [100%] 5 of 5 ✔
[87/2acfb2] process > pb16S:cutadapt (5)                   [100%] 5 of 5 ✔
[50/4240c9] process > pb16S:QC_fastq_post_trim (5)         [100%] 5 of 5 ✔
[30/729570] process > pb16S:collect_QC                     [100%] 1 of 1 ✔
[93/14eed9] process > pb16S:prepare_qiime2_manifest        [100%] 1 of 1 ✔
[ce/c2b9b5] process > pb16S:import_qiime2                  [100%] 1 of 1 ✔
[ef/6b663f] process > pb16S:demux_summarize                [100%] 1 of 1 ✔
[50/44172d] process > pb16S:dada2_denoise                  [100%] 1 of 1 ✔
[24/1cd2c0] process > pb16S:filter_dada2                   [100%] 1 of 1 ✔
[32/ce8bc1] process > pb16S:dada2_qc (1)                   [100%] 1 of 1 ✔
[7d/2e2397] process > pb16S:qiime2_phylogeny_diversity (1) [100%] 1 of 1 ✔
[a5/2972ed] process > pb16S:dada2_rarefaction (1)          [100%] 1 of 1 ✔
[91/8e82f5] process > pb16S:class_tax                      [100%] 1 of 1 ✔
[38/d1e917] process > pb16S:dada2_assignTax                [100%] 1 of 1 ✔
[-        ] process > pb16S:export_biom                    -
[41/4ac036] process > pb16S:barplot_nb (1)                 [100%] 1 of 1 ✔
[-        ] process > pb16S:barplot (1)                    -
[ad/de7db7] process > pb16S:html_rep (1)                   [100%] 1 of 1, failed: 1 ✘
[-        ] process > pb16S:krona_plot                     -
Error executing process > 'pb16S:html_rep (1)'

Caused by:
  Process `pb16S:html_rep (1)` terminated with an error exit status (1)

Command executed:

  export R_LIBS_USER="/opt/conda/envs/pb-16S-vis/lib/R/library"
  cp /opt/biotools/pb-16S-nf_develop/scripts/visualize_biom.Rmd visualize_biom.Rmd
  cp /opt/biotools/pb-16S-nf_develop/scripts/import_biom.R import_biom.R
  Rscript -e 'rmarkdown::render("visualize_biom.Rmd", params=list(merged_tax_tab_file="/opt/biotools/pb-16S-nf_develop/work/38/d1e917ae6103a78cb57f1deea8eb8a/best_tax_merged_freq_tax.tsv", metadata="run_metadata.tsv", sample_file="samplefile.txt", dada2_qc="dada2_qc.tsv", reads_qc="all_samples_seqkit.readstats.tsv", summarised_reads_qc="seqkit.summarised_stats.group_by_samples.tsv", cutadapt_qc="all_samples_cutadapt_stats.tsv", vsearch_tax_tab_file="vsearch_merged_freq_tax.tsv", colorby="condition", bray_mat="bray_curtis_distance_matrix.tsv", unifrac_mat="unweighted_unifrac_distance_matrix.tsv", wunifrac_mat="weighted_unifrac_distance_matrix.tsv", post_trim_readstats="all_samples_seqkit.readstats.post_trim.tsv"), output_dir="./")'

Command exit status:
  1

Command output:
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  cp: cannot stat '/opt/biotools/pb-16S-nf_develop/scripts/visualize_biom.Rmd': No such file or directory

Work dir:
  /opt/biotools/pb-16S-nf_develop/work/ad/de7db7e35fb422dafba170919b01d0

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Thanks!

proteinosome commented 1 year ago
  cp /opt/biotools/pb-16S-nf_develop/scripts/visualize_biom.Rmd visualize_biom.Rmd
  cp /opt/biotools/pb-16S-nf_develop/scripts/import_biom.R import_biom.R

These should not exist anymore in the develop branch (you can check main.nf directly in the repo here, under the html_rep process). Can you check main.nf in your workflow folder (pb_16s_develop) and make sure it's using the correct branch? You should not see the cp commands in the html_rep process.

splaisan commented 1 year ago

My very bad, I probably cloned main and named the folder develop. I'll start over.

splaisan commented 1 year ago

YES! It ran!

Very sorry about overlooking the cloning step; I forgot to add -b develop, as I thought the copied URL would point to the develop branch automatically.

Note that I also installed nextflow system-wide in the meantime and stopped using conda altogether (full docker + system NF).

Thank you very much for your continuous help and for correcting my missteps. This took quite some time, but my 5 repeats of the Zymo sample give very similar results, which broadly agree with the Zymo reference apart from a strong abundance bias for some taxa.

BTW: it also worked with data I/O on the remote share.

I will now hopefully deepen my knowledge of Q2 and get to know the other data produced by your nice (and fast) pipeline.

Cheers from Belgium!

proteinosome commented 1 year ago

Hi Stephane, happy to hear that it worked for you. Sorry for all the trouble, too. I'll reflect on your experience and see how I can make the pipeline more robust. I'm still puzzled about why Conda did not work for you, but I'm glad Docker is working well.

In general I use only 32 CPUs for the cpu parameters and have found that adequate. The pipeline can become very slow at high read depth because DADA2 pools all the reads for denoising. I could probably make it faster by letting it denoise samples individually, but that can drop very-low-abundance bacteria. In a future version I'll make that an option, though.

You mentioned that you ran Zymo. May I know which community you sequenced? I'm curious about the abundance bias. I've generally observed high correlation of abundance in all of the communities we've sequenced; Zymo D6323 in particular even showed high concordance with shotgun metagenomics.

splaisan commented 1 year ago

Thanks Khi Pin,

We add the old Zymo microbial community (D6305) to each run as an extra barcode to be able to spot technical issues with library prep. Here I used the Zymo samples from 5 unrelated runs and got the following results:

Screenshot 2022-11-09 at 13 52 25

Screenshot 2022-11-09 at 13 53 24

close but not identical to the expected equimolar representation.

Screenshot 2022-11-09 at 11 47 00

I think we can be happy with the correlation, and the database very likely has some effect on the findings.

Cheers, Stephane

proteinosome commented 1 year ago

Hi Stephane, I assume that the abundance shown is whole-microbe abundance. Some of the microbes carry more copies of the 16S gene, so Zymo also provides the abundances after normalising for 16S copy number. I.e., if you sequence 2 bacteria with metagenomics they may be 50:50, but if one bacterium has one 16S copy and the other has two 16S copies, you actually have about 33:67 in 16S abundance. I think if you normalise for that, the abundance should be even more correlated :) For D6305 it's documented on page 1 of the protocol PDF here: https://www.zymoresearch.com/collections/zymobiomics-microbial-community-standards/products/zymobiomics-microbial-community-dna-standard
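The copy-number effect can be sketched numerically: with equal cell counts, each organism's expected share of 16S reads is proportional to its 16S copy number. The organism names and copy numbers below are purely illustrative, not Zymo's actual values:

```shell
# Equal cell abundance, different 16S copy numbers (hypothetical organisms):
# expected 16S read share = copies / total copies.
# With copies 1 and 2, the shares come out to 33.3% and 66.7%.
awk 'BEGIN {
  copies["bugA"] = 1; copies["bugB"] = 2
  for (b in copies) total += copies[b]
  for (b in copies) printf "%s\t%.1f%%\n", b, 100 * copies[b] / total
}'
```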