MaestSi / MetONTIIME

A Meta-barcoding pipeline for analysing ONT data in QIIME2 framework
GNU General Public License v3.0

run fails on fastq data #2

Closed splaisan closed 5 years ago

splaisan commented 5 years ago

Hi Simone, I am posting on your GitHub repo rather than in my own thread.

I tried to launch the analysis from my three fq.gz files as you suggested, after renaming them BC01.fq.gz, BC02.fq.gz, and BC03.fq.gz, but it fails at many levels.

Let me summarize what I tried here:

edited config_MinION_mobile_lab.R to reflect my env

PIPELINE_DIR: /opt/biotools/MetONTIIME
MINICONDA_DIR: /opt/biotools/miniconda3/
#basecaller_dir
BASECALLER_DIR <- "/opt/ont/guppy/bin"
#NCBI-downloaded sequences (QIIME2 artifact)
#DB <- "/path/to/PRJNA33175_Bacterial_sequences.qza"
DB <- "/opt/biotools/MetONTIIME/PRJNA33175_Bacterial_sequences_sequence.qza"
#Taxonomy of NCBI-downloaded sequences (QIIME2 artifact)
#TAXONOMY <- "/path/to/PRJNA33175_taxonomy.qza"
TAXONOMY <- "/opt/biotools/MetONTIIME/PRJNA33175_Bacterial_sequences_taxonomy.qza"
#sample-metadata file describing samples metadata; it is created automatically if it doesn't exist, but the path should exist
SAMPLE_METADATA <- "/data2/analyses/MetONTIIME/sample-metadata.tsv"

all files/paths above exist except "/data2/analyses/MetONTIIME/sample-metadata.tsv"

renamed my three fastq.gz insilico PCR extracts (not fast5 but fastq right?) and put them in /data2/analyses/MetONTIIME/ = workdir

total 1.2G
-rw-r--r--  1 u0002316 domain users 384M Sep 27 16:07 BC01.fq.gz
-rw-r--r--  1 u0002316 domain users 100M Sep 27 16:07 BC02.fq.gz
-rw-r--r--  1 u0002316 domain users 654M Sep 27 16:07 BC03.fq.gz

ran the script from the MetONTIIME install folder which contains:

(MetONTIIME_env) u0002316@gbw-s-pacbio01:/opt/biotools/MetONTIIME$ ll
total 58M
drwxr-xr-x   6 u0002316 domain users 4.0K Sep 27 16:15 .
drwxrwxr-x 219 u0002316 domain users  12K Sep 26 09:41 ..
-rwxr-xr-x   1 u0002316 domain users 4.8K Sep 26 10:05 config_MinION_mobile_lab.R
drwxr-xr-x   3 u0002316 domain users 4.0K Sep 26 10:00 entrez_qiime
-rwxr-xr-x   1 u0002316 domain users 4.4K Sep 26 09:41 Evaluate_diversity.sh
drwxr-xr-x   8 u0002316 domain users 4.0K Sep 26 09:41 .git
-rwxr-xr-x   1 u0002316 domain users 2.0K Sep 26 09:41 Import_database.sh
-rwxr-xr-x   1 u0002316 domain users 1.4K Sep 26 09:41 install.sh
-rwxr-xr-x   1 u0002316 domain users 1014 Sep 26 09:41 Launch_MinION_mobile_lab.sh
-rwxr-xr-x   1 u0002316 domain users 3.9K Sep 26 09:41 MetONTIIME.sh
-rwxr-xr-x   1 u0002316 domain users  19K Sep 26 09:41 MinION_mobile_lab.R
-rw-r--r--   1 u0002316 domain users 2.2M Sep 16 15:08 PRJNA33175_Bacterial_sequences_accession_taxonomy.txt
-rw-r--r--   1 u0002316 domain users  31M Sep 16 13:50 PRJNA33175_Bacterial_sequences.fasta
-rw-r--r--   1 u0002316 domain users  490 Sep 16 15:08 PRJNA33175_Bacterial_sequences.log
-rw-r--r--   1 u0002316 domain users 5.8M Sep 16 15:12 PRJNA33175_Bacterial_sequences_sequence.qza
-rw-r--r--   1 u0002316 domain users 386K Sep 16 15:12 PRJNA33175_Bacterial_sequences_taxonomy.qza
-rwxr-xr-x   1 u0002316 domain users 9.6K Sep 26 09:41 README.md
-rwxr-xr-x   1 u0002316 domain users 1007 Sep 26 09:41 subsample_fast5.sh
drwxr-xr-x   3 u0002316 domain users 4.0K Sep 26 10:00 taxonomy
drwxr-xr-x   2 u0002316 domain users 4.0K Sep  8 19:48 Test_BC04_FLO-FLG001_SQK-RAB204
-rwxr-xr-x   1 u0002316 domain users  19M Sep 26 09:41 Test_BC04_FLO-FLG001_SQK-RAB204.zip
-rwxr-xr-x   1 u0002316 domain users   19 Sep 26 09:41 version.txt
(MetONTIIME_env) u0002316@gbw-s-pacbio01:/opt/biotools/MetONTIIME$ 

my command is:

(MetONTIIME_env) /opt/biotools/MetONTIIME$ ./MetONTIIME.sh /data2/analyses/MetONTIIME /data2/analyses/MetONTIIME/sample-metadata.tsv PRJNA33175_Bacterial_sequences_sequence.qza PRJNA33175_Bacterial_sequences_taxonomy.qza 84

Do you see what I did wrong? Thanks

The long error log

realpath: missing operand
Try 'realpath --help' for more information.
There was a problem importing /data2/analyses/MetONTIIME/manifest.txt:

  /data2/analyses/MetONTIIME/manifest.txt is not a(n) SingleEndFastqManifestPhred33V2 file:

  There was an issue with loading the metadata file:

  Metadata must contain at least one ID.

  There may be more errors present in the metadata file. To get a full report, sample/feature metadata files can be validated with Keemei: https://keemei.qiime2.org

  Find details on QIIME 2 metadata requirements here: https://docs.qiime2.org/2019.7/tutorials/metadata/

Usage: qiime vsearch dereplicate-sequences [OPTIONS]

  Dereplicate sequence data and create a feature table and feature
  representative sequences. Feature identifiers in the resulting artifacts
  will be the sha1 hash of the sequence defining each feature. If clustering
  of features into OTUs is desired, the resulting artifacts can be passed to
  the cluster_features_* methods in this plugin.

Inputs:
  --i-sequences ARTIFACT SampleData[Sequences] |
    SampleData[SequencesWithQuality] | SampleData[JoinedSequencesWithQuality]
                          The sequences to be dereplicated.         [required]
Parameters:
  --p-derep-prefix / --p-no-derep-prefix
                          Merge sequences with identical prefixes. If a
                          sequence is identical to the prefix of two or more
                          longer sequences, it is clustered with the shortest
                          of them. If they are equally long, it is clustered
                          with the most abundant.             [default: False]
Outputs:
  --o-dereplicated-table ARTIFACT FeatureTable[Frequency]
                          The table of dereplicated sequences.      [required]
  --o-dereplicated-sequences ARTIFACT FeatureData[Sequence]
                          The dereplicated sequences.               [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action. Or silence output
                          if execution is successful (silence is golden).
  --citations             Show citations and exit.
  --help                  Show this message and exit.

                    There was a problem with the command:                     
 (1/1) Invalid value for "--i-sequences": 'sequences.qza' is not a valid
  filepath
Usage: qiime vsearch cluster-features-de-novo [OPTIONS]

  Given a feature table and the associated feature sequences, cluster the
  features based on user-specified percent identity threshold of their
  sequences. This is not a general-purpose de novo clustering method, but
  rather is intended to be used for clustering the results of quality-
  filtering/dereplication methods, such as DADA2, or for re-clustering a
  FeatureTable at a lower percent identity than it was originally clustered
  at. When a group of features in the input table are clustered into a
  single feature, the frequency of that single feature in a given sample is
  the sum of the frequencies of the features that were clustered in that
  sample. Feature identifiers and sequences will be inherited from the
  centroid feature of each cluster. See the vsearch documentation for
  details on how sequence clustering is performed.

Inputs:
  --i-sequences ARTIFACT FeatureData[Sequence]
                          The sequences corresponding to the features in
                          table.                                    [required]
  --i-table ARTIFACT FeatureTable[Frequency]
                          The feature table to be clustered.        [required]
Parameters:
  --p-perc-identity PROPORTION Range(0, 1, inclusive_start=False,
    inclusive_end=True)   The percent identity at which clustering should be
                          performed. This parameter maps to vsearch's --id
                          parameter.                                [required]
  --p-threads INTEGER Range(0, 256, inclusive_end=True)
                          The number of threads to use for computation.
                          Passing 0 will launch one thread per CPU core.
                                                                  [default: 1]
Outputs:
  --o-clustered-table ARTIFACT FeatureTable[Frequency]
                          The table following clustering of features.
                                                                    [required]
  --o-clustered-sequences ARTIFACT FeatureData[Sequence]
                          Sequences representing clustered features.
                                                                    [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action. Or silence output
                          if execution is successful (silence is golden).
  --citations             Show citations and exit.
  --help                  Show this message and exit.

                  There were some problems with the command:                  
 (1/2) Invalid value for "--i-sequences": 'rep-seqs_tmp.qza' is not a valid
  filepath
 (2/2) Invalid value for "--i-table": 'table_tmp.qza' is not a valid filepath
rm: cannot remove 'table_tmp.qza': No such file or directory
rm: cannot remove 'rep-seqs_tmp.qza': No such file or directory
Usage: qiime demux summarize [OPTIONS]

  Summarize counts per sample for all samples, and generate interactive
  positional quality plots based on `n` randomly selected sequences.

Inputs:
  --i-data ARTIFACT SampleData[SequencesWithQuality |
    PairedEndSequencesWithQuality | JoinedSequencesWithQuality]
                       The demultiplexed sequences to be summarized.
                                                                    [required]
Parameters:
  --p-n INTEGER        The number of sequences that should be selected at
                       random for quality score plots. The quality plots will
                       present the average positional qualities across all of
                       the sequences selected. If input sequences are paired
                       end, plots will be generated for both forward and
                       reverse reads for the same `n` sequences.
                                                              [default: 10000]
Outputs:
  --o-visualization VISUALIZATION
                                                                    [required]
Miscellaneous:
  --output-dir PATH    Output unspecified results to a directory
  --verbose / --quiet  Display verbose output to stdout and/or stderr during
                       execution of this action. Or silence output if
                       execution is successful (silence is golden).
  --citations          Show citations and exit.
  --help               Show this message and exit.

                    There was a problem with the command:                     
 (1/1) Invalid value for "--i-data": 'sequences.qza' is not a valid filepath
There was an issue with loading the file /data2/analyses/MetONTIIME/sample-metadata.tsv as metadata:

  There was an issue with loading the metadata file:

  Metadata must contain at least one ID.

  There may be more errors present in the metadata file. To get a full report, sample/feature metadata files can be validated with Keemei: https://keemei.qiime2.org

  Find details on QIIME 2 metadata requirements here: https://docs.qiime2.org/2019.7/tutorials/metadata/

Usage: qiime feature-table tabulate-seqs [OPTIONS]

  Generate tabular view of feature identifier to sequence mapping, including
  links to BLAST each sequence against the NCBI nt database.

Inputs:
  --i-data ARTIFACT FeatureData[Sequence]
                       The feature sequences to be tabulated.       [required]
Outputs:
  --o-visualization VISUALIZATION
                                                                    [required]
Miscellaneous:
  --output-dir PATH    Output unspecified results to a directory
  --verbose / --quiet  Display verbose output to stdout and/or stderr during
                       execution of this action. Or silence output if
                       execution is successful (silence is golden).
  --citations          Show citations and exit.
  --help               Show this message and exit.

                    There was a problem with the command:                     
 (1/1) Invalid value for "--i-data": 'rep-seqs.qza' is not a valid filepath
Usage: qiime feature-classifier classify-consensus-blast [OPTIONS]

  Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local
  alignment between query and reference_reads, then assigns consensus
  taxonomy to each query sequence from among maxaccepts hits, min_consensus
  of which share that taxonomic assignment. Note that maxaccepts selects the
  first N hits with > perc_identity similarity to query, not the top N
  matches. For top N hits, use classify-consensus-vsearch.

Inputs:
  --i-query ARTIFACT FeatureData[Sequence]
                          Sequences to classify taxonomically.      [required]
  --i-reference-reads ARTIFACT FeatureData[Sequence]
                          reference sequences.                      [required]
  --i-reference-taxonomy ARTIFACT FeatureData[Taxonomy]
                          reference taxonomy labels.                [required]
Parameters:
  --p-maxaccepts INTEGER  Maximum number of hits to keep for each query. Must
    Range(1, None)        be in range [1, infinity]. BLAST will choose the
                          first N hits in the reference database that exceed
                          perc-identity similarity to query.     [default: 10]
  --p-perc-identity PROPORTION Range(0.0, 1.0, inclusive_end=True)
                          Reject match if percent identity to query is lower.
                          Must be in range [0.0, 1.0].          [default: 0.8]
  --p-query-cov PROPORTION Range(0.0, 1.0, inclusive_end=True)
                          Reject match if query alignment coverage per
                          high-scoring pair is lower. Note: this uses blastn's
                          qcov_hsp_perc parameter, and may not behave
                          identically to the query-cov parameter used by
                          classify-consensus-vsearch. Must be in range [0.0,
                          1.0].                                 [default: 0.8]
  --p-strand TEXT Choices('both', 'plus', 'minus')
                          Align against reference sequences in forward
                          ("plus"), reverse ("minus"), or both directions
                          ("both").                          [default: 'both']
  --p-evalue NUMBER       BLAST expectation value (E) threshold for saving
                          hits.                               [default: 0.001]
  --p-min-consensus NUMBER Range(0.5, 1.0, inclusive_start=False,
    inclusive_end=True)   Minimum fraction of assignments must match top hit
                          to be accepted as consensus assignment. Must be in
                          range (0.5, 1.0].                    [default: 0.51]
  --p-unassignable-label TEXT
                                                       [default: 'Unassigned']
Outputs:
  --o-classification ARTIFACT FeatureData[Taxonomy]
                          Taxonomy classifications of query sequences.
                                                                    [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action. Or silence output
                          if execution is successful (silence is golden).
  --citations             Show citations and exit.
  --help                  Show this message and exit.

                  There were some problems with the command:                  
 (1/3) Invalid value for "--i-query": 'rep-seqs.qza' is not a valid filepath
 (2/3) Invalid value for "--i-reference-reads":
  'PRJNA33175_Bacterial_sequences_sequence.qza' is not a valid filepath
 (3/3) Invalid value for "--i-reference-taxonomy":
  'PRJNA33175_Bacterial_sequences_taxonomy.qza' is not a valid filepath
There was an issue with loading the file taxonomy.qza as metadata:

  Metadata file path doesn't exist, or the path points to something other than a file. Please check that the path exists, has read permissions, and points to a regular file (not a directory): taxonomy.qza

  There may be more errors present in the metadata file. To get a full report, sample/feature metadata files can be validated with Keemei: https://keemei.qiime2.org

  Find details on QIIME 2 metadata requirements here: https://docs.qiime2.org/2019.7/tutorials/metadata/

There was an issue with loading the file /data2/analyses/MetONTIIME/sample-metadata.tsv as metadata:

  There was an issue with loading the metadata file:

  Metadata must contain at least one ID.

  There may be more errors present in the metadata file. To get a full report, sample/feature metadata files can be validated with Keemei: https://keemei.qiime2.org

  Find details on QIIME 2 metadata requirements here: https://docs.qiime2.org/2019.7/tutorials/metadata/

Usage: qiime taxa collapse [OPTIONS]

  Collapse groups of features that have the same taxonomic assignment
  through the specified level. The frequencies of all features will be
  summed when they are collapsed.

Inputs:
  --i-table ARTIFACT FeatureTable[Frequency]
                         Feature table to be collapsed.             [required]
  --i-taxonomy ARTIFACT FeatureData[Taxonomy]
                         Taxonomic annotations for features in the provided
                         feature table. All features in the feature table must
                         have a corresponding taxonomic annotation. Taxonomic
                         annotations that are not present in the feature table
                         will be ignored.                           [required]
Parameters:
  --p-level INTEGER      The taxonomic level at which the features should be
                         collapsed. All ouput features will have exactly this
                         many levels of taxonomic annotation.       [required]
Outputs:
  --o-collapsed-table ARTIFACT FeatureTable[Frequency]
                         The resulting feature table, where all features are
                         now taxonomic annotations with the user-specified
                         number of levels.                          [required]
Miscellaneous:
  --output-dir PATH      Output unspecified results to a directory
  --verbose / --quiet    Display verbose output to stdout and/or stderr
                         during execution of this action. Or silence output if
                         execution is successful (silence is golden).
  --citations            Show citations and exit.
  --help                 Show this message and exit.

                  There were some problems with the command:                  
 (1/2) Invalid value for "--i-table": 'table.qza' is not a valid filepath
 (2/2) Invalid value for "--i-taxonomy": 'taxonomy.qza' is not a valid
  filepath
Usage: qiime tools export [OPTIONS]

  Exporting extracts (and optionally transforms) data stored inside an
  Artifact or Visualization. Note that Visualizations cannot be transformed
  with --output-format

Options:
  --input-path ARTIFACT/VISUALIZATION
                        Path to file that should be exported        [required]
  --output-path PATH    Path to file or directory where data should be
                        exported to                                 [required]
  --output-format TEXT  Format which the data should be exported as. This
                        option cannot be used with Visualizations
  --help                Show this message and exit.

                    There was a problem with the command:                     
 (1/1) Invalid value for "--input-path": File "table_collapsed.qza" does not
  exist.
mv: cannot stat 'feature-table.biom': No such file or directory
Usage: biom convert [OPTIONS]
Try "biom convert -h" for help.

Error: Invalid value for "-i" / "--input-fp": File "feature-table_absfreq.biom" does not exist.
Usage: qiime feature-table relative-frequency [OPTIONS]

  Convert frequencies to relative frequencies by dividing each frequency in
  a sample by the sum of frequencies in that sample.

Inputs:
  --i-table ARTIFACT FeatureTable[Frequency]
                       The feature table to be converted into relative
                       frequencies.                                 [required]
Outputs:
  --o-relative-frequency-table ARTIFACT FeatureTable[RelativeFrequency]
                       The resulting relative frequency feature table.
                                                                    [required]
Miscellaneous:
  --output-dir PATH    Output unspecified results to a directory
  --verbose / --quiet  Display verbose output to stdout and/or stderr during
                       execution of this action. Or silence output if
                       execution is successful (silence is golden).
  --citations          Show citations and exit.
  --help               Show this message and exit.

                    There was a problem with the command:                     
 (1/1) Invalid value for "--i-table": 'table_collapsed.qza' is not a valid
  filepath
There was an issue with loading the file table_collapsed_relfreq.qza as metadata:

  Metadata file path doesn't exist, or the path points to something other than a file. Please check that the path exists, has read permissions, and points to a regular file (not a directory): table_collapsed_relfreq.qza

  There may be more errors present in the metadata file. To get a full report, sample/feature metadata files can be validated with Keemei: https://keemei.qiime2.org

  Find details on QIIME 2 metadata requirements here: https://docs.qiime2.org/2019.7/tutorials/metadata/

Usage: qiime tools export [OPTIONS]

  Exporting extracts (and optionally transforms) data stored inside an
  Artifact or Visualization. Note that Visualizations cannot be transformed
  with --output-format

Options:
  --input-path ARTIFACT/VISUALIZATION
                        Path to file that should be exported        [required]
  --output-path PATH    Path to file or directory where data should be
                        exported to                                 [required]
  --output-format TEXT  Format which the data should be exported as. This
                        option cannot be used with Visualizations
  --help                Show this message and exit.

                    There was a problem with the command:                     
 (1/1) Invalid value for "--input-path": File "table_collapsed_relfreq.qza"
  does not exist.
mv: cannot stat 'feature-table.biom': No such file or directory
Usage: biom convert [OPTIONS]
Try "biom convert -h" for help.

Error: Invalid value for "-i" / "--input-fp": File "feature-table_relfreq.biom" does not exist.
cat: feature-table_absfreq.tsv: No such file or directory
cat: feature-table_absfreq.tsv: No such file or directory

MaestSi commented 5 years ago

Hi, I think the issue here is that your filenames should end with .fastq.gz and not .fq.gz. Let me know if this solves the issue!
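
A minimal sketch of the rename, assuming the reads sit directly in the working directory and only the extension needs to change. The function name rename_fq_to_fastq is hypothetical and is not part of MetONTIIME.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MetONTIIME): rename every *.fq.gz file
# in the given directory to *.fastq.gz.
rename_fq_to_fastq() {
    local dir="$1"
    local f
    for f in "$dir"/*.fq.gz; do
        # When nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        # -n refuses to overwrite an existing .fastq.gz file.
        mv -n -- "$f" "${f%.fq.gz}.fastq.gz"
    done
}
```

For the files in this thread the call would be rename_fq_to_fastq /data2/analyses/MetONTIIME, which turns BC01.fq.gz into BC01.fastq.gz and so on.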

splaisan commented 5 years ago

It looks like you are right. I had also thought about this, and after renaming the files the analysis started on one core. It is still running; I will monitor it from home over the weekend to see how far it progresses. Best, Stephane

MaestSi commented 5 years ago

Dereplication, which is used to obtain the appropriate artifacts, should run on multiple cores, but unfortunately QIIME2 offers no multi-threading option for the BLAST classifier. An alternative might be to use vsearch instead, which is reported to achieve similar performance but, as far as I know, performs global alignment. However, since my aim was just to reproduce the 16S EPI2ME pipeline, I did not perform any further tests on parameters or classifiers.

splaisan commented 5 years ago

Good morning Simone,

I ran the pipeline again, this time on 10% of each set to go faster, and it seems that something went wrong at the end, as some of the terminal output looks like complaints. Could you please check what this could be?

The run used MetONTIIME-v1.4.0; my R config and the terminal output are attached:

config_MinION_mobile_lab.R.txt run_percent.txt

Best, Stephane

MaestSi commented 5 years ago

Hi Stephane, I guess the problem is that you are running from the directory /opt/biotools/MetONTIIME, but the two artifacts are not in the working directory /data2/analyses/MetONTIIME. The configuration file is not loaded when you run only the last part of the pipeline, as you are doing, so you should either specify the full path to the PRJNA33175_Bacterial_sequences_sequence.qza and PRJNA33175_Bacterial_sequences_taxonomy.qza artifacts or copy them into the working directory. Moreover (though this is unrelated to the issue), I saw that your config file contains PIPELINE_DIR: /opt/biotools/MetONTIIME and MINICONDA_DIR: /opt/biotools/miniconda3/, but ':' is not valid R syntax; the other lines of the file assign with <-.
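
As a rough sketch of the full-path variant, the snippet below checks that both artifacts resolve before launching. The paths and the thread count of 84 are copied from earlier in this thread; require_files is a hypothetical helper, not part of MetONTIIME, and running the script from the install folder simply mirrors the original command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every argument is a readable regular file.
require_files() {
    local f status=0
    for f in "$@"; do
        if [ ! -f "$f" ] || [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Paths taken from this thread; adjust them to your installation.
PIPELINE_DIR=/opt/biotools/MetONTIIME
WORKDIR=/data2/analyses/MetONTIIME
DB="$PIPELINE_DIR/PRJNA33175_Bacterial_sequences_sequence.qza"
TAXONOMY="$PIPELINE_DIR/PRJNA33175_Bacterial_sequences_taxonomy.qza"

# Launch only when both artifacts exist, passing absolute paths throughout.
if require_files "$DB" "$TAXONOMY"; then
    (cd "$PIPELINE_DIR" && ./MetONTIIME.sh "$WORKDIR" \
        "$WORKDIR/sample-metadata.tsv" "$DB" "$TAXONOMY" 84)
fi
```

With absolute paths the artifacts are found regardless of which directory the pipeline switches to while it runs.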

splaisan commented 5 years ago

I think I am a 'budello cieco' (a 'blind gut'), as my uncle used to say! Thanks, Simone, for the nice catches, and sorry for these stupid mistakes! I was probably too deep into Snakemake when I edited the config ;-) I have now re-run the command with full paths. Let's see what comes out! Stephane

MaestSi commented 5 years ago

I guess you were able to run the MetONTIIME.sh script starting from the fastq files, right? If so, I am going to close the issue. Just in case someone else needs it in the future, I am posting below the command used to run the pipeline starting from fastq files (the BC\<numbers>.fastq.gz files must already be in the \<working dir> before running it):

source activate MetONTIIME_env
nohup ./MetONTIIME.sh <working_dir> </path/to/metadata file> </path/to/sequences qiime2 artifact> </path/to/taxonomy qiime2 artifact> <threads> &

Simone

MaestSi commented 5 years ago

From MetONTIIME-v1.5 on, the command would be:

source activate MetONTIIME_env
nohup ./MetONTIIME.sh <working_dir> </path/to/metadata file> </path/to/sequences qiime2 artifact> </path/to/taxonomy qiime2 artifact> <threads> <Taxonomic classifier> &

with \<Taxonomic classifier> being Blast or Vsearch.
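
A small sketch of a guard for that last argument, assuming the two names must be spelled exactly as written above (whether the script accepts other capitalisations is not stated here). The function valid_classifier is hypothetical and not part of MetONTIIME.

```shell
#!/usr/bin/env bash
# Hypothetical helper: accept only the two classifier names given in this thread.
# Exact-case matching is an assumption, not documented behaviour.
valid_classifier() {
    case "$1" in
        Blast|Vsearch) return 0 ;;
        *) echo "unknown classifier: $1 (expected Blast or Vsearch)" >&2
           return 1 ;;
    esac
}

CLASSIFIER=Vsearch
if valid_classifier "$CLASSIFIER"; then
    echo "classifier OK: $CLASSIFIER"
fi
```

Once the check passes, "$CLASSIFIER" would be passed as the sixth argument to ./MetONTIIME.sh, after the thread count.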