epi2me-labs / wf-metagenomics

Metagenomic classification of long-read sequencing data

[Bug]: Error executing process > 'minimap_pipeline:minimap (1)' #46

Closed Matth-Cbn closed 1 year ago

Matth-Cbn commented 1 year ago

What happened?

Hey there,

I've seen a similar error among your issues, but it doesn't quite match my case. I am using your metagenomics workflow for my analyses and have run into trouble. I use my own database (Silva138.1) and my own SeqId2taxid file, and with these the minimap pipeline fails. Below are my parameters and the log output so that you can point me in the right direction.

Thank you for your help. If you have any questions, or need more detail about what I am doing, please do not hesitate to ask. Sincerely,

Operating System

ubuntu 20.04

Workflow Execution

EPI2ME Labs desktop application

Workflow Execution - EPI2ME Labs Versions

No response

Workflow Execution - CLI Execution Profile

None

Workflow Version

wf-metagenomics v2.2.1

Relevant log output

{
  fastq: /media/stage/CL1/Stage/GD Biotech/data/Barcode01,
  classifier: minimap2,
  analyse_unclassified: true,
  database_set: ncbi_16s_18s,
  store_dir: store_dir,
  reference: /media/stage/CL1/Stage/GD Biotech/Database/silva_138.fna,
  bracken_level: S,
  port: 8080,
  host: localhost,
  out_dir: /home/stage/epi2melabs/instances/wf-metagenomics_18f085b8-5883-4bb2-a686-3870d380eb3d/output,
  min_len: 200,
  max_len: 2000,
  threads: 4,
  server_threads: 8,
  kraken_clients: 2,
  wf: {
    agent: epi2melabs/5.0.2
  }
}

  runName             : Silva138_minimap2
  containerEngine     : docker
  launchDir           : /home/stage/epi2melabs/instances/wf-metagenomics_18f085b8-5883-4bb2-a686-3870d380eb3d
  workDir             : /home/stage/epi2melabs/instances/wf-metagenomics_18f085b8-5883-4bb2-a686-3870d380eb3d/work
  projectDir          : /home/stage/epi2melabs/workflows/epi2me-labs/wf-metagenomics
  userName            : stage
  profile             : standard
  configFiles         : /home/stage/epi2melabs/workflows/epi2me-labs/wf-metagenomics/nextflow.config
Input Options
  fastq               : /media/stage/CL1/Stage/GD Biotech/data/Barcode01
  classifier          : minimap2
  analyse_unclassified: true
Reference Options
  reference           : /media/stage/CL1/Stage/GD Biotech/Database/silva_138.fna
  database_sets       : [ncbi_16s_18s:[reference:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_16s_18s.fna, refindex:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_16s_18s.fna.fai, database:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_kraken2.tar.gz, kmer_dist:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/database1000mers.kmer_distrib, ref2taxid:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ref2taxid.targloci.tsv, taxonomy:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2023-01-01.zip], ncbi_16s_18s_28s_ITS:[reference:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ncbi_16s_18s_28s_ITS.fna, refindex:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ncbi_16s_18s_28s_ITS.fna.fai, database:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ncbi_16s_18s_28s_ITS_kraken2.tar.gz, kmer_dist:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/database1000mers.kmer_distrib, ref2taxid:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ref2taxid.ncbi_16s_18s_28s_ITS.tsv, taxonomy:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2023-01-01.zip], PlusPF-8:[database:https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_08gb_20230314.tar.gz, taxonomy:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/new_taxdump_2023-03-01.zip], PlusPFP-8:[database:https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_08gb_20230314.tar.gz, taxonomy:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/new_taxdump_2023-03-01.zip]]
Output Options
  out_dir             : /home/stage/epi2melabs/instances/wf-metagenomics_18f085b8-5883-4bb2-a686-3870d380eb3d/output
Advanced Options
  min_len             : 200
  max_len             : 2000
  threads             : 4
  server_threads      : 8
Other parameters
  process_label       : wfmetagenomics
!! Only displaying parameters that differ from the pipeline defaults !!
--------------------------------------------------------------------------------
If you use epi2me-labs/wf-metagenomics for your analysis please cite:
* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x
--------------------------------------------------------------------------------
This is epi2me-labs/wf-metagenomics v2.2.1.
--------------------------------------------------------------------------------
Checking inputs.
Checking custom reference exists
Checking custom reference index exists
Checking fastq input.
[41/e1c32c] Submitted process > minimap_pipeline:getVersions
[4f/e218b7] Submitted process > minimap_pipeline:getParams
[5a/e6005f] Submitted process > fastcat (1)
[c5/aa803e] Submitted process > minimap_pipeline:output (1)
Staging foreign file: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2023-01-01.zip
[7e/e55451] Submitted process > minimap_pipeline:output (2)
[de/6ea1b1] Submitted process > minimap_pipeline:unpackTaxonomy
[3e/2ab6d4] Submitted process > minimap_pipeline:minimap (1)
ERROR ~ Error executing process > 'minimap_pipeline:minimap (1)'
Caused by:
  Process `minimap_pipeline:minimap (1)` terminated with an error exit status (1)
Command executed:
  minimap2 -t "4"  -ax map-ont "silva_138.fna" "seqs.fastq.gz"     | samtools view -h -F 2304 -     | workflow-glue format_minimap2 - -o "Barcode01.minimap2.assignments.tsv" -r "ref2taxid.targloci.tsv"     | samtools sort -o "Barcode01.bam" -
  samtools index "Barcode01.bam"
  awk -F '\t' '{print $3}' "Barcode01.minimap2.assignments.tsv" > taxids.tmp
  taxonkit         --data-dir "taxonomy_dir"         lineage -R taxids.tmp         | workflow-glue aggregate_lineages -p "Barcode01.minimap2"
  file1=`cat *.json`
  echo "{"'"Barcode01"'": "$file1"}" >> temp
  cp "temp" "Barcode01.json"
Command exit status:
  1
Command output:
  (empty)
Command error:
  [M::mm_idx_gen::18.615*1.47] collected minimizers
  [12:48:01 - workflow_glue] Starting entrypoint.
  [M::mm_idx_gen::22.608*1.91] sorted minimizers
  [M::main::22.953*1.89] loaded/built the index for 510508 target sequence(s)
  [M::mm_mapopt_update::23.151*1.88] mid_occ = 11257
  [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 510508
  [M::mm_idx_stat::23.288*1.88] distinct minimizers: 7252591 (61.92% are singletons); average occurrences: 18.622; average spacing: 5.534; total length: 747391099
  Traceback (most recent call last):
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3652, in get_loc
      return self._engine.get_loc(casted_key)
    File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
    File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
    File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
    File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
  KeyError: 'JN578465.1.1478'

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "/home/stage/epi2melabs/workflows/epi2me-labs/wf-metagenomics/bin/workflow-glue", line 7, in <module>
      cli()
    File "/home/stage/epi2melabs/workflows/epi2me-labs/wf-metagenomics/bin/workflow_glue/__init__.py", line 62, in cli
      args.func(args)
    File "/home/stage/epi2melabs/workflows/epi2me-labs/wf-metagenomics/bin/workflow_glue/format_minimap2.py", line 29, in main
      taxid = ref2taxid_df.at[aln.reference_name, 'taxid']
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/indexing.py", line 2412, in __getitem__
      return super().__getitem__(key)
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/indexing.py", line 2364, in __getitem__
      return self.obj._get_value(*key, takeable=self._takeable)
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/frame.py", line 3887, in _get_value
      row = self.index.get_loc(index)
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3654, in get_loc
      raise KeyError(key) from err
  KeyError: 'JN578465.1.1478'
Work dir:
  /home/stage/epi2melabs/instances/wf-metagenomics_18f085b8-5883-4bb2-a686-3870d380eb3d/work/3e/2ab6d46c70e85acb6475e05f183388
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
 -- Check '/home/stage/epi2melabs/instances/wf-metagenomics_18f085b8-5883-4bb2-a686-3870d380eb3d/nextflow.log' file for details

nggvs commented 1 year ago

Hi! Thank you for using the workflow and for providing the parameters; that information is really useful. I suspect that the taxonomy mapping (which gives the lineage information for each reference name) doesn't match the reference database (which contains the reference names and sequences). I'll check to make sure that this is what is happening here.
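
A quick way to test for this kind of mismatch is to compare the sequence IDs in the reference FASTA with the IDs in the mapping file. The following is only a minimal sketch using the Python standard library, not part of the workflow. The function names are hypothetical, and it assumes the mapping file is tab-separated with the reference ID in the first column. IDs are taken as the header text up to the first whitespace, since that is the name minimap2 reports for a reference.

```python
import csv
import gzip
from pathlib import Path


def _open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_ids(fasta_path):
    """Return the set of sequence IDs (header text up to the first whitespace)."""
    ids = set()
    with _open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def mapped_ids(ref2taxid_path):
    """Return the reference IDs found in the first column of the mapping file."""
    with _open_text(ref2taxid_path) as handle:
        return {row[0] for row in csv.reader(handle, delimiter="\t") if row}


def missing_reference_ids(fasta_path, ref2taxid_path):
    """Return the sorted IDs present in the FASTA but absent from the mapping."""
    return sorted(fasta_ids(fasta_path) - mapped_ids(ref2taxid_path))
```

Calling `missing_reference_ids("silva_138.fna", "ref2taxid.targloci.tsv")` on the two files staged for the failing process would list every reference that has no taxid row; an ID such as the one in the KeyError would be expected to show up there if the mapping does not belong to the reference.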

Matth-Cbn commented 1 year ago

Hi! Do you know how I can fix the problem, if that is possible? Thank you for your feedback, and for any further feedback if you find out more.

nggvs commented 1 year ago

Hi,

You should provide a tab-separated file through the --ref2taxid flag, containing the reference ID of each sequence and its taxid. For example, if your silva_138.fna contains the reference AYKI01000027.110818.112349,

then the ref2taxid file should contain the line: AYKI01000027.110818.112349 1352943

You can download the taxids from the SILVA website, but please note that SILVA uses different taxids from NCBI. So there are two options: use the file taxmap_embl-ebi_ena_ssu_ref_138.1.txt to extract the NCBI taxids, or, if you use the SILVA taxids, use --taxonomy to provide NCBI-style taxdump files for a custom taxonomy that suits your custom database.
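
As a rough illustration of the first option, the sketch below turns a taxmap file into a two-column, tab-separated mapping without a header row. It makes several assumptions that should be checked against your own copy of the file: that the taxmap is tab-separated with a header line, and that its columns are named `primaryAccession`, `start`, `stop` and `ncbi_taxonid`. Those names are passed as parameters because they are assumptions, not something confirmed in this thread. It also assumes that the reference IDs have the form accession.start.stop, as in the example above. The function name is hypothetical.

```python
import csv


def build_ref2taxid(taxmap_path, out_path,
                    acc_col="primaryAccession", start_col="start",
                    stop_col="stop", taxid_col="ncbi_taxonid"):
    """Write '<accession>.<start>.<stop><TAB><taxid>' rows; return the row count.

    The default column names are assumptions about the taxmap layout.
    """
    written = 0
    with open(taxmap_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        required = {acc_col, start_col, stop_col, taxid_col}
        missing = required - set(reader.fieldnames or [])
        if missing:
            # Fail early instead of writing a mapping with wrong columns.
            raise ValueError(
                f"taxmap is missing expected columns: {sorted(missing)}"
            )
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            # Rebuild the ID in the same form as the FASTA headers.
            ref_id = f"{row[acc_col]}.{row[start_col]}.{row[stop_col]}"
            writer.writerow([ref_id, row[taxid_col]])
            written += 1
    return written
```

For example, `build_ref2taxid("taxmap_embl-ebi_ena_ssu_ref_138.1.txt", "ref2taxid.tsv")` would write one line per taxmap row. It is worth checking afterwards that the IDs in the output really match the headers of silva_138.fna before passing the file to the workflow.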

Please let us know if the problem is still not solved.

Matth-Cbn commented 1 year ago

Thank you for your answer. First, I would like to clarify that we launch the metagenomics pipeline from the application, not from the command line. We then looked into your recommendations. The KeyError in the log from my first message is 'JN578465.1.1478'. When we look up JN578465.1.1478 in the SILVA database, we find the taxonomy of a bacterium, Streptococcus anginosus. When we look it up in the file seqid2ncbitaxid.tsv, we also find the taxid of this bacterium. So the taxonomic references are present in our database as well. I am attaching a capture of the terminal commands we used to check this, so that you can see for yourself. We are not sure what to do next.

(Attachment: terminal capture)

nggvs commented 1 year ago

Hi, I'll try to reproduce it; I may be missing something.

Just to be sure, which options are you using in the app? From your log file I can see that you're using your own reference, but I don't see the ref2taxid input option being set. In the app it is under the minimap2 options, as ref2taxid, and you would have to point it to the file seqid2ncbitaxid.tsv; in the log it should then appear as "--ref2taxid". If it is not provided, the workflow uses the default one, which does not match the SILVA references.

{
  fastq: /media/stage/CL1/Stage/GD Biotech/data/Barcode01,
  classifier: minimap2,
  analyse_unclassified: true,
  database_set: ncbi_16s_18s,
  store_dir: store_dir,
  reference: /media/stage/CL1/Stage/GD Biotech/Database/silva_138.fna,
  bracken_level: S,
  port: 8080,
  host: localhost,
  out_dir: /home/stage/epi2melabs/instances/wf-metagenomics_18f085b8-5883-4bb2-a686-3870d380eb3d/output,
  min_len: 200,
  max_len: 2000,
  threads: 4,
  server_threads: 8,
  kraken_clients: 2,
  wf: {
    agent: epi2melabs/5.0.2
  }
}

Matth-Cbn commented 1 year ago

I use the Reference Options (the reference parameter, pointing to the SILVA database) and the Minimap2 Options with ref2taxid. I had already tried this option; I have just run it again with ref2taxid, but the error persists.

Core Nextflow options
  runName        : Test_minimap2_silva138
  containerEngine: docker
  launchDir      : /home/stage/epi2melabs/instances/wf-metagenomics_37c5bddf-85ba-412e-b360-b21db70a9edf
  workDir        : /home/stage/epi2melabs/instances/wf-metagenomics_37c5bddf-85ba-412e-b360-b21db70a9edf/work
  projectDir     : /home/stage/epi2melabs/workflows/epi2me-labs/wf-metagenomics
  userName       : stage
  profile        : standard
  configFiles    : /home/stage/epi2melabs/workflows/epi2me-labs/wf-metagenomics/nextflow.config
Input Options
  fastq          : /media/stage/CL1/Stage/GD Biotech/data/Barcode01
  classifier     : minimap2
Reference Options
  reference      : /media/stage/CL1/Stage/GD Biotech/Database/silva_138.fna
  database_sets  : [ncbi_16s_18s:[reference:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_16s_18s.fna, refindex:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_16s_18s.fna.fai, database:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_kraken2.tar.gz, kmer_dist:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/database1000mers.kmer_distrib, ref2taxid:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ref2taxid.targloci.tsv, taxonomy:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2023-01-01.zip], ncbi_16s_18s_28s_ITS:[reference:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ncbi_16s_18s_28s_ITS.fna, refindex:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ncbi_16s_18s_28s_ITS.fna.fai, database:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ncbi_16s_18s_28s_ITS_kraken2.tar.gz, kmer_dist:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/database1000mers.kmer_distrib, ref2taxid:https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ref2taxid.ncbi_16s_18s_28s_ITS.tsv, taxonomy:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2023-01-01.zip], PlusPF-8:[database:https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_08gb_20230314.tar.gz, taxonomy:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/new_taxdump_2023-03-01.zip], PlusPFP-8:[database:https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_08gb_20230314.tar.gz, taxonomy:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/new_taxdump_2023-03-01.zip]]
Minimap2 Options
  ref2taxid      : /media/stage/CL1/Stage/GD Biotech/Database/seqid2ncbitaxid.tsv
Output Options
  out_dir        : /home/stage/epi2melabs/instances/wf-metagenomics_37c5bddf-85ba-412e-b360-b21db70a9edf/output
Other parameters
  process_label  : wfmetagenomics
!! Only displaying parameters that differ from the pipeline defaults !!
--------------------------------------------------------------------------------
If you use epi2me-labs/wf-metagenomics for your analysis please cite:
* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x
--------------------------------------------------------------------------------
This is epi2me-labs/wf-metagenomics v2.2.1.
--------------------------------------------------------------------------------
Checking inputs.
Checking custom reference exists
Checking custom reference index exists
Checking custom ref2taxid mapping exists
Checking fastq input.
[1f/0f955e] Submitted process > minimap_pipeline:getParams
[13/82e9f5] Submitted process > minimap_pipeline:getVersions
[40/d7ff58] Submitted process > fastcat (1)
[1b/0a5c53] Submitted process > minimap_pipeline:output (1)
Staging foreign file: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2023-01-01.zip
[f5/b9724f] Submitted process > minimap_pipeline:output (2)
[26/db1871] Submitted process > minimap_pipeline:unpackTaxonomy
[c2/715f89] Submitted process > minimap_pipeline:minimap (1)
[cd/5345e8] Submitted process > minimap_pipeline:makeReport (1)
ERROR ~ Error executing process > 'minimap_pipeline:makeReport (1)'
Caused by:
  Process `minimap_pipeline:makeReport (1)` terminated with an error exit status (1)
Command executed:
  workflow-glue report         wf-metagenomics-report.html         --versions versions         --params params.json         --stats per-read-stats.tsv         --lineages lineages         --pipeline "minimap"
Command exit status:
  1
Command output:
  (empty)
Command error:
  [14:29:38 - workflow_glue] Starting entrypoint.
  [14:29:39 - Plotter   ] Cannot correct axis labels in complicated scenarios.
  [14:29:39 - Plotter   ] Cannot correct axis labels in complicated scenarios.
  [14:29:39 - Plotter   ] Cannot correct axis labels in complicated scenarios.
  [14:29:39 - Plotter   ] Cannot correct axis labels in complicated scenarios.
  [14:29:39 - Plotter   ] Cannot correct axis labels in complicated scenarios.
  Traceback (most recent call last):
    File "/home/stage/epi2melabs/workflows/epi2me-labs/wf-metagenomics/bin/workflow-glue", line 7, in <module>
      cli()
    File "/home/stage/epi2melabs/workflows/epi2me-labs/wf-metagenomics/bin/workflow_glue/__init__.py", line 62, in cli
      args.func(args)
    File "/home/stage/epi2melabs/workflows/epi2me-labs/wf-metagenomics/bin/workflow_glue/report.py", line 119, in main
      plt = ezc.barplot(
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/ezcharts/plots/categorical.py", line 67, in barplot
      data = data.pivot(
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/frame.py", line 8424, in pivot
      return pivot(self, index=index, columns=columns, values=values)
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/reshape/pivot.py", line 557, in pivot
      result = indexed.unstack(columns_listlike)  # type: ignore[arg-type]
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/series.py", line 4309, in unstack
      return unstack(self, level, fill_value)
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 488, in unstack
      unstacker = _Unstacker(
    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 136, in __init__
      self._make_selectors()

    File "/home/epi2melabs/conda/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 188, in _make_selectors
      raise ValueError("Index contains duplicate entries, cannot reshape")
  ValueError: Index contains duplicate entries, cannot reshape
Work dir:
  /home/stage/epi2melabs/instances/wf-metagenomics_37c5bddf-85ba-412e-b360-b21db70a9edf/work/cd/5345e89c3a067497644ddd9f188da0
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
 -- Check '/home/stage/epi2melabs/instances/wf-metagenomics_37c5bddf-85ba-412e-b360-b21db70a9edf/nextflow.log' file for details

nggvs commented 1 year ago

Hi, I apologize for the late answer. This should be fixed in the latest version (2.3.0). You can also now use the SILVA database (although please note that its taxids differ from NCBI's and that it only reaches the genus rank).
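
For anyone who meets the same traceback on an older version: "Index contains duplicate entries, cannot reshape" is the error pandas raises when `DataFrame.pivot` finds two rows with the same index/column pair. The thread does not say which rows collided in this run, so the sketch below only reproduces that class of error with made-up data and shows one way to locate and merge such rows. The column names and helper functions are hypothetical and are not taken from the workflow's report code.

```python
import pandas as pd

# Made-up counts: the ("Barcode01", "Streptococcus") pair appears twice.
counts = pd.DataFrame(
    {
        "sample": ["Barcode01", "Barcode01", "Barcode01"],
        "taxon": ["Streptococcus", "Streptococcus", "Bacillus"],
        "reads": [10, 5, 7],
    }
)


def find_duplicate_pairs(df, index, columns):
    """Return every row whose (index, columns) pair occurs more than once."""
    return df[df.duplicated(subset=[index, columns], keep=False)]


def safe_pivot(df, index, columns, values):
    """Sum the values of duplicate pairs first, then pivot."""
    merged = df.groupby([index, columns], as_index=False)[values].sum()
    return merged.pivot(index=index, columns=columns, values=values)


# A plain pivot fails on the duplicated pair.
try:
    counts.pivot(index="sample", columns="taxon", values="reads")
    message = ""
except ValueError as err:
    message = str(err)

duplicates = find_duplicate_pairs(counts, "sample", "taxon")
table = safe_pivot(counts, "sample", "taxon", "reads")
```

Whether summing is the right way to merge duplicate rows depends on what the rows represent; here it is only meant to show that the pivot succeeds once each pair is unique.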

nggvs commented 1 year ago

Hi, would you mind confirming whether this problem persists? If it has been solved, please feel free to close the issue.