genomicsITER / NanoCLUST

NanoCLUST is an analysis pipeline for UMAP-based classification of amplicon-based full-length 16S rRNA nanopore reads
MIT License

Create a database (DB) from a custom database. #88

Open miangher opened 1 year ago

miangher commented 1 year ago

Good morning! I need your help. Starting from the leBIBI database "leBIBI IV SSU-rDNA (16S) Automated ProKaryotes Phylogeny", I have tried to generate the database files that NanoCLUST needs for its analysis. I used makeblastdb from BLAST+ 2.13.0 to try to obtain files with the following extensions: .ndb, .nhr, .nin, .nnd, .nni, .nog, .nos, .not, .nsq, .ntf, .nto. However, two of them are never created: .nnd and .nni. When I run the pipeline, I get the following error:

(Nextflow) cnr-strep@cnrstrep-Precision-3660:~/NanoCLUST$ nextflow run main.nf -profile docker --reads '/media/cnr-strep/ACC22AE1C22AB00E/FastqHAC-Lactobacillus/FastQ_Bichat/Fastq-HAC-16052022/barcode17/trimming/barcode17.filtered.fastq' --db 'db/16S_ribosomal_RNA' --tax 'db/taxdb/'
N E X T F L O W ~ version 22.10.6
Launching main.nf [determined_ampere] DSL1 - revision: 2a51687d92



NanoCLUST v1.0dev

Run Name       : determined_ampere
Reads          : /media/cnr-strep/ACC22AE1C22AB00E/FastqHAC-Lactobacillus/FastQ_Bichat/Fastq-HAC-16052022/barcode17/trimming/barcode17.filtered.fastq
Max Resources  : 128 GB memory, 16 cpus, 10d time per job
Container      : docker - [:]
Output dir     : ./results
Launch dir     : /home/cnr-strep/NanoCLUST
Working dir    : /home/cnr-strep/NanoCLUST/work
Script dir     : /home/cnr-strep/NanoCLUST
User           : cnr-strep
Config Profile : docker

executor > local (23)
[8b/15691c] process > QC (1) [100%] 1 of 1 ✔
[5e/36346c] process > fastqc (1) [100%] 1 of 1 ✔
[3c/3e9715] process > kmer_freqs (1) [100%] 1 of 1 ✔
[26/7c4085] process > read_clustering (1) [100%] 1 of 1 ✔
[3c/d03ee2] process > split_by_cluster (1) [100%] 1 of 1 ✔
[96/2c6b76] process > read_correction (3) [100%] 3 of 3 ✔
[bb/f3035a] process > draft_selection (3) [100%] 3 of 3 ✔
[21/1e894c] process > racon_pass (3) [100%] 3 of 3 ✔
[bf/28314d] process > medaka_pass (3) [100%] 3 of 3 ✔
[90/ad3c87] process > consensus_classification (3) [100%] 3 of 3 ✔
[07/d23aa0] process > join_results (1) [100%] 1 of 1 ✔
[4f/f929af] process > get_abundances (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > plot_abundances -
[fe/e2e60d] process > output_documentation [100%] 1 of 1 ✔
Execution cancelled -- Finishing pending tasks before exit
[nf-core/nanoclust] Pipeline completed with errors
WARN: Graphviz is required to render the execution DAG in the given format -- See http://www.graphviz.org for more info.
Error executing process > 'get_abundances (1)'

Caused by: Process get_abundances (1) terminated with an error exit status (1)

Command executed [/home/cnr-strep/NanoCLUST/templates/get_abundance.py]:

#!/usr/bin/env python

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc
import pandas as pd
from functools import reduce
import requests
import json

#https://unipept.ugent.be/apidocs/taxonomy
def get_taxname(tax_id,tax_level):
    tags = {"S": "species_name","G": "genus_name","F": "family_name","O":'order_name', "C": "class_name"}
    tax_level_tag = tags[tax_level]
    #Avoids pipeline crash due to "nan" classification output. Thanks to Qi-Maria from Github
    if str(tax_id) == "nan":
        tax_id = 1

    path = 'http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=' + str(int(tax_id)) + '&extra=true&names=true'
    complete_tax = requests.get(path).text

    #Checks for API correct response (field containing the tax name). Thanks to devinbrown from Github
    try:
        name = json.loads(complete_tax)[0][tax_level_tag]
    except:
        name = str(int(tax_id))

    return json.loads(complete_tax)[0][tax_level_tag]

def get_abundance_values(names,paths):
    dfs = []
    for name,path in zip(names,paths):
        data = pd.read_csv(path, index_col=False, sep=';').iloc[:,1:]

        total = sum(data['reads_in_cluster'])
        rel_abundance=[]

        for index,row in data.iterrows():
            rel_abundance.append(row['reads_in_cluster'] / total)

        data['rel_abundance'] = rel_abundance
        dfs.append(pd.DataFrame({'taxid': data['taxid'], 'rel_abundance': rel_abundance}))
        data.to_csv("" + name + "_nanoclust_out.txt")

    return dfs

def merge_abundance(dfs,tax_level):
    df_final = reduce(lambda left,right: pd.merge(left,right,on='taxid',how='outer').fillna(0), dfs)
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    df_final_grp = df_final.groupby(["taxid"], as_index=False).sum()
    return df_final_grp

def get_abundance(names,paths,tax_level):
    if(not isinstance(paths, list)):
        paths = [paths]
        names = [names]

    dfs = get_abundance_values(names,paths)
    df_final_grp = merge_abundance(dfs, tax_level)
    df_final_grp.to_csv("rel_abundance_"+ names[0] + "_" + tax_level + ".csv", index = False)

paths = "barcode17.filtered.nanoclust_out.txt"
names = "barcode17.filtered"

get_abundance(names,paths, "G")
get_abundance(names,paths, "S")
get_abundance(names,paths, "O")
get_abundance(names,paths, "F")

Command exit status: 1

Command output: (empty)

Command error:
  Traceback (most recent call last):
    File ".command.sh", line 65, in <module>
      get_abundance(names,paths, "G")
    File ".command.sh", line 59, in get_abundance
      df_final_grp = merge_abundance(dfs, tax_level)
    File ".command.sh", line 49, in merge_abundance
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 49, in <listcomp>
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 28, in get_taxname
      return json.loads(complete_tax)[0][tax_level_tag]
  IndexError: list index out of range

Work dir: /home/cnr-strep/NanoCLUST/work/4f/f929af73009d063bc5793e38804f62

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out
(Nextflow) cnr-strep@cnrstrep-Precision-3660:~/NanoCLUST$
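
Reading the traceback, the failing statement is the final return in get_taxname: it indexes the API response a second time, outside the try block, so an empty response list still raises IndexError even though a fallback name was already computed. Below is a minimal sketch of a more defensive lookup. It assumes (this is not confirmed) that the taxids produced with the custom database are simply unknown to the Unipept API, so the response is an empty list. The helper name extract_taxname is hypothetical and not part of NanoCLUST.

```python
import json

# Same level-to-field mapping as in get_abundance.py.
TAGS = {"S": "species_name", "G": "genus_name", "F": "family_name",
        "O": "order_name", "C": "class_name"}


def extract_taxname(response_text, tax_id, tax_level):
    """Return the taxon name found in a Unipept taxonomy response.

    Hypothetical helper: falls back to the numeric taxid (as a string)
    when the response is empty, malformed, or lacks the requested field.
    tax_id is assumed to be non-NaN already, as in the original script.
    """
    fallback = str(int(tax_id))
    try:
        name = json.loads(response_text)[0][TAGS[tax_level]]
    except (ValueError, IndexError, KeyError, TypeError):
        # Invalid JSON, empty list, missing field, or non-list payload.
        return fallback
    # Assumption: an empty or null name should be treated as missing too.
    return name if name else fallback
```

With such a helper, the last line of get_taxname would become return extract_taxname(complete_tax, tax_id, tax_level). This only stops the crash; clusters would then be labelled with raw taxids, which does not answer whether the custom database carries usable taxonomy.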

Please, could you guide me on how to generate a database that can be interpreted by NanoCLUST from a FASTA file containing a list of selected 16S sequences?
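
To make the question concrete, here is a sketch of the kind of makeblastdb call in question, written in Python so that it only builds the command line and does not run BLAST+. The options -in, -dbtype, -parse_seqids, -taxid_map, -out and -title are standard makeblastdb options. The file names (lebibi_16S.fasta, seqid2taxid.txt, db/lebibi_16S) are placeholders, and whether NanoCLUST accepts a database built this way is exactly the open question.

```python
import shlex


def build_makeblastdb_cmd(fasta, out_prefix, taxid_map=None, title=None):
    """Build a makeblastdb command line for a nucleotide database.

    taxid_map, if given, is a text file with one "sequence_id taxid"
    pair per line; makeblastdb requires -parse_seqids when it is used.
    """
    cmd = ["makeblastdb", "-in", fasta, "-dbtype", "nucl",
           "-parse_seqids", "-out", out_prefix]
    if taxid_map:
        cmd += ["-taxid_map", taxid_map]
    if title:
        cmd += ["-title", title]
    return cmd


# Placeholder paths, for illustration only.
example = build_makeblastdb_cmd("lebibi_16S.fasta", "db/lebibi_16S",
                                taxid_map="seqid2taxid.txt",
                                title="leBIBI 16S custom")
# Print a shell-safe version that can be copied into a terminal.
print(shlex.join(example))
```

The resulting prefix would then be passed to the pipeline as --db 'db/lebibi_16S'. The taxids in the map would presumably need to be real NCBI taxids for the later name lookup to succeed, but that is an assumption.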

Thank you very much!

Miguel Angel Hernandez