genomicsITER / NanoCLUST

NanoCLUST is an analysis pipeline for UMAP-based classification of amplicon-based full-length 16S rRNA nanopore reads

ValueError: invalid literal for int() with base 10: 'Bradyrhizobium mercantei' #19

Closed devindrown closed 3 years ago

devindrown commented 3 years ago

Samples containing reads identified as "Bradyrhizobium mercantei" appear to break the pipeline during the abundance estimation step. In the consensus_classification.csv file, the affected line contains two scientific names and two taxids, as shown below:

Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;1654716;1904807;0.0;1412;99.079

Is this caused by the database, or does it happen as a result of the API call? Is there any way to fix it?

-Devin

More error output for context below

Run Name: nasty_almeida

####################################################
## nf-core/nanoclust execution completed unsuccessfully! ##
####################################################
The exit status of the task that caused the workflow execution to fail was: 1.
The full error message was:

Error executing process > 'get_abundances (6)'

Caused by:
  Process `get_abundances (6)` terminated with an error exit status (1)

Command executed [/data/NanoCLUST/templates/get_abundance.py]:

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  Traceback (most recent call last):
    File ".command.sh", line 55, in <module>
      get_abundance(names,paths, "C")
    File ".command.sh", line 49, in get_abundance
      df_final_grp = merge_abundance(dfs, tax_level)
    File ".command.sh", line 39, in merge_abundance
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 39, in <listcomp>
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 16, in get_taxname
      path = 'http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=' + str(int(tax_id)) + '&extra=true&names=true'
  ValueError: invalid literal for int() with base 10: 'Bradyrhizobium mercantei'

Work dir:
  /data/NanoCLUST/work/ae/a2eb5c00b570f7810d8dc2a60a7b5e

Pipeline Configuration:
-----------------------
 - Run Name: nasty_almeida
 - Reads: /data/PERM/PERM16S_20201028/sample_fasta/PERM16S_20201028.barcode*.qcreads.fastq
 - Max Resources: 128 GB memory, 16 cpus, 10d time per job
 - Container: docker - [:]
 - Output dir: /data/PERM/PERM16S_20201028/NanoCLUST.all
 - Launch dir: /data/NanoCLUST
 - Working dir: /data/NanoCLUST/work
 - Script dir: /data/NanoCLUST
 - User: dmdrown
 - Config Profile: docker
 - Date Started: 2020-12-03T06:51:55.682511-09:00
 - Date Completed: 2020-12-03T09:04:28.976429-09:00
 - Pipeline script file path: /data/NanoCLUST/main.nf
 - Pipeline script hash ID: 41a36e29b6db0c14a411b4f911c51f5e
 - Nextflow Version: 20.10.0
 - Nextflow Build: 5430
 - Nextflow Compile Timestamp: 01-11-2020 15:14 UTC

--
genomicsITER commented 3 years ago

Hi Devin, thank you again for your feedback,

It seems that your consensus_classification.csv has some errors (maybe left over from past executions that failed?). We recommend deleting all output files for that sample and rerunning the pipeline to see whether the same file issues appear.

consensus_classification.csv has the following structure:

id;reads_in_cluster;used_for_consensus;reads_after_corr;draft_id;sciname;taxid;length;per_ident

Example line:

13;1314;100;41;2c3fc50f-da48-44e1-a06b-74bec98aaf93 id=96;Escherichia coli str. K-12 substr. MG1655;511145;1474;99.932

It doesn't look like an API problem. However, if a tax_name entry contains special characters (especially ";", which we use as the field separator), the get_abundance process will fail. In any case, get_abundance and plot_abundances only perform calculations on the "main" output file, consensus_classification.csv, so that file is still useful for analysis.
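To illustrate the separator problem, here is a minimal Python sketch that checks whether the taxid column of each row is an integer before anything is passed to the API. It assumes the nine-column layout listed above. `find_bad_taxid_rows` is a hypothetical helper, not part of NanoCLUST, and the second sample row is invented for illustration.

```python
import csv

# Column layout of consensus_classification.csv as described above.
COLUMNS = [
    "id", "reads_in_cluster", "used_for_consensus", "reads_after_corr",
    "draft_id", "sciname", "taxid", "length", "per_ident",
]


def find_bad_taxid_rows(lines, sep=";"):
    """Hypothetical helper: return (row_number, taxid_value) for every data
    row whose taxid column is missing or not a plain integer."""
    bad = []
    reader = csv.DictReader(lines, fieldnames=COLUMNS, delimiter=sep)
    for number, row in enumerate(reader, start=1):
        # Short rows leave the column as None; treat that as an empty value.
        taxid = (row["taxid"] or "").strip()
        if not taxid.isdigit():
            bad.append((number, taxid))
    return bad


rows = [
    # Example line quoted above.
    "13;1314;100;41;2c3fc50f-da48-44e1-a06b-74bec98aaf93 id=96;"
    "Escherichia coli str. K-12 substr. MG1655;511145;1474;99.932",
    # Invented row: an extra ";" inside the name shifts every later column.
    "14;900;100;40;draft id=12;Genus alpha;Genus beta;111;222;1400;99.000",
]
print(find_bad_taxid_rows(rows))
```

A check like this only reports the shifted rows. It does not decide which name or taxid is the right one to keep.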

Thank you for the support, we hope you can get your data analyzed. Feel free to answer again with more information or open new issues.

devindrown commented 3 years ago

Thank you for your assistance. I've started a clean run with just a single input sample. The run still terminates with an error.

I looked in the reported working directory, /data/NanoCLUST/work/2b/1a079249e3a73f9c15717a38a58844

There are two files in that directory

PERM16S_20201028.barcode01.qcreads_nanoclust_out.txt 
PERM16S_20201028.barcode01.qcreads.nanoclust_out.txt

Most lines in those files look the same and follow the structure you described, but a single offending line includes two scientific names. For example, in PERM16S_20201028.barcode01.qcreads.nanoclust_out.txt:

58;1012;100;43;960ccf21-7860-45dd-969b-42a2de87e2b5 id=91;Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;1904807;0.0

Looking at the information in cluster 58

$more cluster58/consensus_classification.csv

Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;1654716;1904807;0.0;1411;99.150
Bradyrhizobium embrapense;630921;0.0;1411;99.150
Bradyrhizobium valentinum;1518501;0.0;1411;99.079
Bradyrhizobium jicamae;280332;0.0;1411;99.008
Bradyrhizobium elkanii;29448;0.0;1411;98.866

So it looks like the tax_name for this cluster contains two names with an offending ";" in between. Any suggestions on how to fix this?
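One possible workaround, sketched here in Python, is to collapse such a row to its first name and first taxid. This is only an illustration: `keep_first_hit` is a hypothetical helper, it assumes every cluster-level row ends with the three fields seen above (e-value, length, percent identity), and it assumes the first name belongs to the first taxid, which the rows above suggest but do not guarantee.

```python
TRAILING_FIELDS = 3  # e-value, length, percent identity in the rows above


def keep_first_hit(line, sep=";"):
    """Hypothetical helper: reduce a row holding n names followed by n taxids
    to a row holding only the first name and the first taxid."""
    fields = line.strip().split(sep)
    paired = len(fields) - TRAILING_FIELDS
    if paired < 2 or paired % 2 != 0:
        raise ValueError(f"unexpected field count: {len(fields)}")
    n = paired // 2
    names, taxids = fields[:n], fields[n:2 * n]
    if not all(taxid.isdigit() for taxid in taxids):
        raise ValueError(f"non-numeric taxid in: {line!r}")
    return sep.join([names[0], taxids[0]] + fields[2 * n:])


doubled = ("Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;"
           "1654716;1904807;0.0;1411;99.150")
single = "Bradyrhizobium embrapense;630921;0.0;1411;99.150"
print(keep_first_hit(doubled))
print(keep_first_hit(single))
```

Rows that already have a single name and taxid pass through unchanged, so the function could be applied to every line of a cluster's file.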

More complete error information below

Run Name: loving_mendel

####################################################
## nf-core/nanoclust execution completed unsuccessfully! ##
####################################################
The exit status of the task that caused the workflow execution to fail was: 1.
The full error message was:

Error executing process > 'get_abundances (1)'

Caused by:
  Process `get_abundances (1)` terminated with an error exit status (1)

Command executed [/data/NanoCLUST/templates/get_abundance.py]:

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  Traceback (most recent call last):
    File ".command.sh", line 55, in <module>
      get_abundance(names,paths, "C")
    File ".command.sh", line 49, in get_abundance
      df_final_grp = merge_abundance(dfs, tax_level)
    File ".command.sh", line 39, in merge_abundance
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 39, in <listcomp>
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 18, in get_taxname
      return json.loads(complete_tax)[0][tax_level_tag]
  IndexError: list index out of range

Pipeline Configuration:
-----------------------
 - Run Name: loving_mendel
 - Reads: /data/PERM/PERM16S_20201028/sample_fasta/PERM16S_20201028.barcode01.qcreads.fastq
 - Max Resources: 128 GB memory, 16 cpus, 10d time per job
 - Container: docker - [:]
 - Output dir: /data/PERM/PERM16S_20201028/NanoCLUST.BC01
 - Launch dir: /data/NanoCLUST
 - Working dir: /data/NanoCLUST/work
 - Script dir: /data/NanoCLUST
 - User: dmdrown
 - Config Profile: docker
 - Date Started: 2020-12-04T09:07:08.609779-09:00
 - Date Completed: 2020-12-04T09:48:39.519139-09:00
 - Pipeline script file path: /data/NanoCLUST/main.nf
 - Pipeline script hash ID: 41a36e29b6db0c14a411b4f911c51f5e
 - Nextflow Version: 20.10.0
 - Nextflow Build: 5430
 - Nextflow Compile Timestamp: 01-11-2020 15:14 UTC

genomicsITER commented 3 years ago

Hi

I've finally found a fix that prevents multiple taxa from appearing on the same line, and it has already been pushed to the master branch. I hope it works, since I couldn't reproduce the multiple-taxid classification error with our samples.

For the problematic cluster, two database entries corresponding to the exact same sequence had the same score, so BLAST included both tax IDs on the same line. We encourage you to explore the classification results for each cluster in the consensus_classification.csv file (as you did) to better interpret NanoCLUST results.

Thank you for trying the application, I hope it is useful for you

devindrown commented 3 years ago

Thank you. The latest update with the altered blastn query output does fix the multiple taxid issue.

A remaining issue is that some taxids appear to break the get_abundance.py script:

  Traceback (most recent call last):
    File ".command.sh", line 55, in <module>
      get_abundance(names,paths, "C")
    File ".command.sh", line 49, in get_abundance
      df_final_grp = merge_abundance(dfs, tax_level)
    File ".command.sh", line 39, in merge_abundance
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 39, in <listcomp>
      df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    File ".command.sh", line 18, in get_taxname
      return json.loads(complete_tax)[0][tax_level_tag]
  IndexError: list index out of range

Digging into this, I see that some taxon IDs return no output from http://api.unipept.ugent.be/api/v1/taxonomy.json?input. For example, http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=2715402&extra=true&names=true returns []. However, this taxon ID matches NCBI's Caballeronia ginsengisoli.

I'm wondering how the code might be altered to make it robust to these missing values.
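As a rough idea of what that could look like, here is a minimal Python sketch that guards the indexing step from the traceback. `extract_tax_name` and the "Unclassified" label are hypothetical and not taken from NanoCLUST. The names `complete_tax` and `tax_level_tag` come from the traceback, and "class_name" is assumed as an example tag based on the API field names.

```python
import json


def extract_tax_name(complete_tax, tax_level_tag, fallback="Unclassified"):
    """Hypothetical replacement for json.loads(complete_tax)[0][tax_level_tag].

    Returns `fallback` when the API response is an empty list or the record
    does not contain the requested field."""
    records = json.loads(complete_tax)
    if not records:
        return fallback
    return records[0].get(tax_level_tag, fallback)


# An empty response, as returned for taxon ID 2715402 above.
print(extract_tax_name("[]", "class_name"))
# A shortened, made-up response for comparison.
print(extract_tax_name('[{"class_name": "Alphaproteobacteria"}]', "class_name"))
```

With a guard like this, taxids unknown to the API would be grouped under one label in the abundance table instead of stopping the process.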

Other taxon IDs are missing some of the required taxonomic information. For example, "taxon_id":988946 has the following output:

[{"taxon_id":988946,"taxon_name":"Loriellopsis cavernicola","taxon_rank":"species","superkingdom_id":2,"superkingdom_name":"Bacteria","kingdom_id":null,"kingdom_name":"","subkingdom_id":null,"subkingdom_name":"","superphylum_id":null,"superphylum_name":"","phylum_id":1117,"phylum_name":"Cyanobacteria","subphylum_id":null,"subphylum_name":"","superclass_id":null,"superclass_name":"","class_id":null,"class_name":"","subclass_id":null,"subclass_name":"","infraclass_id":null,"infraclass_name":"","superorder_id":null,"superorder_name":"","order_id":1161,"order_name":"Nostocales","suborder_id":null,"suborder_name":"","infraorder_id":null,"infraorder_name":"","parvorder_id":null,"parvorder_name":"","superfamily_id":null,"superfamily_name":"","family_id":1892258,"family_name":"Symphyonemataceae","subfamily_id":null,"subfamily_name":"","tribe_id":null,"tribe_name":"","subtribe_id":null,"subtribe_name":"","genus_id":988945,"genus_name":"Loriellopsis","subgenus_id":null,"subgenus_name":"","species_group_id":null,"species_group_name":"","species_subgroup_id":null,"species_subgroup_name":"","species_id":988946,"species_name":"Loriellopsis cavernicola","subspecies_id":null,"subspecies_name":"","varietas_id":null,"varietas_name":"","forma_id":null,"forma_name":""}]

Notice that the class name is empty: "class_name":"". Looking at a list of taxids from my sample, I can see that some are missing names at the class level and others at different taxonomic levels. I realize that these may be special cases and are perhaps already handled correctly.
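In case it is useful, one way to handle an empty rank is to fall back to the nearest populated higher rank. The Python sketch below is only an illustration: `name_at_or_above`, the list of main ranks, and the "Unclassified <parent>" labelling are all assumptions of mine, not something NanoCLUST or the API prescribes. The record is a shortened copy of the output above.

```python
# Main ranks, from highest to lowest, using the API's field name prefixes.
MAIN_RANKS = ["superkingdom", "phylum", "class", "order",
              "family", "genus", "species"]


def name_at_or_above(record, rank, fallback="Unclassified"):
    """Hypothetical helper: return the name at `rank`, or a label built from
    the nearest higher rank that has a non-empty name."""
    position = MAIN_RANKS.index(rank)
    for candidate in reversed(MAIN_RANKS[: position + 1]):
        name = record.get(candidate + "_name")
        if name:
            return name if candidate == rank else f"{fallback} {name}"
    return fallback


# Shortened copy of the record for taxon_id 988946 shown above.
record = {
    "superkingdom_name": "Bacteria",
    "phylum_name": "Cyanobacteria",
    "class_name": "",
    "order_name": "Nostocales",
}
print(name_at_or_above(record, "class"))
print(name_at_or_above(record, "order"))
```

Whether such placeholder labels are appropriate depends on how the abundance tables are interpreted downstream.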

genomicsITER commented 3 years ago

Hi,

We apologize for the late response. We have integrated some minor changes into the NanoCLUST main branch. The issues with get_abundance.py and the API have been solved. Thank you very much for your time and especially for the error descriptions, which helped us solve this issue.

devindrown commented 3 years ago

The latest commit seems to have reverted the update you made previously (on December 11) to prevent multiple taxa on the same line. In that update you corrected a couple of lines in the blastn calls:

https://github.com/genomicsITER/NanoCLUST/blob/6fd4a65d0fbef8c9038d70e35e297eca63fc9492/main.nf#L439 https://github.com/genomicsITER/NanoCLUST/blob/6fd4a65d0fbef8c9038d70e35e297eca63fc9492/main.nf#L450

That update changed the sscinames and staxids values to ssciname and staxid, as below:

blastn -query $consensus -db nr -remote -entrez_query "Bacteria [Organism]" -task blastn -dust no -outfmt "10 staxid ssciname evalue length score pident" -evalue 11 -max_hsps 50 -max_target_seqs 5 > consensus_classification.csv

blastn -query $consensus -db $db -task blastn -dust no -outfmt "10 ssciname staxid evalue length pident" -evalue 11 -max_hsps 50 -max_target_seqs 5 | sed 's/,/;/g' > consensus_classification.csv
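To show why the plural specifiers matter, here is a small Python sketch of what I believe happens. It assumes that BLAST joins multiple values inside an sscinames or staxids field with ";", as its help text describes, and that the comma-separated output is then passed through sed 's/,/;/g'. Both input rows are reconstructions for illustration, not captured BLAST output.

```python
def commas_to_semicolons(line):
    """Python equivalent of sed 's/,/;/g' applied to a single line."""
    return line.replace(",", ";")


# Assumed raw output with sscinames/staxids: each field holds two values.
plural_row = ("Bradyrhizobium viridifuturi;Bradyrhizobium mercantei,"
              "1654716;1904807,0.0,1411,99.150")
# Assumed raw output with ssciname/staxid: one value per field.
singular_row = "Bradyrhizobium viridifuturi,1654716,0.0,1411,99.150"

for row in (plural_row, singular_row):
    converted = commas_to_semicolons(row)
    # After conversion the inner ";" can no longer be told apart from the
    # field separator, so the plural row gains two extra columns.
    print(len(converted.split(";")), converted)
```

The plural row ends up with seven fields and the singular row with five, which matches the difference between the offending line and the normal lines in cluster58/consensus_classification.csv.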