Closed: devindrown closed this issue 3 years ago
Hi Devin, thank you again for your feedback.
It seems that your consensus_classification.csv has some errors (maybe due to past executions with errors?). We recommend deleting all output files for that sample and rerunning the pipeline to see whether you get the same file issues.
consensus_classification.csv has the following structure:
id;reads_in_cluster;used_for_consensus;reads_after_corr;draft_id;sciname;taxid;length;per_ident
Example line:
13;1314;100;41;2c3fc50f-da48-44e1-a06b-74bec98aaf93 id=96;Escherichia coli str. K-12 substr. MG1655;511145;1474;99.932
It doesn't seem like an API problem, but if some tax_name entries included special characters (especially ";", which we use as the field separator), that would be a problem for the get_abundance process. In any case, get_abundance and plot_abundances only make some calculations from the "main" output file, consensus_classification.csv, so that file will still be useful for analysis.
Thank you for the support, we hope you can get your data analyzed. Feel free to answer again with more information or open new issues.
Thank you for your assistance. I've started a clean run with just a single input sample. The run continues to terminate with an error.
I looked in the reported working dir /data/NanoCLUST/work/2b/1a079249e3a73f9c15717a38a58844
There are two files in that directory:
PERM16S_20201028.barcode01.qcreads_nanoclust_out.txt
PERM16S_20201028.barcode01.qcreads.nanoclust_out.txt
Looking at those files, I see that most lines look the same and follow the structure you described, but there is a single offending entry that includes two scientific names. For example, in PERM16S_20201028.barcode01.qcreads.nanoclust_out.txt:
58;1012;100;43;960ccf21-7860-45dd-969b-42a2de87e2b5 id=91;Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;1904807;0.0
Looking at the information in cluster 58:
$ more cluster58/consensus_classification.csv
Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;1654716;1904807;0.0;1411;99.150
Bradyrhizobium embrapense;630921;0.0;1411;99.150
Bradyrhizobium valentinum;1518501;0.0;1411;99.079
Bradyrhizobium jicamae;280332;0.0;1411;99.008
Bradyrhizobium elkanii;29448;0.0;1411;98.866
So it looks like the tax_name for this cluster has two names with an offending ";" in between. Any suggestions on how to fix this?
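A quick way to spot entries like this before get_abundances runs is to check each line against the nine-column structure described above. The following is only a minimal sketch: `EXPECTED_COLUMNS` is copied from the header quoted earlier in this thread, while the function name `find_problems` and the specific checks are assumptions for illustration, not part of NanoCLUST. Note that the offending line still has nine fields, so a field count alone would not catch it; the non-numeric taxid column is what gives it away.

```python
# Column layout as quoted earlier in this thread.
EXPECTED_COLUMNS = [
    "id", "reads_in_cluster", "used_for_consensus", "reads_after_corr",
    "draft_id", "sciname", "taxid", "length", "per_ident",
]


def find_problems(line, sep=";"):
    """Return a list of problems found in one data line (empty if it looks OK).

    Hypothetical helper, not part of NanoCLUST.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) != len(EXPECTED_COLUMNS):
        return [f"expected {len(EXPECTED_COLUMNS)} fields, found {len(fields)}"]

    row = dict(zip(EXPECTED_COLUMNS, fields))
    problems = []
    # A second scientific name shifts into the taxid column, so it stops being numeric.
    if not row["taxid"].isdigit():
        problems.append(f"taxid is not numeric: {row['taxid']!r}")
    # length and per_ident should both parse as numbers.
    for column in ("length", "per_ident"):
        try:
            float(row[column])
        except ValueError:
            problems.append(f"{column} is not numeric: {row[column]!r}")
    return problems


good = ("13;1314;100;41;2c3fc50f-da48-44e1-a06b-74bec98aaf93 id=96;"
        "Escherichia coli str. K-12 substr. MG1655;511145;1474;99.932")
bad = ("58;1012;100;43;960ccf21-7860-45dd-969b-42a2de87e2b5 id=91;"
       "Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;1904807;0.0")

for label, line in (("good", good), ("bad", bad)):
    print(label, find_problems(line))
```

Running this over each line of the nanoclust_out.txt file would flag the cluster 58 entry while leaving the well-formed lines alone.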
More complete error information below
Run Name: loving_mendel
####################################################
## nf-core/nanoclust execution completed unsuccessfully! ##
####################################################
The exit status of the task that caused the workflow execution to fail was: 1.
The full error message was:
Error executing process > 'get_abundances (1)'
Caused by:
Process `get_abundances (1)` terminated with an error exit status (1)
Command executed [/data/NanoCLUST/templates/get_abundance.py]:
Command error:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
Traceback (most recent call last):
File ".command.sh", line 55, in <module>
get_abundance(names,paths, "C")
File ".command.sh", line 49, in get_abundance
df_final_grp = merge_abundance(dfs, tax_level)
File ".command.sh", line 39, in merge_abundance
df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
File ".command.sh", line 39, in <listcomp>
df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
File ".command.sh", line 18, in get_taxname
return json.loads(complete_tax)[0][tax_level_tag]
IndexError: list index out of range
Pipeline Configuration:
-----------------------
- Run Name: loving_mendel
- Reads: /data/PERM/PERM16S_20201028/sample_fasta/PERM16S_20201028.barcode01.qcreads.fastq
- Max Resources: 128 GB memory, 16 cpus, 10d time per job
- Container: docker - [:]
- Output dir: /data/PERM/PERM16S_20201028/NanoCLUST.BC01
- Launch dir: /data/NanoCLUST
- Working dir: /data/NanoCLUST/work
- Script dir: /data/NanoCLUST
- User: dmdrown
- Config Profile: docker
- Date Started: 2020-12-04T09:07:08.609779-09:00
- Date Completed: 2020-12-04T09:48:39.519139-09:00
- Pipeline script file path: /data/NanoCLUST/main.nf
- Pipeline script hash ID: 41a36e29b6db0c14a411b4f911c51f5e
- Nextflow Version: 20.10.0
- Nextflow Build: 5430
- Nextflow Compile Timestamp: 01-11-2020 15:14 UTC
Hi
I've finally found a fix to prevent multiple taxa from appearing on the same line, and it is already pushed to the master branch. I hope it works, since I couldn't reproduce the multiple tax ID classification error with our samples.
The classification for the problematic cluster had the same score for two database entries that correspond to the exact same sequence, so BLAST includes both tax IDs on the same line. We encourage you to explore the classification results for each cluster in the consensus_classification.csv file (as you did) to better interpret NanoCLUST results.
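For output files that were already written with tied hits, one option is to collapse such a row after the fact. The sketch below is only an illustration and not what the pushed fix does. It assumes the per-cluster layout seen in the cluster 58 file above (names first, then the same number of tax IDs, then three trailing numeric fields) and that keeping the first name and tax ID is acceptable; `collapse_multi_hit` is a hypothetical helper.

```python
def collapse_multi_hit(row, sep=";", n_tail=3):
    """Keep only the first name/taxid pair from a row that lists tied hits.

    Assumed layout: name_1..name_k, taxid_1..taxid_k, then n_tail numeric fields.
    Hypothetical helper, not part of NanoCLUST.
    """
    fields = row.strip().split(sep)
    head, tail = fields[:-n_tail], fields[-n_tail:]
    # The head must split evenly into names and tax IDs.
    if len(head) < 2 or len(head) % 2 != 0:
        raise ValueError(f"unexpected number of fields: {len(fields)}")
    half = len(head) // 2
    names, taxids = head[:half], head[half:]
    if not all(taxid.isdigit() for taxid in taxids):
        raise ValueError(f"non-numeric tax ID in {taxids!r}")
    return sep.join([names[0], taxids[0]] + tail)


tied = ("Bradyrhizobium viridifuturi;Bradyrhizobium mercantei;"
        "1654716;1904807;0.0;1411;99.150")
single = "Bradyrhizobium embrapense;630921;0.0;1411;99.150"

print(collapse_multi_hit(tied))
print(collapse_multi_hit(single))
```

Rows with a single hit pass through unchanged, and rows that do not match the assumed layout raise an error rather than being silently rewritten.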
Thank you for trying the application, I hope it is useful for you
Thank you. The latest update with the altered blastn query output does fix the multiple taxid issue.
A remaining issue is that some taxids appear to be breaking the get_abundance.py function:
Traceback (most recent call last):
File ".command.sh", line 55, in <module>
get_abundance(names,paths, "C")
File ".command.sh", line 49, in get_abundance
df_final_grp = merge_abundance(dfs, tax_level)
File ".command.sh", line 39, in merge_abundance
df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
File ".command.sh", line 39, in <listcomp>
df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
File ".command.sh", line 18, in get_taxname
return json.loads(complete_tax)[0][tax_level_tag]
IndexError: list index out of range
Digging into this, I see that some taxon_id queries to http://api.unipept.ugent.be/api/v1/taxonomy.json?input return no output. For example, http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=2715402&extra=true&names=true returns []. However, this taxon ID matches NCBI's Caballeronia ginsengisoli.
I'm wondering how the code might be altered to make it robust to these missing values.
Other taxon IDs don't have all of the required taxonomic information. For example, "taxon_id":988946 has the following output:
[{"taxon_id":988946,"taxon_name":"Loriellopsis cavernicola","taxon_rank":"species","superkingdom_id":2,"superkingdom_name":"Bacteria","kingdom_id":null,"kingdom_name":"","subkingdom_id":null,"subkingdom_name":"","superphylum_id":null,"superphylum_name":"","phylum_id":1117,"phylum_name":"Cyanobacteria","subphylum_id":null,"subphylum_name":"","superclass_id":null,"superclass_name":"","class_id":null,"class_name":"","subclass_id":null,"subclass_name":"","infraclass_id":null,"infraclass_name":"","superorder_id":null,"superorder_name":"","order_id":1161,"order_name":"Nostocales","suborder_id":null,"suborder_name":"","infraorder_id":null,"infraorder_name":"","parvorder_id":null,"parvorder_name":"","superfamily_id":null,"superfamily_name":"","family_id":1892258,"family_name":"Symphyonemataceae","subfamily_id":null,"subfamily_name":"","tribe_id":null,"tribe_name":"","subtribe_id":null,"subtribe_name":"","genus_id":988945,"genus_name":"Loriellopsis","subgenus_id":null,"subgenus_name":"","species_group_id":null,"species_group_name":"","species_subgroup_id":null,"species_subgroup_name":"","species_id":988946,"species_name":"Loriellopsis cavernicola","subspecies_id":null,"subspecies_name":"","varietas_id":null,"varietas_name":"","forma_id":null,"forma_name":""}]
Notice that the class name is: "class_name":""
Looking at a list of taxids from my sample, I can see that some are missing names at the class level and others at different taxonomic levels. I realize that these may be special cases and are perhaps already handled OK.
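One way the lookup could tolerate both cases (an empty [] response and an empty name at the requested rank) is sketched below. This is only a minimal illustration: the names `complete_tax` and `tax_level_tag` come from the traceback above, but `safe_taxname` and the "Unclassified" fallback label are assumptions, not the code in get_abundance.py.

```python
import json


def safe_taxname(complete_tax, tax_level_tag, fallback="Unclassified"):
    """Return the name at the requested rank, or a fallback label.

    complete_tax is the raw JSON text returned by the taxonomy API.
    Hypothetical replacement for json.loads(complete_tax)[0][tax_level_tag].
    """
    try:
        records = json.loads(complete_tax)
    except (TypeError, ValueError):
        # Not JSON at all (e.g. None or an HTML error page).
        return fallback
    if not isinstance(records, list) or not records:
        # The API returned [] for a taxon ID it does not know.
        return fallback
    first = records[0]
    if not isinstance(first, dict):
        return fallback
    # Missing key or empty string both fall back to the label.
    return first.get(tax_level_tag) or fallback


empty_response = "[]"
partial_response = '[{"taxon_id": 988946, "phylum_name": "Cyanobacteria", "class_name": ""}]'

print(safe_taxname(empty_response, "class_name"))
print(safe_taxname(partial_response, "class_name"))
print(safe_taxname(partial_response, "phylum_name"))
```

With a fallback label, unknown taxa would be grouped together in the abundance table instead of stopping the run; whether that grouping is appropriate for the analysis is a separate decision.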
Hi,
We apologize for the late response. We have integrated some minor changes into the NanoCLUST main branch. The issues with get_abundance.py and the API have been solved. Thank you very much for your time, and especially for the error descriptions, which helped us solve this issue.
The latest commit seems to have erased the update you made previously (on December 11) to prevent multiple taxa from appearing on the same line. There you corrected a couple of lines in the blastn calls:
https://github.com/genomicsITER/NanoCLUST/blob/6fd4a65d0fbef8c9038d70e35e297eca63fc9492/main.nf#L439
https://github.com/genomicsITER/NanoCLUST/blob/6fd4a65d0fbef8c9038d70e35e297eca63fc9492/main.nf#L450
This was modified such that the sscinames staxids values became ssciname staxid, as below:
blastn -query $consensus -db nr -remote -entrez_query "Bacteria [Organism]" -task blastn -dust no -outfmt "10 staxid ssciname evalue length score pident" -evalue 11 -max_hsps 50 -max_target_seqs 5 > consensus_classification.csv
blastn -query $consensus -db $db -task blastn -dust no -outfmt "10 ssciname staxid evalue length pident" -evalue 11 -max_hsps 50 -max_target_seqs 5 | sed 's/,/;/g' > consensus_classification.csv
Samples that contain reads identified as "Bradyrhizobium mercantei" appear to be breaking the pipeline during the estimate abundance phase. When I look into the consensus_classification.csv file, I see that the name is reported twice, as below. Is this caused by the DB, or does this happen as a result of the API call? Is there any way to fix it?
-Devin
More error output for context below