Error in the get abundances step, by taxa not giving a response from Unipept.

Thomieh73 commented 3 years ago

Hi, I am running the nanoclust pipeline and I get an error in the get_abundances step that I am not understanding. I noticed it for multiple samples in my dataset. When I removed 12 samples from my analysis, my pipeline finished.

This is the input file for one of the samples: barcode26.trimmed.fastq.nanoclust_out.txt

id;reads_in_cluster;used_for_consensus;reads_after_corr;draft_id;sciname;taxid;length;per_ident
0;181;100;42;13ab7554-0823-4a77-a740-fb92763d8bda id=43;Candidatus Pelagibacter ubique HTCC1062;335992;1332;98.874
2;52;100;43;4d30559a-0e2e-4a4b-8fb2-464c37a509b9 id=10;Taeania maliponensis;1819564;1414;89.321
1;179;100;43;250f2279-6f02-47c1-9214-5d85b77c3376 id=92;Actinomarinicola tropica;2789776;1353;83.444
14;113;100;43;e730d27b-e8ac-4c97-b95c-d181d3624b8a id=60;Arenibacter nanhaiticus;558155;1417;92.802
18;63;100;43;2318b187-9a34-4ec0-94f0-185f27dbd9d0 id=34;Vicingus serpentipes;1926625;1417;87.862
22;84;100;43;37d2db52-9fdd-42d9-89f7-36e97cb5d13f id=4;Thalassotalea atypica;2054316;1434;92.678
12;72;100;43;52d92061-2e07-4e69-a78a-812ddc2f47e5 id=32;Gimesia maris;122;1459;87.731
4;57;100;44;5d2f6aec-09e3-47f0-b834-5e1e9e2d55cc id=17;Mesobacillus rigiliprofundi;1523158;1406;79.943
21;106;100;45;a87c285d-d8e0-46cb-b01e-df7945fec255 id=60;Coraliomargarita akajimensis;395922;1461;86.858
3;390;100;43;2e9306dd-f766-4b85-a273-a518dbaa2033 id=57;Methylotenera mobilis JLW8;583345;1443;93.763
24;186;100;43;4e907808-91a5-4700-a053-3194a2a38fde id=9;Alkalimarinus sediminis;1632866;1437;91.858
11;268;100;43;7f343204-08f0-45cc-b460-56ab9ff3b7d5 id=60;Alkalimarinus sediminis;1632866;1455;86.460
28;653;100;43;e9b91a9d-570d-4e6b-aba4-1989476e2fec id=27;Thalassomonas haliotis;485448;1439;95.483
9;60;100;43;747d1e37-e8ad-4a3f-af37-d2db083523c5 id=3;Altibacter lentus;1223410;1422;90.717
10;139;100;43;3dc2ee32-fff8-4e16-ab7a-6bd1fd68f971 id=50;Thiolapillus brandeum;1076588;1449;86.888
20;79;100;43;69f61715-71f1-443a-bdca-757af76e6235 id=23;Longimicrobium terrae;1639882;1402;83.024
27;105;100;43;dea0c225-44fb-447d-90c0-3b0966c6c067 id=48;Colwellia psychrerythraea;28229;1446;97.925
13;225;100;43;2360e51e-64ae-4bb1-a460-fb90dfd67d53 id=56;Polaribacter atrinae;1333662;1423;96.767
17;108;100;43;188ee7a4-ee57-4768-89c8-f776b65089fe id=94;Owenweeksia hongkongensis DSM 17368;926562;1412;87.890
26;60;100;44;7a0650e3-5ee5-4d28-9da6-61cb3c109162 id=51;Butyratibacter algicola;2029869;1184;92.230
25;151;100;42;a619e085-8431-49cb-afcf-2a95bafb9bf1 id=45;Pelagimonas varians;696760;1021;96.572
29;1573;100;43;6751acaa-f6a6-4f54-8390-74892f37d560 id=78;Thalassomonas haliotis;485448;1447;95.508
6;145;100;43;7ee9dbec-bbef-4115-9766-680fb3f80083 id=85;Owenweeksia hongkongensis DSM 17368;926562;1423;88.335
8;59;100;43;2426ac8d-0d30-4546-90d9-bc69ebc59995 id=37;Poseidonibacter lekithochrous;1904463;1407;95.736
16;1144;100;43;8f7269e2-4bfa-4e66-93a6-f04f6f583365 id=67;Glaciecola amylolytica;2489595;1421;96.904
19;140;100;44;ada6c9bd-7ab0-407a-b074-a48aafeb48df id=65;Gemmatimonas aurantiaca T-27;379066;1222;84.943
23;83;100;43;35f5b040-c792-463c-8528-630b82aea3bd id=10;Pseudohongiella nitratireducens;1768907;1436;94.777
15;2382;100;43;04c44679-e020-4e24-a05a-cbf3c3cdad62 id=80;Glaciecola amylolytica;2489595;1416;96.398
7;68;100;43;a2a507f7-e93c-40bc-bc59-5ac68aabf671 id=30;Poseidonibacter lekithochrous;1904463;1402;92.939
5;130;100;43;d5a1a6c7-6464-4ea7-be4c-be42d07e0c4f id=10;Phaeocystidibacter marisrubri;1577780;1416;90.042

And this is the .command.log output:

Traceback (most recent call last):
  File "/cluster/work/users/thhaverk/nanoclust_tmp/c0/7c1d408ddcb9bded4d2821d2232ea0/.command.sh", line 65, in <module>
    get_abundance(names,paths, "G")
  File "/cluster/work/users/thhaverk/nanoclust_tmp/c0/7c1d408ddcb9bded4d2821d2232ea0/.command.sh", line 59, in get_abundance
    df_final_grp = merge_abundance(dfs, tax_level)
  File "/cluster/work/users/thhaverk/nanoclust_tmp/c0/7c1d408ddcb9bded4d2821d2232ea0/.command.sh", line 49, in merge_abundance
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
  File "/cluster/work/users/thhaverk/nanoclust_tmp/c0/7c1d408ddcb9bded4d2821d2232ea0/.command.sh", line 49, in <listcomp>
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
  File "/cluster/work/users/thhaverk/nanoclust_tmp/c0/7c1d408ddcb9bded4d2821d2232ea0/.command.sh", line 28, in get_taxname
    return json.loads(complete_tax)[0][tax_level_tag]
IndexError: list index out of range

It seems as the genus step of the script is causing the error, but I am not understanding why?

This is the .command.sh file for the process:

#!/usr/bin/env python

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc
import pandas as pd
from functools import reduce
import requests
import json
#https://unipept.ugent.be/apidocs/taxonomy

def get_taxname(tax_id,tax_level):
    tags = {"S": "species_name","G": "genus_name","F": "family_name","O":'order_name', "C": "class_name"}
    tax_level_tag = tags[tax_level]
    #Avoids pipeline crash due to "nan" classification output. Thanks to Qi-Maria from Github
    if str(tax_id) == "nan":
        tax_id = 1

    path = 'http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=' + str(int(tax_id)) + '&extra=true&names=true'
    complete_tax = requests.get(path).text

    #Checks for API correct response (field containing the tax name). Thanks to devinbrown from Github
    try:
        name = json.loads(complete_tax)[0][tax_level_tag]
    except:
        name = str(int(tax_id))

    return json.loads(complete_tax)[0][tax_level_tag]

def get_abundance_values(names,paths):
 dfs = []
    for name,path in zip(names,paths):
        data = pd.read_csv(path, index_col=False, sep=';').iloc[:,1:]

        total = sum(data['reads_in_cluster'])
        rel_abundance=[]

        for index,row in data.iterrows():
            rel_abundance.append(row['reads_in_cluster'] / total)

        data['rel_abundance'] = rel_abundance
        dfs.append(pd.DataFrame({'taxid': data['taxid'], 'rel_abundance': rel_abundance}))
        data.to_csv("" + name + "_nanoclust_out.txt")

    return dfs

def merge_abundance(dfs,tax_level):
    df_final = reduce(lambda left,right: pd.merge(left,right,on='taxid',how='outer').fillna(0), dfs)
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    df_final_grp = df_final.groupby(["taxid"], as_index=False).sum()
    return df_final_grp

def get_abundance(names,paths,tax_level):
    if(not isinstance(paths, list)):
        paths = [paths]
        names = [names]

    dfs = get_abundance_values(names,paths)
    df_final_grp = merge_abundance(dfs, tax_level)
    df_final_grp.to_csv("rel_abundance_"+ names[0] + "_" + tax_level + ".csv", index = False)

paths = "barcode26.trimmed.fastq.nanoclust_out.txt"
names = "barcode26.trimmed.fastq"

get_abundance(names,paths, "G")
get_abundance(names,paths, "S")
get_abundance(names,paths, "O")
get_abundance(names,paths, "F")

Thomieh73 commented 3 years ago

Okay, Today I had run into the same problem with get_abundances. The difference with this run was that I had changed the min-length parameter to 1300.

I had one dataset crashing again. and now I explored which taxon was causing the error. I therefore extracted the following taxon list from barcode14.trimmed.fastq_nanoclust_out.txt

sciname,taxid
Glaciecola amylolytica,2489595
Thiolapillus brandeum,1076588
Candidatus Pelagibacter ubique HTCC1062,335992
Polaribacter butkevichii,218490
Gemmatimonas aurantiaca T-27,379066
Thalassomonas haliotis,485448
Arenibacter nanhaiticus,558155
Phaeocystidibacter marisrubri,1577780
Methylophilus methylotrophus,17
Poseidonibacter lekithochrous,1904463
Glaciecola amylolytica,2489595
Alkalimarinus sediminis,1632866
Actinomarinicola tropica,2789776
Owenweeksia hongkongensis DSM 17368,926562
Roseobacter litoralis,42443
Candidatus Pelagibacter ubique HTCC1062,335992

I then ran this command with the Tax IDs:

curl -X POST -H 'Accept: application/json' api.unipept.ugent.be/api/v1/taxonomy \
-d 'input[]=2489595' \
-d 'input[]=1076588' \
-d 'input[]=335992' \
-d 'input[]=218490' \
-d 'input[]=485448' \
-d 'input[]=558155' \
-d 'input[]=1577780' \
-d 'input[]=17' \
-d 'input[]=1904463' \
-d 'input[]=2489595' \
-d 'input[]=1632866' \
-d 'input[]=2789776' \
-d 'input[]=926562' \
-d 'input[]=42443' \
-d 'input[]=335992'

and this is the output:

{"taxon_id":2489595,"taxon_name":"Glaciecola sp. THG-3.7","taxon_rank":"species"},
{"taxon_id":1076588,"taxon_name":"Thiolapillus brandeum","taxon_rank":"species"},
{"taxon_id":335992,"taxon_name":"Candidatus Pelagibacter ubique HTCC1062","taxon_rank":"no rank"},
{"taxon_id":218490,"taxon_name":"Polaribacter butkevichii","taxon_rank":"species"},
{"taxon_id":485448,"taxon_name":"Thalassomonas haliotis","taxon_rank":"species"},
{"taxon_id":558155,"taxon_name":"Arenibacter nanhaiticus","taxon_rank":"species"},
{"taxon_id":1577780,"taxon_name":"Phaeocystidibacter marisrubri","taxon_rank":"species"},
{"taxon_id":17,"taxon_name":"Methylophilus methylotrophus","taxon_rank":"species"},
{"taxon_id":1904463,"taxon_name":"Arcobacter lekithochrous","taxon_rank":"species"},
{"taxon_id":2489595,"taxon_name":"Glaciecola sp. THG-3.7","taxon_rank":"species"},
{"taxon_id":1632866,"taxon_name":"Alkalimarinus sediminis","taxon_rank":"species"},
{"taxon_id":926562,"taxon_name":"Owenweeksia hongkongensis DSM 17368","taxon_rank":"no rank"},
{"taxon_id":42443,"taxon_name":"Roseobacter litoralis","taxon_rank":"species"},
{"taxon_id":335992,"taxon_name":"Candidatus Pelagibacter ubique HTCC1062","taxon_rank":"no rank"}]

Checking the results I notice that for one taxon, I do not get anything back. Actinomarinicola tropica,2789776

Running the line:

curl -X POST -H 'Accept: application/json' api.unipept.ugent.be/api/v1/taxonomy \
-d 'input[]=2789776'

gives me []

I then removed the sample, and restarted the pipeline. It crashed with another sample. CHecking the species overview of that sample I find:

sciname,taxid
Owenweeksia hongkongensis DSM 17368,926562
Leisingera methylohalidivorans DSM 14336,999552
Actinomarinicola tropica,2789776
Poseidonibacter lekithochrous,1904463
Owenweeksia hongkongensis DSM 17368,926562
Candidatus Pelagibacter ubique HTCC1062,335992
Glaciecola amylolytica,2489595
Polaribacter franzmannii ATCC 700399,1248440
Methylotenera mobilis JLW8,583345
Roseobacter litoralis,42443
Glaciecola amylolytica,2489595
Candidatus Pelagibacter ubique HTCC1062,335992
Alkalimarinus sediminis,1632866
Thalassomonas haliotis,485448
Nisaea nitritireducens,568392
Amylibacter marinus,1475483

I then test if removing Actinomarinicola tropica,2789776 from the file: barcode38.trimmed.fastq.nanoclust_out.txt is an option.

I ran the command: bash .command.run

That delivers me the following files:

barcode38.trimmed.fastq.nanoclust_out.txt
barcode38.trimmed.fastq_nanoclust_out.txt
rel_abundance_barcode38.trimmed.fastq_F.csv
rel_abundance_barcode38.trimmed.fastq_O.csv  
rel_abundance_barcode38.trimmed.fastq_G.csv
rel_abundance_barcode38.trimmed.fastq_S.csv

Since taxa giving a blank respons crash the pipeline, I have send an email to unipept, if they can check out this issue. I am unsure how to solve it otherwise.

This is the same as noted in issue #19

Thomieh73 commented 3 years ago

The reply I got from unipept on this error:

Thank you for contacting us. The taxon you provided `2789776` is not present in our database at this time, that’s why our API does not respond with a valid response to your query. The Unipept database is currently based on UniProt version 2020.01, but we are currently in the process of updating our database to the latest UniProt version.

I’ve also discovered a problem with the `try it` module on our website, but I was able to resolve it (you should be able to use this module again).

Please don’t hesitate to contact us if the problem persists or if you have any other questions.

Kind regards,

So this might be resolved in the near future...

Thomieh73 commented 3 years ago

it is unclear how long the updating of unipept will take:

I can’t give a hard deadline for this yet, since we are currently experiencing some issues with our server storage space that we need to sort out first. Our API returns an empty array in case none of the taxa ID’s you provided were found, so make sure that your pipeline can handle this case properly (we do not return an HTTP error code or anything in this case, just an empty array).

genomicsITER / NanoCLUST

Error in the get abundances step, by taxa not giving a response from Unipept. #32