Open rprops opened 7 years ago
Hi Ruben,
The issue here I think is that you are using a newer version of the NR which no longer uses the gids. Actually we have updated ClassifyContigNR.py to address this but have not documented that yet. You need to download the accession id to taxa id mapping from the NCBI and then pass that file with the -a to ClassifyContigNR.py, the alternative is to use the old gid to taxaid mapping file and legacy NR that I have made available:
https://desmandatabases.s3.climb.ac.uk/nr.faa
https://desmandatabases.s3.climb.ac.uk/gi_taxid_prot.dmp
https://desmandatabases.s3.climb.ac.uk/all_taxa_lineage_notnone.tsv
In which case you pass the gi_taxaid mappings with the -g option.
Thanks, Chris
Hi Chris,
Excellent, this solved my issue!
Only minor issue was that because I used a newer NR database not all my taxa ids could be mapped back to the lineage file you provided here (all_taxa_lineage_notnone.tsv). Do you have any suggestions on how to generate this file. I have managed to make one through the functions available in this repo, but the ClassifyContigNR.py is complaining: Seems like it can't deal well with partial lineages?
Thank you!
Ruben
Hi Ruben,
This script may solve the issue, it just takes the lineage file as first argument and outputs a modified file. I can put this in the repo if you like?
import argparse import sys, getopt import operator
from collections import defaultdict from collections import Counter
def main(argv):
parser = argparse.ArgumentParser()
parser.add_argument("lineage_file", help="text taxid to lineage mapping")
args = parser.parse_args()
for line in open(args.lineage_file):
line = line.rstrip()
tokens = line.split("\t")
taxaid = tokens.pop(0)
newtaxa = [taxaid]
currNotNone =''
for depth in range(7):
taxa = tokens[depth]
if taxa == 'None':
newtaxa.append(currNotNone+"_"+str(depth))
else:
newtaxa.append(taxa)
currNotNone = taxa
newline = '\t'.join(newtaxa)
sys.stdout.write(newline+"\n")
if name == "main": main(sys.argv[1:])
Dear Chris,
I've been trying to run DESMAN on some of my own data but I keep getting errors during the contig taxonomy assignment step described in your complete example. Specifically the
ClassifyContigNR.py
script generates this error:It seems that the error resides in the .m8 file? It is formatted as such:
When I look at the code of
ClassifyContigNR.py
the example input header has a second column that is formatted like thisgi|973180054|gb|KUL19018.1|
. While mine seem to lack the gid. I'm running diamond v0.8.33.Thanks for your feedback,
Ruben