chrisquince / DESMAN

De novo Extraction of Strains from MetAgeNomes
Other
69 stars 22 forks source link

Issue Contig taxonomic assignment #15

Open rprops opened 7 years ago

rprops commented 7 years ago

Dear Chris,

I've been trying to run DESMAN on some of my own data but I keep getting errors during the contig taxonomy assignment step described in your complete example. Specifically the ClassifyContigNR.py script generates this error: image

It seems that the error resides in the .m8 file? It is formatted as such: image

When I look at the code of ClassifyContigNR.py the example input header has a second column that is formatted like this gi|973180054|gb|KUL19018.1|. While mine seem to lack the gid. I'm running diamond v0.8.33.

Thanks for your feedback,

Ruben

chrisquince commented 7 years ago

Hi Ruben,

The issue here I think is that you are using a newer version of the NR which no longer uses the gids. Actually we have updated ClassifyContigNR.py to address this but have not documented that yet. You need to download the accession id to taxa id mapping from the NCBI and then pass that file with the -a to ClassifyContigNR.py, the alternative is to use the old gid to taxaid mapping file and legacy NR that I have made available:

https://desmandatabases.s3.climb.ac.uk/nr.faa

https://desmandatabases.s3.climb.ac.uk/gi_taxid_prot.dmp

https://desmandatabases.s3.climb.ac.uk/all_taxa_lineage_notnone.tsv

In which case you pass the gi_taxaid mappings with the -g option.

Thanks, Chris

rprops commented 7 years ago

Hi Chris,

Excellent, this solved my issue!

Only minor issue was that because I used a newer NR database not all my taxa ids could be mapped back to the lineage file you provided here (all_taxa_lineage_notnone.tsv). Do you have any suggestions on how to generate this file. I have managed to make one through the functions available in this repo, but the ClassifyContigNR.py is complaining: image Seems like it can't deal well with partial lineages? image

Thank you!

Ruben

chrisquince commented 7 years ago

Hi Ruben,

This script may solve the issue, it just takes the lineage file as first argument and outputs a modified file. I can put this in the repo if you like?

Thanks, Chris

import argparse import sys, getopt import operator

from collections import defaultdict from collections import Counter

def main(argv):

parser = argparse.ArgumentParser()

parser.add_argument("lineage_file", help="text taxid to lineage mapping")

args = parser.parse_args()

for line in open(args.lineage_file):
    line = line.rstrip()

    tokens = line.split("\t")

    taxaid = tokens.pop(0)

    newtaxa = [taxaid]
    currNotNone =''
    for depth in range(7):
        taxa = tokens[depth]
        if taxa == 'None':
            newtaxa.append(currNotNone+"_"+str(depth))
        else:
            newtaxa.append(taxa)
            currNotNone = taxa
    newline = '\t'.join(newtaxa)

    sys.stdout.write(newline+"\n")

if name == "main": main(sys.argv[1:])