andreaminio / AnnotationPipeline-EVM_based-DClab

Cantù Lab @ UC Davis - Annotation pipeline - EVM based

how can I get the database #4

Closed zhouyflab closed 2 years ago

zhouyflab commented 2 years ago

Hi

I can't figure out how to get the “dedicated RefSeq protein dataset” mentioned here:

“Run blast search against a dedicated RefSeq protein dataset (in the example, plants proteins).”

I offer my sincere thanks to you!

shuo

zhouyflab commented 2 years ago

Also, what is the format of this file?

python /Scripts/mapGOs.py -m refseq2go.txt

Also, the script folder contains only mapgo.py.

This is my refseq2go.txt:

NP_001267860.1 GO:0004674
NP_001267894.1 GO:0004674
NP_001267898.1 GO:0003677
NP_001267919.1 GO:0000976
NP_001267927.1 GO:0004674
NP_001267932.1 GO:0004674
NP_001267941.1 GO:0004674
NP_001267948.1 GO:0003700

This is the error:

Traceback (most recent call last):
  File "tools/Scripts/mapGO.py", line 321, in <module>
    convert(in_file, outfile)
  File "tools/Scripts/mapGO.py", line 197, in convert
    ref_id = sseqid.split("|")[1]
IndexError: list index out of range

based on Python 2.7

andreaminio commented 2 years ago

Hi Shuo,

Your refseq file seems correct. This is how it should appear when an entry has more than one GO term:

YP_031579.1     GO:0006355; GO:0046782; GO:0006351
YP_031580.1     GO:0033644; GO:0016021
YP_031582.1     GO:0033644; GO:0016021
YP_031587.1     GO:0005524; GO:0003677; GO:0004386
YP_031588.1     GO:0033644; GO:0016021
YP_031589.1     GO:0033644; GO:0016021
YP_654585.1     GO:0033644; GO:0016021
YP_031597.1     GO:0005524; GO:0004674
YP_031598.1     GO:0033644; GO:0016021
YP_654594.1     GO:0033644; GO:0016021
YP_031605.1     GO:0016301
YP_031606.1     GO:0003677; GO:0003899; GO:0006351

Two columns: the first holds the entry ID and the second a semicolon-separated list of GO terms.
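As a quick way to sanity-check a file against this layout, here is a minimal sketch of a reader for it. It is not the code mapGO.py uses; the function name `parse_refseq2go` is hypothetical, it assumes the two columns are separated by any run of whitespace (tab or spaces), and it is written for Python 3 rather than the Python 2.7 mentioned above.

```python
import io


def parse_refseq2go(handle):
    """Map each entry ID to its list of GO terms (hypothetical helper)."""
    mapping = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the first whitespace run separates the two columns.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError("line %d: expected 2 columns" % line_number)
        entry, go_field = parts
        # The second column is a semicolon-separated list of GO terms.
        mapping[entry] = [go.strip() for go in go_field.split(";") if go.strip()]
    return mapping


# Two rows taken from the example above.
example = io.StringIO(
    "YP_031579.1     GO:0006355; GO:0046782; GO:0006351\n"
    "YP_031605.1     GO:0016301\n"
)
refseq2go = parse_refseq2go(example)
print(refseq2go["YP_031579.1"])
```

A line with only one column raises a ValueError that names the line number, which makes a malformed row easy to locate.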

The issue, however, comes from the BLAST file: the script cannot find the entry for the target sequence of a hit. In the BLAST results, the target name should be a string holding several pieces of information, all pipe-delimited, and the name of the reference sequence hit in the database is the second element of that string. ref_id = sseqid.split("|")[1] is used to retrieve it. The IndexError means the split produced no second element, so at least one hit ID is not structured as expected (it contains no pipe character).

Can you check your blast results to find out if there is any issue with the reported reference ids?
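One way to run that check is sketched below. It rests on assumptions not confirmed in this thread: that the BLAST results are tab-separated with the subject ID (sseqid) in the second column, as in the standard tabular output, and that comment lines start with "#". The function name `find_unexpected_sseqids` and the sample lines are hypothetical; only the two accession numbers come from the refseq2go.txt posted above. The sketch is written for Python 3.

```python
def find_unexpected_sseqids(lines, sseqid_column=1):
    """Return (line_number, text) pairs for hits whose subject ID
    has no second pipe-delimited field (hypothetical helper)."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        if len(fields) <= sseqid_column:
            # Not enough columns to contain a subject ID at all.
            problems.append((line_number, line))
            continue
        sseqid = fields[sseqid_column]
        # Same split the script performs; index 1 must exist.
        if len(sseqid.split("|")) < 2:
            problems.append((line_number, sseqid))
    return problems


# Hypothetical sample: the first ID is pipe-delimited, the second is bare.
sample = [
    "gene1\tref|NP_001267860.1|\t98.5\n",
    "gene2\tNP_001267894.1\t97.0\n",
]
bad = find_unexpected_sseqids(sample)
print(bad)
```

To check a real results file, pass an open file handle instead of the sample list. Any ID it reports is one that would make sseqid.split("|")[1] raise the IndexError shown in the traceback.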

Andrea

zhouyflab commented 2 years ago

Thanks Andrea!

Cheers, Yongfeng


andreaminio commented 2 years ago

You're welcome!

See you soon, take care

Andrea