NCBI-Hackathons / Master_gff3_parser

Convert sequence IDs between ucsc/refseq/genbank
MIT License

Additional examples #2

Closed childers closed 7 years ago

childers commented 7 years ago

Terence had some additional examples for us to test with:

For more testing, here are two assemblies with lots of sequences, so the mapping table is big: https://www.ncbi.nlm.nih.gov/assembly/GCF_000715135.1 https://www.ncbi.nlm.nih.gov/assembly/GCF_000233375.1

The program should gracefully fail given an assembly report like this one: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/180/655/GCA_000180655.1_ASM18065v1/GCA_000180655.1_ASM18065v1_assembly_report.txt As I mentioned, we’re planning to switch to always populating the file so cases like that will go away. It’s also never the case for RefSeq assemblies.
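One way to get the graceful failure described above is sketched below in Python. It is not the project's actual code. It assumes the assembly report is tab-delimited, that the last `#` line before the data names the columns (for example `GenBank-Accn` and `RefSeq-Accn`), and that missing values are written as `na`. The names `load_mapping` and `EmptyMappingError` are hypothetical.

```python
class EmptyMappingError(ValueError):
    """Raised when the report has no usable source -> target ID pairs."""


def load_mapping(report_text, src_col, dst_col):
    """Build a {source_id: target_id} dict from assembly report text.

    Assumes the last '#' line holds the tab-separated column names and
    that unpopulated cells contain the literal string 'na'.
    """
    header = None
    rows = []
    for line in report_text.splitlines():
        if line.startswith("#"):
            # Keep overwriting: the last comment line is taken as the header.
            header = line.lstrip("# ").split("\t")
        elif line.strip():
            rows.append(line.split("\t"))

    if header is None or src_col not in header or dst_col not in header:
        raise EmptyMappingError(
            f"columns {src_col!r}/{dst_col!r} not found in report header")

    s, d = header.index(src_col), header.index(dst_col)
    mapping = {
        r[s]: r[d]
        for r in rows
        if len(r) > max(s, d) and r[s] != "na" and r[d] != "na"
    }
    if not mapping:
        # Covers both an unpopulated report and an all-'na' column.
        raise EmptyMappingError(
            f"no {src_col} -> {dst_col} pairs in report; nothing to convert")
    return mapping
```

A caller could catch `EmptyMappingError`, print the message, and exit with a non-zero status instead of showing a traceback.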

guilhemfaure commented 7 years ago

Thanks, will test it very soon!

childers commented 7 years ago

For tobacco, there are no alternative IDs to convert to (not even GenBank IDs).

It does work if we convert from RefSeq to RefSeq:

$ time seqconv convert --ref Ntab-TN90 --out rs ref_Ntab-TN90_top_level.gff3.gz >test_tobacco_gb.gff3
Converting from None to rs
Starting Conversion
FORMAT detected: rs
real    0m16.931s
user    0m14.429s
sys     0m1.302s
childers commented 7 years ago

For salmon, it appears to work OK:

$ time seqconv convert --ref ICSASG_v2 --out gb ref_ICSASG_v2_top_level.gff3.gz > test_salmon.gff3
Converting from None to gb
Starting Conversion
No corresponding id for nc_001960.1 from rs
FORMAT detected: rs
real    0m50.122s
user    0m37.864s
sys     0m3.102s
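The "No corresponding id" message in the log above suggests that unmapped sequences are reported and the run continues. A minimal Python sketch of that behavior follows. It is an assumption about the approach, not seqconv's real implementation. `convert_seqids` is a hypothetical name, and `mapping` is a plain dict of source ID to target ID.

```python
import sys


def convert_seqids(lines, mapping, warn=sys.stderr):
    """Yield GFF3 lines with column 1 (seqid) rewritten via `mapping`.

    Comment/directive lines and blank lines pass through unchanged.
    Features whose seqid has no mapping are kept as-is, and a warning
    is printed once per unmapped seqid.
    """
    missing = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.split("\t")
        new_id = mapping.get(fields[0])
        if new_id is None:
            if fields[0] not in missing:
                missing.add(fields[0])
                print(f"No corresponding id for {fields[0]}", file=warn)
            yield line
            continue
        fields[0] = new_id
        yield "\t".join(fields)
```

This sketch leaves `##sequence-region` directives untouched. A fuller version would rewrite the seqid there as well.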
childers commented 7 years ago

Text output:

$ head -n 20 test_salmon.gff3
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_assembly_report.txt
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ICSASG_v2
#!genome-build-accession NCBI_Assembly:GCF_000233375.1
#!annotation-date 22 September 2015
#!annotation-source NCBI Salmo salar Annotation Release 100
##sequence-region CM003279.1 1 159038749
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8030
CM003279.1  RefSeq  region  1   159038749   .   +   .   ID=id0;Dbxref=taxon:8030;Name=ssa01;breed=double haploid;chromosome=ssa01;dev-stage=adult;gbkey=Src;genome=chromosome;isolate=Sally;mol_type=genomic DNA;sex=female;tissue-type=muscle
CM003279.1  Gnomon  gene    5501    62139   .   -   .   ID=gene0;Dbxref=GeneID:106560212;Name=LOC106560212;gbkey=Gene;gene=LOC106560212;gene_biotype=protein_coding
CM003279.1  Gnomon  mRNA    5501    62139   .   -   .   ID=rna0;Parent=gene0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;Name=XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1  Gnomon  exon    61647   62139   .   -   .   ID=id1;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1  Gnomon  exon    43486   43714   .   -   .   ID=id2;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1  Gnomon  exon    23978   24241   .   -   .   ID=id3;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1  Gnomon  exon    16966   17019   .   -   .   ID=id4;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1  Gnomon  exon    5501    5691    .   -   .   ID=id5;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
CM003279.1  Gnomon  CDS 43486   43633   .   -   0   ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1
CM003279.1  Gnomon  CDS 23978   24241   .   -   2   ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1