Gaius-Augustus / TSEBRA

TSEBRA: Transcript Selector for BRAKER

How can we translate our TSEBRA result gtf to NCBI gtf #17

Closed ld9866 closed 2 years ago

ld9866 commented 2 years ago

Thank you for developing such good software! We now have very good results, but one problem remains. Our TSEBRA output contains features such as 5'-UTR and 3'-UTR, whereas the NCBI annotation has no such features. We want to convert our result to the NCBI standard format so that we can upload it to a public database. What should we do? Thank you!

TSEBRA:

chr1 Genome CDS 23458 23861 . - 0 transcript_id "chr1-_long_reads1.PB.2.1"; gene_id "chr1-_g_9667";
chr1 Genome intron 23862 27145 . - . transcript_id "chr1-_long_reads1.PB.2.1"; gene_id "chr1-_g_9667";
chr1 Genome CDS 27146 27199 . - 0 transcript_id "chr1-_long_reads1.PB.2.1"; gene_id "chr1-_g_9667";
chr1 Genome 5'-UTR 27200 27349 . - . transcript_id "chr1-_long_reads1.PB.2.1"; gene_id "chr1-_g_9667";
chr1 Genome intron 27350 30694 . - . transcript_id "chr1-_long_reads1.PB.2.1"; gene_id "chr1-_g_9667";
chr1 Genome 5'-UTR 30695 30811 . - . transcript_id "chr1-_long_reads1.PB.2.1"; gene_id "chr1-_g_9667";
chr1 Genome start_codon 30421 30423 . + 0 transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome CDS 30421 30640 0.9 + 0 transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome exon 30421 30640 . + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome intron 30641 30725 0.91 + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome CDS 30726 30880 0.97 + 2 transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome exon 30726 30880 . + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome intron 30881 32344 0.97 + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome CDS 32345 32634 0.97 + 0 transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome exon 32345 32634 . + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome intron 32635 36028 1 + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome CDS 36029 36136 1 + 1 transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome exon 36029 36136 . + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome intron 36137 39439 1 + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome CDS 39440 39521 1 + 1 transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome exon 39440 39521 . + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome intron 39522 41971 1 + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome CDS 41972 42101 1 + 0 transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome exon 41972 42101 . + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome intron 42102 47328 1 + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome CDS 47329 47435 1 + 2 transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome exon 47329 47435 . + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome intron 47436 48427 1 + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome CDS 48428 48613 1 + 0 transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome exon 48428 48613 . + . transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome stop_codon 48611 48613 . + 0 transcript_id "chr1+_anno1.g8215.t1"; gene_id "chr1+_g_9668";
chr1 Genome 5'-UTR 54464 54738 . + . transcript_id "chr1+_long_reads1.PB.4.1"; gene_id "chr1+_g_13180";
chr1 Genome CDS 54739 55059 . + 0 transcript_id "chr1+_long_reads1.PB.4.1"; gene_id "chr1+_g_13180";
chr1 Genome stop_codon 55057 55059 . + 0 transcript_id "chr1+_long_reads1.PB.4.1"; gene_id "chr1+_g_13180";
chr1 Genome 3'-UTR 55060 56507 . + . transcript_id "chr1+_long_reads1.PB.4.1"; gene_id "chr1+_g_13180";
chr1 Genome 3'-UTR 129829 131695 . - . transcript_id "chr1-_long_reads1.PB.5.1"; gene_id "chr1-_g_9671";
chr1 Genome stop_codon 131696 131698 . - 0 transcript_id "chr1-_long_reads1.PB.5.1"; gene_id "chr1-_g_9671";

NCBI:

chr14 Genome gene 131991494 132027228 . - . ID=gene-PKD2L1;Dbxref=GeneID:100154788;Name=PKD2L1;gbkey=Gene;gene=PKD2L1;gene_biotype=protein_coding;coverage=1.0;sequence_ID=0.992;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-PKD2L1_0
chr14 Genome mRNA 131991494 132027228 . - . ID=rna-XM_001927121.4;Parent=gene-PKD2L1;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Name=XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;excepti>
chr14 Genome exon 131991494 131992467 . - . ID=exon-XM_001927121.4-16;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclas>
chr14 Genome exon 131992777 131992900 . - . ID=exon-XM_001927121.4-15;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclas>
chr14 Genome exon 131993144 131993262 . - . ID=exon-XM_001927121.4-14;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclas>
chr14 Genome exon 131994031 131994157 . - . ID=exon-XM_001927121.4-13;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclas>
chr14 Genome exon 131994662 131994783 . - . ID=exon-XM_001927121.4-12;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclas>
chr14 Genome exon 131994996 131995094 . - . ID=exon-XM_001927121.4-11;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclas>
chr14 Genome exon 131995927 131996047 . - . ID=exon-XM_001927121.4-10;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclas>
chr14 Genome exon 131996349 131996530 . - . ID=exon-XM_001927121.4-9;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclass>
chr14 Genome exon 131996707 131996877 . - . ID=exon-XM_001927121.4-8;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclass>
chr14 Genome exon 131997540 131997768 . - . ID=exon-XM_001927121.4-7;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclass>
chr14 Genome exon 131997937 131998161 . - . ID=exon-XM_001927121.4-6;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclass>
chr14 Genome exon 131998786 131998859 . - . ID=exon-XM_001927121.4-5;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclass>
chr14 Genome exon 131998862 131999039 . - . ID=exon-XM_001927121.4-4;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclass>
chr14 Genome exon 131999891 132000018 . - . ID=exon-XM_001927121.4-3;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclass>
chr14 Genome exon 132026116 132026229 . - . ID=exon-XM_001927121.4-2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclass>
chr14 Genome exon 132026888 132027228 . - . ID=exon-XM_001927121.4-1;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XM_001927121.4;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclass>
chr14 Genome CDS 131992203 131992467 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131992777 131992900 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131993144 131993262 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131994031 131994157 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131994662 131994783 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131994996 131995094 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131995927 131996047 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131996349 131996530 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131996707 131996877 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131997540 131997768 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131997937 131998161 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131998786 131998859 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131998862 131999039 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 131999891 132000018 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 132026116 132026229 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr14 Genome CDS 132026888 132027122 . - . ID=cds-XP_001927156.2;Parent=rna-XM_001927121.4;Dbxref=GeneID:100154788,Genbank:XP_001927156.2;Name=XP_001927156.2;Note=The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exc>
chr2 Genome gene 150596267 150632364 . + . ID=gene-PKD2L2;Dbxref=GeneID:100520011;Name=PKD2L2;gbkey=Gene;gene=PKD2L2;gene_biotype=protein_coding;coverage=0.981;sequence_ID=0.981;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-PKD2L2_0

LarsGab commented 2 years ago

Hi,

the NCBI annotation you sent looks like it is in GFF3 format. You can convert TSEBRA output to GFF3 using the script rename_gtf.py from this TSEBRA repository and the script gtf2gff.pl from the scripts of AUGUSTUS. Let's say your output file is called tsebra.gtf; then you can convert it with:

rename_gtf.py --gtf tsebra.gtf --out tsebra_renamed.gtf
gtf2gff.pl < tsebra_renamed.gtf --out=tsebra_renamed.gff3 --gff3

You can find some more details for this in Issue https://github.com/Gaius-Augustus/TSEBRA/issues/9.
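To see which feature types a file actually contains before and after the conversion, a quick tally of column 3 can help. The following is only a sketch: feature_counts is a hypothetical helper name, and it assumes the file is tab-separated with nine columns, as GTF and GFF3 files normally are.

```shell
# Hypothetical helper: count records per feature type (column 3).
# Assumes a tab-separated file with nine columns; '#' comment lines are skipped.
feature_counts() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (f in n) print f "\t" n[f] }' "$1" | sort
}

# Example calls (file names are the ones used in this thread):
# feature_counts tsebra.gtf
# feature_counts tsebra_renamed.gff3
```

Running it on both the input and the converted file should make it easy to spot whether UTR lines are still present.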

If you want to remove all 3'-UTR and 5'-UTR lines from the annotation file, you can do that with the sed command in an Ubuntu shell (note that -i edits tsebra.gtf in place):

sed -i "/3'-UTR/d" tsebra.gtf
sed -i "/5'-UTR/d" tsebra.gtf
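The sed patterns delete every line that contains the string 3'-UTR or 5'-UTR anywhere, and they overwrite the input. If you would rather match only the feature column and keep the original file, something like the following might work. It is a sketch: strip_utr and the output file name are made up for illustration, and it assumes a tab-separated GTF.

```shell
# Hypothetical helper: print a GTF without UTR features.
# Only column 3 (the feature type) is compared; the input file is not modified.
# \047 is the octal escape for a single quote inside the awk string.
strip_utr() {
    awk -F'\t' '$3 != "5\047-UTR" && $3 != "3\047-UTR"' "$1"
}

# Example call (output name is illustrative):
# strip_utr tsebra.gtf > tsebra_noUTR.gtf
```

Because the result goes to a new file, the unfiltered annotation stays available if the UTR lines are needed later.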

I hope this helps. Best, Lars

ld9866 commented 2 years ago

Thank you for your patient reply! Best yours,

LarsGab commented 2 years ago

I am closing this issue because all questions have been answered or it has been inactive for a long time.