chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com
602 stars, 244 forks

Genbank Parsing Problem? #52

Closed mercutio22 closed 11 years ago

mercutio22 commented 12 years ago

Hi Brad, AnnotationSketch is complaining about the parsed file again:

GenomeTools error: CDS feature on line 27 in file "../../mirna-django/src/scripts/tp53.gff3" has the wrong phase 0 (should be 1)

I don't know if the problem is with their GFF3 parser though. Can you tell me what you think?

http://paste.debian.net/159462/

chapmanb commented 12 years ago

Hugo; I think that the phase is correct, but I'm happy to adjust it if the GenomeTools folks think otherwise. The GFF spec specifies the phase as 0, 1 or 2:

http://www.sequenceontology.org/gff3.shtml

while codon_start from the GenBank file is 1, 2 or 3:

http://www.ddbj.nig.ac.jp/FT/full_index.html#7.2

so I've made the adjustment from 1 to 0 in the GFF output when converting. Let me know if your interaction with the GenomeTools developers indicates I've missed something in the conversion.
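
For illustration, here is a minimal sketch of the phase arithmetic discussed above. It is not the converter's actual code; the function name `cds_phases` and its arguments are hypothetical. It assumes 1-based inclusive coordinates and that `codon_start` describes only the first (5'-most) CDS segment. Because GFF3 requires a phase on every CDS line, the phase of each later segment is derived from the lengths of the segments before it, which is one possible reason a validator could expect a non-zero phase on a later CDS line.

```python
def cds_phases(segments, codon_start=1, strand=1):
    """Return [((start, end), phase), ...] for the segments of one CDS.

    segments    -- list of (start, end) tuples, 1-based and inclusive
    codon_start -- GenBank /codon_start value (1, 2 or 3)
    strand      -- 1 for the plus strand, -1 for the minus strand
    """
    # Walk the segments in 5'->3' order: ascending on the plus strand,
    # descending on the minus strand.
    ordered = sorted(segments, reverse=(strand < 0))
    # GenBank codon_start is 1-based, GFF3 phase is 0-based.
    phase = codon_start - 1
    result = []
    for start, end in ordered:
        result.append(((start, end), phase))
        length = end - start + 1
        # Bases left over after the last complete codon decide how many
        # bases the next segment must skip to reach a codon boundary.
        phase = (3 - (length - phase) % 3) % 3
    return result


if __name__ == "__main__":
    for segment, phase in cds_phases([(1, 10), (21, 30)], codon_start=1):
        print(segment, phase)
```

With the two 10-base segments in the usage example, the first segment gets phase 0 and the second gets phase 2, because the first segment ends one base into a codon.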

mercutio22 commented 12 years ago

Thanks Brad. I will contact them and will let you know asap.


mercutio22 commented 12 years ago

Hi Brad, perhaps this might be useful for testing your program: http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

I tried it, and one thing the tool pointed out, for instance, is that the produced GFF3 file has a "source" feature type. IIRC Peter Cock says in one of his blog posts that GenBank has those but GFF3 does not.

Here, I paste you a sample report:

GFF3 File Validation Report

ontology_file(s):

http://song.cvs.sourceforge.net/*checkout*/song/ontology/so.obo

generated: 12-Mar-12 15:27:10

###############################################################################

THIS FILE HAS NOT BEEN VALIDATED, IT CONTAINS ERRORS, PLEASE REVIEW REPORT!

(NO WARNINGS HAVE BEEN ISSUED FOR THIS FILE)

###############################################################################

###############################################################################

THIS FILE HAS BEEN PROCESSED ENTIRELY AND ALL ERRORS/WARNINGS ARE REPORTED!

###############################################################################

First 10 lines of the analyzed GFF3 file follows:

# [line 1]> ##gff-version 3 [line 2]> ##sequence-region NG_017013.1 1 26144 [line 3]> NG_017013.1 annotation remark 1 26144 .
[line 3]> . . comment=REVIEWED%20REFSEQ%3A%20This%20record%20has%20been%20curated%20by%20NCBI%20staff%20in%0Acollaboration%20with%20Graham%20Taylor.%20The%20reference%20sequence%20was%0Aderived%20from%20AC087388.9%20and%20AC007421.13.%0AThis%20sequence%20is%20a%20reference%20standard%20in%20the%20RefSeqGene%20project.%0APublication%20Note%3A%20%20This%20RefSeq%20record%20includes%20a%20subset%20of%20the%0Apublications%20that%20are%20available%20for%20this%20gene.%20Please%20see%20the%20Gene%0Arecord%20to%20access%20additional%20publications.%0ASummary%3A%20This%20gene%20encodes%20tumor%20protein%20p53%2C%20which%20responds%20to%0Adiverse%20cellular%20stresses%20to%20regulate%20target%20genes%20that%20induce%20cell%0Acycle%20arrest%2C%20apoptosis%2C%20senescence%2C%20DNA%20repair%2C%20or%20changes%20in%0Ametabolism.%20p53%20protein%20is%20expressed%20at%20low%20level%20in%20normal%20cells%0Aand%20at%20a%20high%20level%20in%20a%20variety%20of%20transformed%20cell%20lines%2C%20where%0Ait%27s%20believed%20to%20contribute%20to%20transformation%20and%20malignancy.%20p53%0Ais%20a%20DNA-binding%20protein%20containing%20transcription%20activation%2C%0ADNA-binding%2C%20and%20oligomerization%20domains.%20It%20is%20postulated%20to%20bind%0Ato%20a%20p53-binding%20site%20and%20activate%20expression%20of%20downstream%20genes%0Athat%20inhibit%20growth%20and/or%20invasion%2C%20and%20thus%20function%20as%20a%20tumor%0Asuppressor.%20Mutants%20of%20p53%20that%20frequently%20occur%20in%20a%20number%20of%0Adifferent%20human%20cancers%20fail%20to%20bind%20the%20consensus%20DNA%20binding%0Asite%2C%20and%20hence%20cause%20the%20loss%20of%20tumor%20suppressor%20activity.%0AAlterations%20of%20this%20gene%20occur%20not%20only%20as%20somatic%20mutations%20in%0Ahuman%20malignancies%2C%20but%20also%20as%20germline%20mutations%20in%20some%0Acancer-prone%20families%20with%20Li-Fraumeni%20syndrome.%20Multiple%20p53%0Avariants%20due%20to%20alternative%20promoters%20and%20multiple%20alternative%0Aspli
cing%20have%20been%20found.%20These%20variants%20encode%20distinct%20isoforms%2C%0Awhich%20can%20regulate%20p53%20transcriptional%20activity.%20%5Bprovided%20by%0ARefSeq%2C%20Jul%202008%5D.; [line 3]> sequence_version=1;source=Homo%20sapiens%20%28human%29; [line 3]> taxonomy=Eukaryota,Metazoa,Chordata, [line 3]> Craniata,Vertebrata,Euteleostomi, [line 3]> Mammalia,Eutheria,Euarchontoglires, [line 3]> Primates,Haplorrhini,Catarrhini, [line 3]> Hominidae,Homo;keywords=RefSeqGene; [line 3]> references=location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Marcel%2CV.%2C%20Tran%2CP.L.%2C%20Sagne%2CC.%2C%20Martel-Planche%2CG.%2C%20Vaslin%2CL.%2C%20Teulade-Fichou%2CM.P.%2C%20Hall%2CJ.%2C%20Mergny%2CJ.L.%2C%20Hainaut%2CP.%20and%20Van%20Dyck%2CE.%0Atitle%3A%20G-quadruplex%20structures%20in%20TP53%20intron%203%3A%20role%20in%20alternative%20splicing%20and%20in%20production%20of%20p53%20mRNA%20isoforms%0Ajournal%3A%20Carcinogenesis%2032%20%283%29%2C%20271-278%20%282011%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%2021112961%0Acomment%3A, [line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Naidu%2CS.R.%2C%20Love%2CI.M.%2C%20Imbalzano%2CA.N.%2C%20Grossman%2CS.R.%20and%20Androphy%2CE.J.%0Atitle%3A%20The%20SWI/SNF%20chromatin%20remodeling%20subunit%20BRG1%20is%20a%20critical%20regulator%20of%20p53%20necessary%20for%20proliferation%20of%20malignant%20cells%0Ajournal%3A%20Oncogene%2028%20%2827%29%2C%202492-2501%20%282009%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%2019448667%0Acomment%3A, [line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Anczukow%2CO.%2C%20Ware%2CM.D.%2C%20Buisson%2CM.%2C%20Zetoune%2CA.B.%2C%20Stoppa-Lyonnet%2CD.%2C%20Sinilnikova%2CO.M.%20and%20Mazoyer%2CS.%0Atitle%3A%20Does%20the%20nonsense-mediated%20mRNA%20decay%20mechanism%20prevent%20the%20synthesis%20of%20truncated%20BRCA1%2C%20CHK2%2C%20and%20p53%20proteins%3F%0Ajournal%3A%20Hum.%20Mutat.%2029%20%281%29%2C%2065-73%20%282008%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%2017694537%0Acomment%3A, [line 3]> 
location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Bourdon%2CJ.C.%0Atitle%3A%20p53%20Family%20isoforms%0Ajournal%3A%20Curr%20Pharm%20Biotechnol%208%20%286%29%2C%20332-336%20%282007%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%2018289041%0Acomment%3A%20Review%20article, [line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Murray-Zmijewski%2CF.%2C%20Lane%2CD.P.%20and%20Bourdon%2CJ.C.%0Atitle%3A%20p53/p63/p73%20isoforms%3A%20an%20orchestra%20of%20isoforms%20to%20harmonise%20cell%20differentiation%20and%20response%20to%20stress%0Ajournal%3A%20Cell%20Death%20Differ.%2013%20%286%29%2C%20962-972%20%282006%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%2016601753%0Acomment%3A%20Review%20article, [line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Flaman%2CJ.M.%2C%20Waridel%2CF.%2C%20Estreicher%2CA.%2C%20Vannier%2CA.%2C%20Limacher%2CJ.M.%2C%20Gilbert%2CD.%2C%20Iggo%2CR.%20and%20Frebourg%2CT.%0Atitle%3A%20The%20human%20tumour%20suppressor%20gene%20p53%20is%20alternatively%20spliced%20in%20normal%20cells%0Ajournal%3A%20Oncogene%2012%20%284%29%2C%20813-818%20%281996%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%208632903%0Acomment%3A, [line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Lamb%2CP.%20and%20Crawford%2CL.%0Atitle%3A%20Characterization%20of%20the%20human%20p53%20gene%0Ajournal%3A%20Mol.%20Cell.%20Biol.%206%20%285%29%2C%201379-1385%20%281986%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%202946935%0Acomment%3A, [line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Harlow%2CE.%2C%20Williamson%2CN.M.%2C%20Ralston%2CR.%2C%20Helfman%2CD.M.%20and%20Adams%2CT.E.%0Atitle%3A%20Molecular%20cloning%20and%20in%20vitro%20expression%20of%20a%20cDNA%20clone%20for%20human%20cellular%20tumor%20antigen%20p53%0Ajournal%3A%20Mol.%20Cell.%20Biol.%205%20%287%29%2C%201601-1610%20%281985%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%203894933%0Acomment%3A, [line 3]> 
location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Zakut-Houri%2CR.%2C%20Bienz-Tadmor%2CB.%2C%20Givol%2CD.%20and%20Oren%2CM.%0Atitle%3A%20Human%20p53%20cellular%20tumor%20antigen%3A%20cDNA%20sequence%20and%20expression%20in%20COS%20cells%0Ajournal%3A%20EMBO%20J.%204%20%285%29%2C%201251-1255%20%281985%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%204006916%0Acomment%3A, [line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Matlashewski%2CG.%2C%20Lamb%2CP.%2C%20Pim%2CD.%2C%20Peacock%2CJ.%2C%20Crawford%2CL.%20and%20Benchimol%2CS.%0Atitle%3A%20Isolation%20and%20characterization%20of%20a%20human%20p53%20cDNA%20clone%3A%20expression%20of%20the%20human%20p53%20gene%0Ajournal%3A%20EMBO%20J.%203%20%2813%29%2C%203257-3262%20%281984%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%206396087%0Acomment%3A; [line 3]> accessions=NG_017013;data_file_division=PRI; [line 3]> date=19-FEB-2012;organism=Homo%20sapiens; [line 3]> gi=293651587 [line 4]> NG_017013.1 feature source 1 26144 . + .
[line 4]> db_xref=taxon%3A9606;mol_type=genomic%20DNA; [line 4]> organism=Homo%20sapiens;chromosome=17; [line 4]> map=17p13.1 [line 5]> NG_017013.1 feature gene 1 6475 . - .
[line 5]> note=WD%20repeat%20containing%2C%20antisense%20to%20TP53; [line 5]> db_xref=GeneID%3A55135,HGNC%3A25522, [line 5]> MIM%3A612661;gene=WRAP53;gene_synonym=DKCB3%3B%20TCAB1%3B%20WDR79 [line 6]> NG_017013.1 feature mRNA 2845 6475 . - .
[line 6]> db_xref=GI%3A221136857,GeneID%3A55135, [line 6]> HGNC%3A25522,MIM%3A612661;product=WD%20repeat%20containing%2C%20antisense%20to%20TP53%2C%20transcript%20variant%202; [line 6]> transcript_id=NM_001143990.1;inference=similar%20to%20RNA%20sequence%2C%20mRNA%20%28same%20species%29%3ARefSeq%3ANM_001143990.1; [line 6]> exception=annotated%20by%20transcript%20or%20proteomic%20data; [line 6]> gene=WRAP53;gene_synonym=DKCB3%3B%20TCAB1%3B%20WDR79; [line 6]> ID=NM_001143990.1 [line 7]> NG_017013.1 feature mRNA 2845 2956 . - .
[line 7]> Parent=NM_001143990.1 [line 8]> NG_017013.1 feature mRNA 3224 3322 . - .
[line 8]> Parent=NM_001143990.1 [line 9]> NG_017013.1 feature mRNA 3467 3898 . - .
[line 9]> Parent=NM_001143990.1 [line 10]> NG_017013.1 feature mRNA 6322 6475 . - .
[line 10]> Parent=NM_001143990.1

...

Line Number Error/Warning


4 [ERROR] invalid type (type: source)
7 [ERROR] invalid type pair - check all parents (at line 6; mRNA to mRNA)
12 [ERROR] invalid type pair - check all parents (at line 11; mRNA to mRNA)
17 [ERROR] invalid type pair - check all parents (at line 16; mRNA to mRNA)
22 [ERROR] invalid type pair - check all parents (at line 21; mRNA to mRNA)
26 [ERROR] invalid type pair - check all parents (at line 25; CDS to CDS)
30 [ERROR] invalid type pair - check all parents (at line 29; CDS to CDS)
34 [ERROR] invalid type pair - check all parents (at line 33; CDS to CDS)
38 [ERROR] invalid type pair - check all parents (at line 37; CDS to CDS)
44 [ERROR] invalid type pair - check all parents (at line 43; mRNA to mRNA)
56 [ERROR] invalid type pair - check all parents (at line 55; mRNA to mRNA)
69 [ERROR] invalid type pair - check all parents (at line 68; mRNA to mRNA)
82 [ERROR] invalid type pair - check all parents (at line 81; mRNA to mRNA)
94 [ERROR] invalid type pair - check all parents (at line 93; mRNA to mRNA)
113 [ERROR] invalid type pair - check all parents (at line 112; CDS to CDS)
124 [ERROR] invalid type pair - check all parents (at line 123; CDS to CDS)
135 [ERROR] invalid type pair - check all parents (at line 134; CDS to CDS)
145 [ERROR] invalid type pair - check all parents (at line 144; CDS to CDS)
162 [ERROR] invalid type pair - check all parents (at line 161; CDS to CDS)
171 [ERROR] invalid type pair - check all parents (at line 170; mRNA to mRNA)
180 [ERROR] invalid type pair - check all parents (at line 179; mRNA to mRNA)
189 [ERROR] invalid type pair - check all parents (at line 188; mRNA to mRNA)
206 [ERROR] invalid type pair - check all parents (at line 205; CDS to CDS)
214 [ERROR] invalid type pair - check all parents (at line 213; CDS to CDS)
221 [ERROR] invalid type pair - check all parents (at line 220; CDS to CDS)


chapmanb commented 12 years ago

Hugo; Thanks for this. The validator is complaining about 'source' not being present in the Sequence Ontology. Mapping GenBank to SO is a fairly large problem. I tried to tackle this a few years back but it ended up being too much work. Here's the progress I made:

http://bcbio.wordpress.com/2008/12/14/standard-ontologies-in-biosql/

Practically, most tools will not enforce this requirement, so, being unable to map the entire thing, I took the approach of keeping the output GFF similar to the input GenBank. If you wanted to take on a mapping of GenBank to Sequence Ontology, I'd be happy to incorporate it.

Is GenomeTools requiring the ontology matches, or just that online validator?
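
As a rough idea of what a partial GenBank-to-Sequence-Ontology mapping could look like, here is a small sketch. The table `GENBANK_TO_SO` and the helper `so_type` are hypothetical and not part of bcbb. The few entries shown are illustrative assumptions (for example, writing a GenBank `source` feature as the SO term `region`), and a real mapping would need to cover far more feature keys.

```python
# Hypothetical, deliberately incomplete lookup table:
# GenBank feature key -> Sequence Ontology term (assumed equivalents).
GENBANK_TO_SO = {
    "source": "region",  # assumption: whole-record feature written as a region
    "gene": "gene",
    "mRNA": "mRNA",
    "CDS": "CDS",
}


def so_type(genbank_key, parent_type=None):
    """Pick a GFF3 type for a GenBank feature key.

    parent_type is the GFF3 type of the parent feature, if any.
    Keys missing from the table are passed through unchanged.
    """
    # Sub-locations of a spliced mRNA are written here as exons, so that a
    # child line does not repeat its parent's type (the "mRNA to mRNA"
    # pairs flagged in the validation report).
    if genbank_key == "mRNA" and parent_type == "mRNA":
        return "exon"
    return GENBANK_TO_SO.get(genbank_key, genbank_key)


if __name__ == "__main__":
    print(so_type("source"))
    print(so_type("mRNA", parent_type="mRNA"))
```

The sketch does not address the "CDS to CDS" pairs in the report; those would need the CDS segments attached to a different parent rather than renamed.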

mercutio22 commented 12 years ago

Hi Brad,

> Is GenomeTools requiring the ontology matches, or just that online validator?

Hmm, it seems to be only the validator. GenomeTools seems to be complaining only about that "phase" field.

I have already posted your considerations on their issue tracker. I will let you know what they say when I get a reply. In any case, thanks for taking the time to look at my problem.

chapmanb commented 12 years ago

Thanks Hugo -- let me know if there ends up being anything I can change on my end to improve the phase information. Hopefully that'll do it and get things working smoothly with GenomeTools. Thanks for your patience with this.

chapmanb commented 11 years ago

Hugo; I'm going to close this to clean up the issues. Hopefully everything was solved on the GenomeTools side. Thanks