chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com

flybase GFF parsing error #37

Status: Closed (vineeth-s closed this issue 13 years ago)

vineeth-s commented 13 years ago

In lines 773-775 of gff/BCBio/GFF/GFFParser.py:

            if line.strip() and line.strip()[0] != "#":
                parts = [p.strip() for p in line.split('\t')]
                assert len(parts) == 9, line

a tab is not expected in any of the fields. However, the FlyBase GFF files do have the occasional tab in the names or descriptions of genes in the last field, and then the parser breaks on the assertion.

Could this not be changed to join any extra fields back into the ninth column, like so:

            if len(parts) > 9:
                temp_parts = parts[0:8]
                # join the overflow from the original list, not from temp_parts
                last_part = " ".join(parts[8:])
                temp_parts.append(last_part)
                parts = temp_parts

?

I have made this change in my local copy, and this seems to work fine.
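For anyone who wants to try the idea outside the parser, here is a minimal standalone sketch of it. The function name `split_gff_line` and the two sample lines are made up for illustration and are not part of BCBio.GFF. It also assumes that any columns beyond the ninth come from stray tabs in the attributes field and that a single space is an acceptable replacement for each of them.

```python
def split_gff_line(line):
    """Split one GFF feature line into exactly nine columns.

    Hypothetical helper, not part of BCBio.GFF. Fields beyond the ninth
    are assumed to come from stray tabs inside the attributes column and
    are joined back together with single spaces.
    """
    parts = [p.strip() for p in line.split("\t")]
    if len(parts) > 9:
        # keep the first eight columns, merge everything else into the ninth
        parts = parts[:8] + [" ".join(parts[8:])]
    assert len(parts) == 9, line
    return parts


# Invented sample lines: the second has a tab inside the attributes column.
clean = "2R\tFlyBase\tgene\t1\t100\t.\t-\t.\tID=g1;Name=a b"
broken = "2R\tFlyBase\tgene\t1\t100\t.\t-\t.\tID=g1;Name=a\tb"

print(split_gff_line(clean)[8])
print(split_gff_line(broken)[8])
```

Lines with fewer than nine columns still fail the assertion, so only the extra-tab case is relaxed.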

chapmanb commented 13 years ago

Thanks also for reporting this. Would you be able to point me to a FlyBase GFF file that has this problem? I'd like to include it in the test suite along with the fix. Thanks much,

Brad

vineeth-s commented 13 years ago

Hi Brad,

This is from ftp://ftp.flybase.org/genomes/Drosophila_melanogaster/dmel_r5.38_FB2011_06/gff/dmel-all-r5.38.gff.gz

The offending line is : 2R FlyBase gene 12212318 12224142 . - . ID=FBgn0262511;Name=Vha44;fullname=Vacuolar H[+] ATPase 44kD C subunit;Alias=FBgn0020611,FBgn0065466,vha44,CG8048,vacuolar ATPase C-subunit,C subunit,V-ATPase,l(2)6072,V-ATPase C subunit,6072,Vacuolar H+ ATPase 44kD C subunit,lethal (2) SH1339,l(2)SH1339,l(2)SH2 1339,Vacuolar H[+] ATPase 44 kDa subunit;Ontology_term=SO:0000010,SO:0000087,GO:0015992,GO:0008553,GO:0000221,GO:0007557,GO:0005886,SO:0000704,GO:0015991;Dbxref=FlyBase:FBan0008048,FlyBase_Annotation_IDs:CG8048,GB_protein:AAF58011,GB_protein:ACL83133,GB_protein:AAM68515,GB_protein:AAF58012,GB_protein:AAF58013,GB:AA392603,GB:AA699128,GB:AA801907,GB:AC009356,GB:AF006646,GB:AF006655,GB_protein:AAB62571,GB:AI946790,GB:AW941461,GB:AX093887,GB:AY061038,GB_protein:AAL28586,GB:AY102676,GB_protein:AAM27505,GB:BG638027,GB:BH615050,GB:BH854641,GB:BI580507,GB:BT015974,GB_protein:AAV36859,GB:CL705904,GB:CZ489279,UniProt/Swiss-Prot:Q9V7N5,INTERPRO:IPR004907,EntrezGene:36826,GB:AB082463,GenomeRNAi:36826;gbunit=AE013599;derived_computed_cyto=53B5-53C1

There is a tab in "ATPase 44 kDa subunit".

This seems to have been corrected in r5.39, though, which is available here: ftp://ftp.flybase.org/genomes/Drosophila_melanogaster/dmel_r5.39_FB2011_07/gff/dmel-all-r5.39.gff.gz

I guess it is your call whether you want to take care of database-specific idiosyncrasies.

Vineeth

chapmanb commented 13 years ago

Vineeth, if this is fixed in the FlyBase GFF then I'd prefer to leave it out. Extra tabs are definitely off-specification, and adding the extra checks and joins would slow down parsing for correct files. In general I've tried to add fixes for persistent off-spec problems, but if this was only temporary, the best bet is to use the fixed GFF. Thanks again for bringing this up,

Brad
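For readers stuck with an affected release such as r5.38, one option that leaves the parser strict is to repair the file in a separate pass first. The sketch below is only an illustration under stated assumptions: `repair_extra_tabs` is a hypothetical name, the sample text is invented, and replacing each stray tab with a single space is a guess at what the original field should have contained.

```python
import io


def repair_extra_tabs(lines, n_columns=9):
    """Yield GFF lines with overflow columns merged into the last column.

    Hypothetical pre-processing helper, not part of BCBio.GFF. Comment and
    directive lines, and any line with n_columns - 1 tabs or fewer, pass
    through unchanged.
    """
    for line in lines:
        if not line.lstrip().startswith("#") and line.count("\t") >= n_columns:
            body = line.rstrip("\r\n")
            ending = line[len(body):]  # preserve the original line ending
            parts = body.split("\t")
            # assumption: a stray tab can be replaced by a single space
            last = " ".join(parts[n_columns - 1:])
            line = "\t".join(parts[:n_columns - 1] + [last]) + ending
        yield line


# Invented sample input: one directive, one correct line, one line with a stray tab.
raw = (
    "##gff-version 3\n"
    "2R\tFlyBase\tgene\t1\t100\t.\t-\t.\tID=g1;Name=ok\n"
    "2R\tFlyBase\tgene\t5\t900\t.\t-\t.\tID=g2;Name=44\tkDa\n"
)
fixed = list(repair_extra_tabs(io.StringIO(raw)))
print("".join(fixed), end="")
```

The repaired lines could be written to a new file and parsed as usual. Correct lines only pay for one `count` call, and sequence lines in a trailing FASTA section contain no tabs, so they are passed through untouched.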