chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com

GFF: parse_simple fails in some cases #80

Closed khughitt closed 10 years ago

khughitt commented 10 years ago

To reproduce:

from BCBio import GFF

# http://www.broadinstitute.org/annotation/gebo/help/data/gff3/transcripts.gff3
infile = 'transcripts.gff3'

list(GFF.parse_simple(open(infile)))

Error output:

KeyError                                  Traceback (most recent call last)
<ipython-input-10-32ebcd95dc0c> in <module>()
----> 1 list(GFF.parse_simple(open(infile)))

/home/keith/software/bcbb/gff/BCBio/GFF/GFFParser.pyc in parse_simple(gff_files, limit_info)
    721     parser = GFFParser()
    722     for rec in parser.parse_simple(gff_files, limit_info=limit_info):
--> 723         yield rec["child"][0]
    724 
    725 def _file_or_handle(fn):

KeyError: 'child'

I checked the results of parser.parse_simple(gff_files, limit_info=limit_info) and some of the records it yields contain only a 'parent' key and no 'child' key, which is what triggers the KeyError.

E.g., for the above file:

[{'parent': [{'id': 'newGene',
    'is_gff2': False,
    'location': [499, 2610],
    'quals': {'ID': ['newGene']},
    'rec_id': 'edit_test.fa',
    'strand': 1,
    'type': 'gene'}]},
 {'child': [{'id': 't1',
    'is_gff2': False,
    'location': [499, 2385],
    'quals': {'ID': ['t1'],
     'Name': ['t1(newGene)'],
     'Namo': ['reinhard+did+this'],
     'Parent': ['newGene'],
     'uri': ['http://www.yahoo.com']},
    'rec_id': 'edit_test.fa',
    'strand': 1,
    'type': 'mRNA'}]},
   ...
]

If you want to treat the parent and child nodes the same, a simple fix would be:

yield rec.get('child', rec.get('parent'))[0]
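
To try the fallback without BCBio installed, it can be run on plain dictionaries shaped like the records above. The helper name `first_entry` and the trimmed-down records are made up for this sketch; only the `.get` expression comes from the proposed fix.

```python
# Trimmed-down records shaped like the parser output shown above.
records = [
    {"parent": [{"id": "newGene", "type": "gene"}]},
    {"child": [{"id": "t1", "type": "mRNA"}]},
]


def first_entry(rec):
    """Hypothetical helper: first child entry if present, else first parent entry."""
    return rec.get("child", rec.get("parent"))[0]


# rec["child"][0] would raise KeyError on the first record; the fallback does not.
ids = [first_entry(rec)["id"] for rec in records]
print(ids)
```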

Hopefully this time it is an actual issue and not just a misunderstanding on my part :)

If the above solution is appropriate, I would be glad to submit a patch.

chapmanb commented 10 years ago

Keith, thanks for reporting this problem. Your solution is exactly right; I generalized it a bit to also handle directive lines and pushed the fix. Thanks again for catching this one, and let me know if you run into any other problems at all.
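
One way such a generalization might look is sketched below. This is not the code that was pushed: the function name `iter_simple_features`, the `"directive"` key, and the sample records are assumptions for illustration. A record with neither `child` nor `parent` would make the one-line fallback fail with a `TypeError` (indexing `None`), so this version skips records that carry no feature list.

```python
def iter_simple_features(records, keys=("child", "parent")):
    """Yield the first entry under the first matching key; skip records with none.

    Hypothetical helper for illustration, not part of BCBio.
    """
    for rec in records:
        for key in keys:
            if key in rec:
                yield rec[key][0]
                break
        # Records with none of the keys (directives, by assumption) fall through
        # the inner loop without yielding anything.


sample = [
    {"directive": "gff-version 3"},  # assumed shape of a directive record
    {"parent": [{"id": "newGene", "type": "gene"}]},
    {"child": [{"id": "t1", "type": "mRNA"}]},
]

features = list(iter_simple_features(sample))
print([f["id"] for f in features])
```

Passing the keys as a parameter keeps the lookup order explicit, so further record types could be added without touching the loop.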