chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com

GFFparser only keeps first line #79

Closed khughitt closed 10 years ago

khughitt commented 10 years ago

There appears to be a bug in the current GFFParser implementation which causes it to ignore all lines but the first one.

It looks like the issue occurs somewhere in the call to self._lines_to_out_info(line_gen, limit_info, target_lines) in GFFParser.py on line 609.

Before this line is executed, the line generator contains all of the lines. However, the loop iterates only once.

I tested this using the sample GFF3 file from the Broad Institute:

edit_test.fa    .   gene    500 2610    .   +   .   ID=newGene
edit_test.fa    .   mRNA    500 2385    .   +   .   Parent=newGene;Namo=reinhard+did+this;Name=t1%28newGene%29;ID=t1;uri=http%3A//www.yahoo.com
edit_test.fa    .   five_prime_UTR  500 802 .   +   .   Parent=t1
edit_test.fa    .   CDS 803 1012    .   +   .   Parent=t1
etc...

The problem occurs in both Python 2.7.5 and 3.3.2.

chapmanb commented 10 years ago

Keith, the three subsequent lines are nested under the parent gene, so the output appears as a single feature with subfeatures. The parser handles the nesting for you, so you get a tree structure like:

gene
 |--- mRNA
      |--- five_prime_UTR
      |--- CDS
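To make the nesting concrete, here is a minimal standard-library sketch of how Parent attributes turn flat GFF3 lines into a tree. It is not the GFFParser implementation; the helper names (parse_attributes, build_tree, render) are made up for illustration, and the mRNA attributes are shortened from the sample file.

```python
from collections import defaultdict

# Sample lines from the issue (tab-separated GFF3 columns).
# The mRNA attribute column is shortened for readability.
GFF_LINES = [
    "edit_test.fa\t.\tgene\t500\t2610\t.\t+\t.\tID=newGene",
    "edit_test.fa\t.\tmRNA\t500\t2385\t.\t+\t.\tParent=newGene;ID=t1",
    "edit_test.fa\t.\tfive_prime_UTR\t500\t802\t.\t+\t.\tParent=t1",
    "edit_test.fa\t.\tCDS\t803\t1012\t.\t+\t.\tParent=t1",
]


def parse_attributes(column):
    """Turn 'key=value;key=value' into a dict (hypothetical helper)."""
    return dict(pair.split("=", 1) for pair in column.split(";") if pair)


def build_tree(lines):
    """Return (top_level, children).

    top_level holds features without a Parent attribute;
    children maps a parent ID to the features that name it as Parent.
    """
    top_level, children = [], defaultdict(list)
    for line in lines:
        cols = line.split("\t")
        attrs = parse_attributes(cols[8])
        feature = {"type": cols[2], "id": attrs.get("ID")}
        if "Parent" in attrs:
            children[attrs["Parent"]].append(feature)
        else:
            top_level.append(feature)
    return top_level, children


def render(feature, children, depth=0):
    """Return the feature types as indented lines, depth first."""
    out = ["  " * depth + feature["type"]]
    for child in children.get(feature["id"], []):
        out.extend(render(child, children, depth + 1))
    return out


top_level, children = build_tree(GFF_LINES)
for feature in top_level:
    print("\n".join(render(feature, children)))
```

Only the gene ends up at the top level, which is why iterating over the top-level features can look as if the parser kept just the first line.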

If you don't want nesting, you can use parse_simple and get back a dictionary representation, or pass target_lines=1 if you want a non-nested Biopython record representation. Hope this helps.

khughitt commented 10 years ago

Got it. Thanks for clearing things up! I think I just need to spend some more time familiarizing myself with Biopython.

In case this helps anyone else, here is some code to traverse the above example file:

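from BCBio import GFF

# GFF.parse yields one SeqRecord per sequence ID; genes are its top-level features.
with open('transcripts.gff3') as fp:
    for rec in GFF.parse(fp):
        for gene in rec.features:
            for mrna in gene.sub_features:
                # Features nested under each mRNA (five_prime_UTR, CDS, ...)
                for feature in mrna.sub_features:
                    print(feature)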

Thanks again!