chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com
603 stars 243 forks source link

Missing column 1 in gff3 data #97

Closed hexylena closed 9 years ago

hexylena commented 9 years ago

I'm trying to handle some data from InterProScan, and I'm having a wee bit of trouble due to what seems to be missing support for gff3 column 1 data. Here's a snippet of the gff3 I'm receiving:

##gff-version 3
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269
##sequence-region Merlin|266_Merlin_272 1 380
Merlin|266_Merlin_272   .   polypeptide 1   380 .   +   .   md5=6644d7fb6031305c7deb695f447f2970;ID=Merlin|266_Merlin_272
Merlin|266_Merlin_272   ProSitePatterns protein_match   114 130 .   +   .   Name=PS00368;signature_desc=Ribonucleotide reductase small subunit signature.;Target=Merlin|266_Merlin_272 114 130;status=T;ID=match$1_114_130;Ontology_term="GO:0009186","GO:0055114";date=23-02-2015;Dbxref="InterPro:IPR000358","KEGG:00230+1.17.4.1","KEGG:00240+1.17.4.1","KEGG:00480+1.17.4.1","MetaCyc:PWY-6545","MetaCyc:PWY-7184","MetaCyc:PWY-7198","MetaCyc:PWY-7210","MetaCyc:PWY-7220","MetaCyc:PWY-7222","MetaCyc:PWY-7226","MetaCyc:PWY-7227","Reactome:REACT_1698","UniPathway:UPA00326"
Merlin|266_Merlin_272   SUPERFAMILY protein_match   3   343 .   +   .   Name=SSF47240;Target=Merlin|266_Merlin_272 3 343;status=T;ID=match$2_3_343;date=23-02-2015;Dbxref="InterPro:IPR009078"
Merlin|266_Merlin_272   PANTHER protein_match   11  323 .   +   .   Name=PTHR23409;Target=Merlin|266_Merlin_272 11 323;status=T;ID=match$3_11_323;Ontology_term="GO:0009186","GO:0055114";date=23-02-2015;Dbxref="InterPro:IPR000358","KEGG:00230+1.17.4.1","KEGG:00240+1.17.4.1","KEGG:00480+1.17.4.1","MetaCyc:PWY-6545","MetaCyc:PWY-7184","MetaCyc:PWY-7198","MetaCyc:PWY-7210","MetaCyc:PWY-7220","MetaCyc:PWY-7222","MetaCyc:PWY-7226","MetaCyc:PWY-7227","Reactome:REACT_1698","UniPathway:UPA00326"
Merlin|266_Merlin_272   Pfam    protein_match   30  153 3.0E-12 +   .   Name=PF00268;signature_desc=Ribonucleotide reductase, small chain;Target=Merlin|266_Merlin_272 30 153;status=T;ID=match$4_30_153;Ontology_term="GO:0009186","GO:0055114";date=23-02-2015;Dbxref="InterPro:IPR000358","KEGG:00230+1.17.4.1","KEGG:00240+1.17.4.1","KEGG:00480+1.17.4.1","MetaCyc:PWY-6545","MetaCyc:PWY-7184","MetaCyc:PWY-7198","MetaCyc:PWY-7210","MetaCyc:PWY-7220","MetaCyc:PWY-7222","MetaCyc:PWY-7226","MetaCyc:PWY-7227","Reactome:REACT_1698","UniPathway:UPA00326"
Merlin|266_Merlin_272   Pfam    protein_match   201 313 2.6E-9  +   .   Name=PF00268;signature_desc=Ribonucleotide reductase, small chain;Target=Merlin|266_Merlin_272 201 313;status=T;ID=match$4_201_313;Ontology_term="GO:0009186","GO:0055114";date=23-02-2015;Dbxref="InterPro:IPR000358","KEGG:00230+1.17.4.1","KEGG:00240+1.17.4.1","KEGG:00480+1.17.4.1","MetaCyc:PWY-6545","MetaCyc:PWY-7184","MetaCyc:PWY-7198","MetaCyc:PWY-7210","MetaCyc:PWY-7220","MetaCyc:PWY-7222","MetaCyc:PWY-7226","MetaCyc:PWY-7227","Reactome:REACT_1698","UniPathway:UPA00326"
Merlin|266_Merlin_272   Gene3D  protein_match   2   341 2.2E-109    +   .   Name=G3DSA:1.10.620.20;Target=Merlin|266_Merlin_272 2 341;status=T;ID=match$5_2_341;Ontology_term="GO:0016491","GO:0055114";date=23-02-2015;Dbxref="InterPro:IPR012348","Reactome:REACT_1698"
##sequence-region Merlin|179_Merlin_183 1 455
Merlin|179_Merlin_183   .   polypeptide 1   455 .   +   .   md5=cc9f55ac9c0744d7adc1fe519239284b;ID=Merlin|179_Merlin_183
Merlin|179_Merlin_183   ProSiteProfiles protein_match   143 413 .   +   .   Name=PS51199;signature_desc=Superfamily 4 helicase domain profile.;Target=Merlin|179_Merlin_183 143 413;status=T;ID=match$6_143_413;Ontology_term="GO:0003678","GO:0005524","GO:0006260";date=23-02-2015;Dbxref="InterPro:IPR007694"
Merlin|179_Merlin_183   Gene3D  protein_match   143 391 3.1E-28 +   .   Name=G3DSA:3.40.50.300;Target=Merlin|179_Merlin_183 143 391;status=T;ID=match$7_143_391;date=23-02-2015;Dbxref="InterPro:IPR027417"
Merlin|179_Merlin_183   Pfam    protein_match   158 376 1.4E-15 +   .   Name=PF03796;signature_desc=DnaB-like helicase C terminal domain;Target=Merlin|179_Merlin_183 158 376;status=T;ID=match$8_158_376;Ontology_term="GO:0003678","GO:0005524","GO:0006260";date=23-02-2015;Dbxref="InterPro:IPR007694"
Merlin|179_Merlin_183   SUPERFAMILY protein_match   261 396 .   +   .   Name=SSF52540;Target=Merlin|179_Merlin_183 261 396;status=T;ID=match$9_261_396;date=23-02-2015;Dbxref="InterPro:IPR027417"
Merlin|179_Merlin_183   SUPERFAMILY protein_match   149 218 .   +   .   Name=SSF52540;Target=Merlin|179_Merlin_183 149 218;status=T;ID=match$9_149_218;date=23-02-2015;Dbxref="InterPro:IPR027417"

The id attribute of the produced SeqFeature contains ID, which is useful for the match_part but not for the protein_match sections. I really need access to the first column to reliably detect the parent sequence, since the output, as it stands, doesn't use the Parent tag to ensure hierarchy of matches.

Would there be any interest in a PR to add support for this?

chapmanb commented 9 years ago

Eric; Sorry about any confusion. When parsing to Biopython objects, the first column is in the SeqRecord as the id attribute, not in the records SeqFeatures. The idea is that the first column describes the parent sequence for everything gets ordered under that as features. So if you're iterating and need access to this, you can do:

for rec in GFF.parse(your_file):
    for feature in rec.features:
         print rec.id, feature.id

Hope this gives you what you need. for this.

Also, if you're exploring GFF parsing in Python more, it would be worth looking at Ryan's gffutils (https://github.com/daler/gffutils) which builds off of this library but has a lot of additional work.

hexylena commented 9 years ago

Oh, heavens, you're right, that is the record id. I'd completely forgotten that the record had an id attribute. Thanks so much!