chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com
603 stars 243 forks source link

GFF3 parent / id remapping issue. #109

Open hexylena opened 8 years ago

hexylena commented 8 years ago

Just filing it since even though this library doesn't seem to see a lot of activity lately.

Script:

#!/usr/bin/env python
import sys
from BCBio import GFF
from Bio import SeqIO

seq_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[1], "fasta"))
# Parse GFF3 records
for record in GFF.parse(sys.argv[2], base_dict=seq_dict):
    print record.id
    for feature in record.features:
        print feature.id, [x.id for x in feature.sub_features]

Input files:

>ctgA
ACTG
>ctgB
ACTG
##gff-version 3
##sequence-region ctgA 1 17039
ctgA    glimmer gene    39  209 .   +   .   ID=orf00001
ctgA    Glimmer3    CDS 39  209 .   +   0   ID=cds_orf00001;Parent=orf00001
ctgA    glimmer gene    215 637 .   +   .   ID=orf00002
ctgA    Glimmer3    CDS 215 637 .   +   0   ID=cds_orf00002;Parent=orf00002
ctgA    glimmer gene    637 3588    .   +   .   ID=orf00003
ctgA    Glimmer3    CDS 637 3588    .   +   0   ID=cds_orf00003;Parent=orf00003
ctgA    glimmer gene    3873    4085    .   -   .   ID=orf00004
ctgA    Glimmer3    CDS 3873    4085    .   -   0   ID=cds_orf00004;Parent=orf00004
##gff-version 3
##sequence-region ctgB 1 17040
ctgB    glimmer gene    21  209 .   +   .   ID=orf00001
ctgB    Glimmer3    CDS 21  209 .   +   0   ID=cds_orf00001;Parent=orf00001
ctgB    glimmer gene    215 637 .   +   .   ID=orf00002
ctgB    Glimmer3    CDS 215 637 .   +   0   ID=cds_orf00002;Parent=orf00002
ctgB    glimmer gene    637 3588    .   +   .   ID=orf00003
ctgB    Glimmer3    CDS 637 3588    .   +   0   ID=cds_orf00003;Parent=orf00003
ctgB    glimmer gene    3873    4085    .   -   .   ID=orf00004
ctgB    Glimmer3    CDS 3873    4085    .   -   0   ID=cds_orf00004;Parent=orf00004

Actual Output:

ctgA
orf00001 ['cds_orf00001']
orf00002 ['cds_orf00002', 'cds_orf00002']
orf00003 ['cds_orf00003', 'cds_orf00003']
orf00004 ['cds_orf00004', 'cds_orf00004']
ctgB
orf00001_2 ['cds_orf00001']
orf00002 []
orf00003 []
orf00004 []

Expected Output:

ctgA
orf00001 ['cds_orf00001']
orf00002 ['cds_orf00002']
orf00003 ['cds_orf00003']
orf00004 ['cds_orf00004']
ctgB
orf00001 ['cds_orf00001']
orf00002 ['cds_orf00002']
orf00003 ['cds_orf00003']
orf00004 ['cds_orf00004']
chapmanb commented 8 years ago

Eric; Sorry about the issue. Have you tried looking at parsing in gffutils (https://github.com/daler/gffutils)? It's a better representation of the approach used here and is well maintained by Ryan. I've been generally pointing folks there to work on a single approach for parsing GFF in Python. Hope that works for you.

hexylena commented 8 years ago

@chapmanb no problem. Yeah, I saw that comment from you on another issue and started to look into it, but 100% of our tooling is based on your lib, so it'll be a bit of a migration process. And yes, that'll work fine for us as a solution.