chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com

CDS entries in GFF3 file are not merged to a CompoundLocation #95

Open mikdur opened 9 years ago

mikdur commented 9 years ago

Given a gene that looks something like this in GFF3 notation:

##gff-version 3
scf_001 maker   gene    36837   38790   .       +       .       ID=BN869_G00000007;Name=BN869_G00000007;
scf_001 maker   mRNA    36837   38790   .       +       .       ID=BN869_T00000007_1;Parent=BN869_G00000007;Name=BN869_T00000007_1;
scf_001 maker   exon    36837   37491   .       +       .       ID=BN869_T00000007_1:exon:0;Parent=BN869_T00000007_1;
scf_001 maker   exon    37547   38790   .       +       .       ID=BN869_T00000007_1:exon:1;Parent=BN869_T00000007_1;
scf_001 maker   CDS     36837   37491   .       +       0       ID=BN869_T00000007_1:cds;Parent=BN869_T00000007_1;
scf_001 maker   CDS     37547   38790   .       +       2       ID=BN869_T00000007_1:cds;Parent=BN869_T00000007_1;

The GFF parser fails to join the two CDS rows that share an ID into a single feature with a CompoundLocation. As a result, GenBank and EMBL files produced when merging (and flattening) GFF3 annotations contain multiple CDS features where there should be a single CDS whose position is a join, e.g.:

FT   CDS             join(36837..37491,37547..38790)
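
For illustration, the merge being asked for can be sketched in plain Python, independent of the bcbb parser. This is a minimal sketch, not the library's implementation: the function name `merge_cds_locations` is hypothetical, and falling back to `Parent` when a CDS row has no `ID` is an assumption made here. It groups CDS rows by sequence and ID, then emits one EMBL/GenBank-style location string per group.

```python
def merge_cds_locations(gff_lines):
    """Return {(seqid, cds_id): location_string} with one entry per CDS ID.

    Hypothetical helper for illustration; not part of bcbb or gffutils.
    """
    groups = {}
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        # GFF3 is tab-separated; fall back to whitespace for space-aligned copies.
        cols = line.split("\t") if "\t" in line else line.split(None, 8)
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # Assumption: rows without an ID are grouped by their Parent instead.
        cds_id = attrs.get("ID") or attrs.get("Parent")
        if cds_id is None:
            continue
        group = groups.setdefault((seqid, cds_id), {"strand": strand, "parts": []})
        group["parts"].append((start, end))

    merged = {}
    for key, group in groups.items():
        parts = sorted(group["parts"])
        location = ",".join("%d..%d" % part for part in parts)
        if len(parts) > 1:
            location = "join(%s)" % location
        if group["strand"] == "-":
            location = "complement(%s)" % location
        merged[key] = location
    return merged


example = """\
##gff-version 3
scf_001\tmaker\tCDS\t36837\t37491\t.\t+\t0\tID=BN869_T00000007_1:cds;Parent=BN869_T00000007_1;
scf_001\tmaker\tCDS\t37547\t38790\t.\t+\t2\tID=BN869_T00000007_1:cds;Parent=BN869_T00000007_1;
"""

for (seqid, cds_id), location in merge_cds_locations(example.splitlines()).items():
    print("FT   CDS             %s" % location)
```

With Biopython, the equivalent step would presumably build a `CompoundLocation` from the per-row locations rather than a string; the sketch stays with strings so it runs without extra dependencies.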

bgruening commented 9 years ago

@chapmanb is there an easy solution for this? I ran into this as well when I tried to add protein sequences to CDS records.

chapmanb commented 9 years ago

Björn and Mikael; Sorry about leaving this for so long. I've been meaning to tackle it forever. Have you tried using gffutils?

https://github.com/daler/gffutils

I've been pointing everyone at Ryan's work as it's better and more up to date than this library. The goal has been to merge any missing functionality this library has there. Hopefully it'll handle your case better.

bgruening commented 9 years ago

@chapmanb yes, I'm currently developing some Galaxy integration for gffutils, but as far as I know it lacks the conversion features. You cannot convert a gff-sqlite database to GenBank, can you?

mikdur commented 9 years ago

I think I have some code that does the merge, albeit maybe not in an optimal way. I'll check it and see whether it is fit to be merged somewhere suitable.

Cheers, Mikael


bgruening commented 9 years ago

@mikdur this would be great!