arendsee / fagin

Classify genes using a syntenic filter
GNU General Public License v3.0

gff format problem #1

Closed lijing28101 closed 7 years ago

lijing28101 commented 7 years ago

When I use the following Glycine max GFF file

Gm01    phytozomev9_0   gene    27355   28320   .   -   .   Name=Glyma.01g000100;ID=1;Alias=Glyma.01g000100.Wm82.a2.v1;agi_genecode=magnesium%20ion%20binding%2Cthiamin%20pyrophosphate%20binding%2Chydro-lyases%2Ccatalytics%2C2-succinyl-5-enolpyruvyl-6-hydroxy-3-cyclohexene-1-carboxylic-acid%20synthases;ancestorIdentifier=Glyma01g00210.v1.1;Dbxref=AGI_GeneCode:AT1G68890.1
Gm01    phytozomev9_0   mRNA    27355   28320   .   -   .   Name=Glyma.01g000100.1;Parent=1;ID=2;Alias=Glyma.01g000100.1.Wm82.a2.v1
Gm01    phytozomev9_0   gene    58975   67527   .   -   .   Name=Glyma.01g000200;ID=8;Alias=Glyma.01g000200.Wm82.a2.v1
Gm01    phytozomev9_0   mRNA    58975   67527   .   -   .   Name=Glyma.01g000200.1;Parent=8;ID=9;Alias=Glyma.01g000200.1.Wm82.a2.v1
Gm01    phytozomev9_0   gene    67770   69968   .   +   .   Name=Glyma.01g000300;ID=21;Alias=Glyma.01g000300.Wm82.a2.v1
Gm01    phytozomev9_0   mRNA    67770   69968   .   +   .   Name=Glyma.01g000300.1;Parent=21;ID=22;Alias=Glyma.01g000300.1.Wm82.a2.v1
Gm01    phytozomev9_0   gene    90152   95947   .   -   .   Name=Glyma.01g000400;ID=25;Dbxref=Pfam:PF04434,Pfam:PF10551,GO:0008270,AGI_GeneCode:AT4G38170.1;arabidopsis_symbol=FRS9;go=zinc%20ion%20binding;agi_genecode=FAR1-related%20sequence%209;ancestorIdentifier=Glyma01g00300.v1.1;pfam=SWIM%20zinc%20finger,MULE%20transposase%20domain;Alias=Glyma.01g000400.Wm82.a2.v1
Gm01    phytozomev9_0   mRNA    90152   95947   .   -   .   Name=Glyma.01g000400.1;Parent=25;ID=26;Alias=Glyma.01g000400.1.Wm82.a2.v1;ancestorIdentifer=Glyma01g00300.1.v1.1
Gm01    phytozomev9_0   gene    90289   91197   .   +   .   Name=Glyma.01g000500;ID=36;Alias=Glyma.01g000500.Wm82.a2.v1
Gm01    phytozomev9_0   mRNA    90289   91197   .   +   .   Name=Glyma.01g000500.1;Parent=36;ID=37;Alias=Glyma.01g000500.1.Wm82.a2.v1

and the command

./2_extract-fasta.sh

I get the error message

Error: Unable to read sequence 'stdin'

I am using Fagin v0.6.0

Here is my system info

Linux xxx 3.10.0-327.22.2.el7.x86_64 #1 SMP EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
arendsee commented 7 years ago

My input scripts assume the attribute column is formatted as .*ID=([^;]+).*Parent=([^;]+).*, where the ID tag is the name of the feature and Parent the name of its parent. This follows the GFF3 specification.

Your GFF file also follows the specification, but it places Parent before ID. This is a bug on my side, since the specification does not require the tags to appear in any particular order.
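As a rough illustration of the ordering problem, the Python sketch below applies the pattern quoted above to one of the mRNA attribute columns from the report, then parses the same column tag by tag so that order no longer matters. It is not fagin's actual parser, which is not shown in this thread; the names ORDERED and parse_attributes are made up for the example, and Python is used only for convenience.

```python
import re
from urllib.parse import unquote

# The order-dependent pattern quoted in the comment above.
ORDERED = re.compile(r".*ID=([^;]+).*Parent=([^;]+).*")


def parse_attributes(column):
    """Split a GFF3 attribute column into a dict of tag -> value.

    Hypothetical helper: tags may appear in any order, and values are
    URL-unescaped (e.g. %20 becomes a space), as GFF3 allows.
    """
    attrs = {}
    for field in column.strip().split(";"):
        if not field:
            continue  # tolerate a trailing or doubled semicolon
        tag, sep, value = field.partition("=")
        if sep:  # skip fields that have no '='
            attrs[tag.strip()] = unquote(value)
    return attrs


# Attribute column of an mRNA line from the report: Parent comes before ID.
mrna = "Name=Glyma.01g000100.1;Parent=1;ID=2;Alias=Glyma.01g000100.1.Wm82.a2.v1"

print(ORDERED.match(mrna))            # None: the pattern needs ID before Parent
parsed = parse_attributes(mrna)
print(parsed["ID"], parsed["Parent"])  # 2 1
```

Splitting on ";" and then on the first "=" keeps each tag independent of its neighbours, so a file that lists Parent first parses the same way as one that lists ID first.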

Your bug report reveals a second issue, though. In my test data, the ID was also the name of the feature (e.g. AT1G1010), whereas in your files the ID is an integer key. The GFF specification says only that the ID must be unique to the feature, so both approaches are valid and fagin should be able to handle either one.
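One way to cope with both conventions, sketched below in Python, is to treat ID purely as a lookup key and take the display name from the Name tag when it exists, falling back to ID otherwise. This is an assumption about a reasonable policy, not a description of what fagin does. The helper names are hypothetical, the AT1G1010 records are invented in the style described above, and each feature is assumed to have a single parent.

```python
def parse_attributes(column):
    """Hypothetical minimal parser: 'a=1;b=2' -> {'a': '1', 'b': '2'}."""
    return dict(f.split("=", 1) for f in column.split(";") if "=" in f)


def label(attrs):
    """Assumed policy: prefer the readable Name tag, else fall back to ID."""
    return attrs.get("Name") or attrs.get("ID")


# (feature type, attribute column). The first two come from the report;
# the last two are invented to show the "ID is the name" convention.
records = [
    ("gene", "Name=Glyma.01g000200;ID=8;Alias=Glyma.01g000200.Wm82.a2.v1"),
    ("mRNA", "Name=Glyma.01g000200.1;Parent=8;ID=9;Alias=Glyma.01g000200.1.Wm82.a2.v1"),
    ("gene", "ID=AT1G1010"),
    ("mRNA", "ID=AT1G1010.1;Parent=AT1G1010"),
]

parsed = [(ftype, parse_attributes(col)) for ftype, col in records]

# Index every feature by its ID, whatever form the ID takes.
by_id = {attrs["ID"]: attrs for _, attrs in parsed}

# Map each mRNA's label to the label of its parent gene.
mrna_to_gene = {
    label(attrs): label(by_id[attrs["Parent"]])
    for ftype, attrs in parsed
    if ftype == "mRNA"
}
print(mrna_to_gene)
```

Because the parent is always resolved through the ID index first, the integer keys never leak into the output, and files that use the gene name as the ID pass through unchanged.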

I'll code up a better parser. Thanks for the bug report!

arendsee commented 7 years ago

OK, I've implemented a new GFF parser. Try pulling the latest release from master and rerunning the analysis.

arendsee commented 7 years ago

I've made a few more bug fixes and cleaned up the scripts. Now everything works on my system and should have the flexibility to handle yours.

I'll close the issue.