KorfLab / SNAP

Gene prediction software

ZFF format: what type of feature to use? Is CDS enough? #5

Closed Juke34 closed 4 years ago

Juke34 commented 4 years ago

One common protocol for training SNAP goes through the MAKER annotation pipeline, which provides a script called maker2zff. Looking at that script, I realised that it uses only the CDS coordinates to create the Esngl, Einit, Eterm, and Exon ZFF features. What would you recommend to train SNAP better? Is using CDS only enough? Can we use exons only? I checked zoeFeature.h; what about the other features?

Would I get better training if I provided a ZFF file with Intron, UTR5, UTR3, Acceptor, Donor, Start, Stop, etc. features? Maybe most of them are computed automatically during training (e.g. Start, Stop, Acceptor, and Donor can be deduced from the exon coordinates).

maker2zff defines the Esngl, Einit, Eterm, and Exon ZFF features based on CDS GFF features. Would I get better training if I defined Esngl, Einit, Eterm, and Exon based on the exon GFF feature and added a Coding ZFF feature to specify which part of the exon is coding?

iankorf commented 4 years ago

CDS. SNAP doesn't really model the non-coding parts.
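
For readers following along, here is a minimal Python sketch of labeling CDS intervals with the four exon features named in the question. It is not the maker2zff code: the function name `cds_to_zff` and its arguments are hypothetical, and it assumes the short ZFF layout (label, begin, end, group) with minus-strand features written with begin greater than end.

```python
def cds_to_zff(cds_intervals, strand, gene_id):
    """Label CDS intervals (1-based, inclusive, start <= end) as ZFF exon lines.

    Hypothetical helper for illustration; not part of SNAP or MAKER.
    """
    # Put the CDS segments in transcription order:
    # ascending on the plus strand, descending on the minus strand.
    ordered = sorted(cds_intervals, reverse=(strand == "-"))
    last = len(ordered) - 1
    lines = []
    for i, (start, end) in enumerate(ordered):
        if last == 0:
            label = "Esngl"   # single-exon gene
        elif i == 0:
            label = "Einit"   # first coding exon
        elif i == last:
            label = "Eterm"   # last coding exon
        else:
            label = "Exon"    # internal coding exon
        # Assumption: minus-strand features are written with begin > end.
        begin, stop = (start, end) if strand == "+" else (end, start)
        lines.append(f"{label}\t{begin}\t{stop}\t{gene_id}")
    return lines


if __name__ == "__main__":
    print(">seq1")
    print("\n".join(cds_to_zff([(201, 325), (1000, 1100), (2175, 2319)], "+", "gene1")))
```

Only CDS coordinates go in, which matches the answer above: UTRs are simply not passed to the function.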

Juke34 commented 4 years ago

Does the field separator matter in the ZFF file? Should it be a space or a tab?

iankorf commented 4 years ago

I don't think it matters, but tab always looks better.
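
Since the answer is tentative, one way to sidestep the question is to normalize every feature line to tabs before training. The sketch below uses hypothetical helper names (`parse_zff_line`, `to_tab_separated`) and assumes a four-field feature line (label, begin, end, group); it says nothing about how SNAP itself parses the file.

```python
def parse_zff_line(line):
    """Split a ZFF feature line on any run of whitespace (spaces or tabs)."""
    label, begin, end, group = line.split()[:4]
    return label, int(begin), int(end), group


def to_tab_separated(line):
    """Rewrite a feature line so that fields are separated by single tabs."""
    return "\t".join(line.split())


if __name__ == "__main__":
    print(to_tab_separated("Einit  201 325 gene1"))
```

Because `str.split()` with no argument treats spaces and tabs alike, both variants of a line parse to the same tuple.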

Juke34 commented 4 years ago

One last remark: I don't think the readme mentions that genome.dna and genome.ann must be sorted by sequence identifier in the same order. On my first try, the files were not sorted in the same order and I got plenty of error messages. Now that I have sorted them the same way, everything works fine.

iankorf commented 4 years ago

It processes them one at a time. Back when SNAP was first developed, there was no way you could read all the chromosomes and annotation in at once.
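
Given that the two files are read in step, a quick pre-flight check can catch ordering problems before training. This is a minimal sketch, assuming both files mark each sequence with a `>identifier` header line; the function names are hypothetical and not part of SNAP.

```python
from itertools import zip_longest


def header_ids(lines):
    """Return the identifiers from '>' header lines, in file order."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def first_order_mismatch(dna_lines, ann_lines):
    """Return (index, dna_id, ann_id) for the first disagreement, or None.

    A missing identifier (one file has fewer sequences) is reported as None.
    """
    pairs = zip_longest(header_ids(dna_lines), header_ids(ann_lines))
    for index, (dna_id, ann_id) in enumerate(pairs):
        if dna_id != ann_id:
            return index, dna_id, ann_id
    return None


if __name__ == "__main__":
    dna = [">chr1", "ACGT", ">chr2", "GGCC"]
    ann = [">chr2", "Esngl\t1\t4\tg2", ">chr1", "Esngl\t1\t4\tg1"]
    print(first_order_mismatch(dna, ann))
```

Both functions accept any iterable of lines, so open file handles for genome.dna and genome.ann can be passed directly, and the genome is never held in memory.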
