FEA: gff3 support for reference genomes

daz10000 commented 5 years ago

The reference gene format used by the GSL compiler is considered a legacy format. It was originally derived from an SGD export, and even that was obscure. We further complicated things by subtracting one from every coordinate, making it zero-based, which is unusual in user-facing biology systems.

One proposal is to replace this legacy format with the GFF3 standard, which is a relatively common format in biology. It suffers from some standardisation issues, but it should allow rich expression of gene structure information with coordinates, be more interoperable with other bioinformatics tools, and optionally allow combining the FASTA sequence and the coordinates into a single file.

In order to implement these changes we would need to provide:

- a GFF3 parser that can replace the current ref format loader
- a tool to convert existing ref files into the GFF3 format
- optionally, a tool for validating GFF3 files, since they can be non-standard, especially with respect to where the gene identifiers are stored
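
To make the parser and validator items above more concrete, here is a minimal Python sketch (the real loader would presumably be written in F# for Amyris.Bio). It parses the nine tab-separated GFF3 columns and flags gene records that lack an `ID` attribute. The names `Gff3Record`, `parse_gff3_line` and `missing_gene_ids` are hypothetical, and the sketch ignores multi-valued attributes and `Parent` relationships.

```python
from dataclasses import dataclass
from urllib.parse import unquote


@dataclass
class Gff3Record:
    """One GFF3 feature line (hypothetical type, for illustration only)."""
    seqid: str
    source: str
    type: str
    start: int        # 1-based, inclusive, as defined by the GFF3 spec
    end: int          # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line):
    """Parse one feature line; return None for blanks, comments and directives."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attrs = {}
    if cols[8] != ".":
        for pair in cols[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                # GFF3 escapes reserved characters with URL-style percent encoding.
                attrs[unquote(key)] = unquote(value)
    return Gff3Record(unquote(cols[0]), cols[1], cols[2], int(cols[3]),
                      int(cols[4]), cols[5], cols[6], cols[7], attrs)


def missing_gene_ids(records):
    """Return the gene records that carry no ID attribute."""
    return [r for r in records if r.type == "gene" and "ID" not in r.attributes]
```

A validation tool could run `missing_gene_ids` over a whole file and report the offending lines before the compiler ever sees them.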

It might also be desirable to make the format loader a configurable option, to enable future formats. It's questionable whether we should retain support for the existing reference files or just forget about them as a bad memory and encourage migration to GFF3 ;)

I know there is an existing F# implementation of a GFF3 parser, and if that were released it would save some effort. I have code for generating GFF3 files and could quickly write the conversion and validation tools.

Do we wish to combine the DNA sequence and coordinates into a single file? The advantages are that there is just one file floating around with the whole genome, and possibly slightly faster loading. You can also ensure that the coordinates and reference sequence stay together. The disadvantage is that it's harder to get a copy of just the FASTA file for other analyses, although a conversion tool for that would also be possible.
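
The conversion tool mentioned above could be quite small. GFF3 marks the start of embedded sequence with a `##FASTA` directive, so a Python sketch only has to split the file there. The function name `split_gff3` is hypothetical, and the older convention of starting the sequence section with a bare `>` line is not handled.

```python
def split_gff3(text):
    """Split combined GFF3 text into (annotations, fasta) at the ##FASTA directive."""
    annotation, fasta = [], []
    in_fasta = False
    for line in text.splitlines():
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True      # everything after this line is sequence data
            continue
        (fasta if in_fasta else annotation).append(line)
    return "\n".join(annotation), "\n".join(fasta)
```

If the input has no `##FASTA` directive, the second element is simply an empty string.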

In terms of interface with the GSL compiler, we could initially create a loader that plugs into the existing Feature data structure, so the majority of the compiler would be untouched by this upgrade. It would be desirable to expand the data structure to capture things like intron/exon coordinates (note we have largely lost these from existing ref files although in theory they could be there). This would enable more intelligent processing of things like open reading frames in the future, but that's a bigger change to the core compiler.
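
As a rough Python illustration of a loader that feeds a Feature-like structure, the sketch below maps 1-based GFF3 coordinates onto a zero-based record. `LegacyFeature` is a hypothetical stand-in, not the real Amyris.Bio `Feature` record, and the sketch assumes the legacy format subtracts one from both the start and the end coordinate.

```python
from dataclasses import dataclass


@dataclass
class LegacyFeature:
    """Hypothetical stand-in for the compiler's Feature record."""
    name: str
    chrom: str
    left: int       # zero-based, inclusive (assumed legacy convention)
    right: int      # zero-based, inclusive (assumed legacy convention)
    forward: bool


def to_legacy_feature(seqid, start, end, strand, attributes):
    """Build a zero-based feature from 1-based, inclusive GFF3 coordinates."""
    name = attributes.get("ID") or attributes.get("Name")
    if name is None:
        raise ValueError("feature has neither an ID nor a Name attribute")
    return LegacyFeature(name, seqid, start - 1, end - 1, strand != "-")


def to_gff3_coords(feature):
    """Inverse mapping: recover the 1-based, inclusive GFF3 start and end."""
    return feature.left + 1, feature.right + 1
```

Keeping the off-by-one arithmetic in one pair of functions means the rest of the compiler never has to know which convention the file on disk used.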

chubbysilk commented 5 years ago

Yes, switching to GFF3 sounds like a good idea. Although it's sometimes difficult to use, at least GFF3 has official file format specs (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md) and is fairly well accepted by the community. Also, it would be nice to eliminate the possibility of 0-vs-1 based indexing errors.

I am aware of an F# GFF3 -> Feat.tab file parser, but not the other way around. Scripts to do GFF3 <-> Feat.tab definitely exist in Python as well.

Keeping the annotations and sequence separate is preferable, as the files are then easier to use downstream with other applications. However, if implementation in GSL is significantly easier with a combined GFF3, it's no big deal...

jlerman44 commented 5 years ago

I prefer fasta (sequence only) and gff3 (annotations only)

chrismacklin commented 5 years ago

My vote is to keep the sequence and annotations separate, to match the expectations of most other tools.

daz10000 commented 5 years ago

Got it, so separate wins. That's cool. More questions:

1. GFF3 can represent a richer set of information than we store in the Feature data structure (Amyris.Bio / Gslc), and it would be nice to at least entertain the ability to capture intron/exon structure, even if we throw it away for now internally inside the compiler. We can either upgrade the Feature record or introduce a newly named data structure. Either way the compiler would need to use the expanded data structure and have an easy way to get the same basic information out of it to maintain backward compatibility. Future features can actually deal with the introns, but it would be good to load them.
2. We have delegated the code for loading to the Amyris.Bio library, which I think makes sense, so I'm assuming we would want the new loader to live there, with anything really specific to the GSL compiler inside the library here.
3. Should we maintain backward compatibility so the compiler can still work off the old file format, or should we really try to eradicate it and forget we ever created it (just leaving a tool to do the conversion)?
4. Related to 3, is it worth going to the trouble of building in a general plug-in framework for reference genome loaders? This would help us support the two formats side-by-side and make it easier to add in custom genome sources, which would be useful.

I have code for generating GFF3s in F#, which I am happy to provide. It would certainly be helpful to have the parser or some form of it put into Amyris.Bio. I can write the conversion tool easily enough, and it would be great to have some help with plugging it into the compiler and testing. If I recall correctly, there are some variants of the current ref file format. You can get a taste of it from the existing loading code. If people have large collections of existing reference genomes, they will want to test the conversion script carefully. Maybe I can make the conversion bidirectional so we can round-trip the data and do a comparison.
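
The round-trip comparison could be as simple as the Python sketch below. It takes the two conversion directions as arguments, because the real converters do not exist yet; `round_trip_diffs` and the tuple layout used in the example are hypothetical.

```python
def round_trip_diffs(records, to_gff3, from_gff3):
    """Return (original, round_tripped) pairs that changed after converting there and back."""
    diffs = []
    for rec in records:
        back = from_gff3(to_gff3(rec))
        if back != rec:
            diffs.append((rec, back))
    return diffs


# Example with (name, left, right) tuples, assuming zero-based legacy coordinates.
def legacy_to_gff3(rec):
    name, left, right = rec
    return (name, left + 1, right + 1)


def gff3_to_legacy(rec):
    name, start, end = rec
    return (name, start - 1, end - 1)


legacy_records = [("geneA", 0, 99), ("geneB", 250, 400)]
assert round_trip_diffs(legacy_records, legacy_to_gff3, gff3_to_legacy) == []
```

An empty result means every record survived the round trip unchanged; any non-empty result points directly at the records a converter mangled.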

daz10000 commented 4 years ago

Just bumping this up again. I'm looking at the GFF3 format for some other things and would be keen to implement this. Before we write a GFF3 parser again, would you be interested in opening up the Amyris one? If not, and you don't want to release anything, would you still be interested in taking one from us and putting it into Amyris.Bio? I estimate it's about a morning's worth of coffee to write one, so no big deal either way, but worth asking before duplicating.

Amyris / GslCore

FEA: gff3 support for reference genomes #25