Open cmungall opened 5 years ago
@barrymoore has a validator in https://github.com/The-Sequence-Ontology/GAL/ - is this used in production?
Code for traversing SO graph:
@cmungall at one point the RefSeq group was using the GAL based GFF3 validator for their production GFF3 validation. I'm not sure if there are others using it or if RefSeq is still using it.
Thanks! Do you have a contact?
Hi Chris,
Terrence Murphy (murphyte@ncbi.nlm.nih.gov) was the person at NCBI that I interacted with, but that was probably 5+ years ago. He provided several suggestions for updates that were incorporated into the validator. He’s still at NCBI as far as I know, but I'm not sure if he’s still involved with RefSeq GFF3 validation.
Barry
I haven't found any of the GFF3 validators I know of to be definitive or complete. I haven't kept track of which validators have which problems, but I've observed:
For using GFF3 for annotation submission, we primarily rely on converting it to ASN.1 as we see fit (including allowing deviations from the spec, like CDS rows with no or different IDs) and running our standard ASN.1 validation code on the result, which is much more extensive than possible with GFF3 alone (in part because it can analyze a feature vs. its sequence, which no GFF3 validator I know of can do). We do point users to a couple of validators to do a preliminary check: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/
For GFF3 that we output, I've occasionally done some bulk analyses with different validators, but we don't routinely run any of the validators on everything that we produce.
Defining a set of tests expected for any validator beyond the most rudimentary would likely re-expose some of the issues with the spec that were never resolved (trans-splicing, anyone?). We'd need to define a better approach for those issues before we could make much progress.
Thanks Terrence, this is really useful. Looks like some more work in this area could be beneficial, both in refining the spec and in creating a complete, scalable reference validator. This could perhaps be abstracted above the format specifics of GFF3 and include the NCBI ASN.1 representation as well as FALDO GFF (cc @JervenBolleman). But I'm not sure who has resources to work on this!
In the interim this leaves SO in a bit of an odd state: it is hard to evolve it without knowing whether a change will cause false positives or false negatives in any given validator (some kind of containerized workflow / integration test with a large bank of sample GFF3s would be super-useful here).
@cmungall there was a GFF3 working group started at one point (apart from me, @barrymoore and @murphyte were also on this, I believe?), but it has been several years. Maybe this or something similar is needed again?
I do think it would be very useful to have a repo with example common cases from the spec as well as more problematic 'edge' cases, then build out tests from that as Terrence mentioned. This should help point to problem areas in the current specification, maybe home in on an 'official' validation tool, and lead to improvements. Similar approaches seem to have worked with other formats with specifications, e.g. SAM/BAM, VCF, CWL, etc.
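To make the idea of building tests from a bank of example cases a bit more concrete, here is a rough Python sketch. Everything in it is hypothetical: the two toy "validators", the case names and the `conformance_report` helper are invented for illustration, and real validators would more likely be command-line tools wrapped in a small adapter.

```python
def conformance_report(validators, cases):
    """Compare each validator's verdict with the expected verdict per case.

    validators: dict mapping a validator name to a callable(text) -> bool
    cases: list of (case_name, gff3_text, expected_valid)
    Returns a dict mapping validator name to the case names it got wrong.
    """
    failures = {name: [] for name in validators}
    for case_name, text, expected in cases:
        for name, is_valid in validators.items():
            if bool(is_valid(text)) != expected:
                failures[name].append(case_name)
    return failures


# Toy stand-ins for real validators (hypothetical, for illustration only).
def pragma_only(text):
    # Accepts anything that starts with the version directive.
    return text.startswith("##gff-version 3")


def nine_columns(text):
    # Additionally requires every feature row to have nine tab-separated columns.
    rows = [line for line in text.splitlines() if line and not line.startswith("#")]
    return pragma_only(text) and all(len(row.split("\t")) == 9 for row in rows)


# A tiny example bank: one common case and one broken case.
CASES = [
    ("minimal_gene", "##gff-version 3\nchr1\t.\tgene\t1\t10\t.\t+\t.\tID=g1\n", True),
    ("missing_column", "##gff-version 3\nchr1\t.\tgene\t1\t10\t.\t+\t.\n", False),
]

report = conformance_report(
    {"pragma_only": pragma_only, "nine_columns": nine_columns}, CASES
)
print(report)
```

A report like this shows which validators disagree with the expected verdict on which cases, which is the kind of signal that would be needed before changing SO or the spec.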
@cmungall Apollo does some basic "validation" when it tries to upload GFF3 by trying to import the structure into a reasonable internal model (which is SO compliant) versus one that is using the SO explicitly for adherence. I think that is what most validation (e.g., Tripal) does as well, though I'm not familiar with how Chado enforces structure, but I'm going to guess it doesn't do validation either. That being said, I like the idea, but it definitely works better for an RDF structure.
Other validator / parsers:
To clarify what I was saying: currently, groups run exported GFF3 through several validation and merging scripts.
I've been asked to write validators within Apollo, but those validate against the model (which we do a lot of already), which in turn should validate the GFF3 on export (we have a test that re-imports to validate round-trip, as well).
The issue is that a lot of validation steps are very group-dependent (checking status fields, export tags, etc.), though there are some general ones we could add.
Reference to the NCBI valid GFF3 is here: https://github.com/GMOD/Apollo/issues/565
Some of the issues that we've seen are:
<attribute>=;. But I wouldn't be surprised if some of the existing validators object to that.
DBTAG:ID format (which allows some checking for unexpected usage of commas). I don't know if all the validators check for that.
Bottom line: the flexibility of the GFF3 format means there aren't many absolute validation rules. But there are a set of best practices and other issues that can be reported as warnings.
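As an illustration of reporting such issues as warnings, here is a minimal Python sketch covering the two attribute-level cases mentioned above: an attribute with an empty value, and a Dbxref value that is not in DBTAG:ID form. The function name and the messages are made up for this example and are not taken from any existing validator; treating an empty value as a warning rather than an error is an assumption.

```python
from urllib.parse import unquote


def attribute_warnings(column9: str) -> list[str]:
    """Return best-practice warnings for one GFF3 attributes column (column 9)."""
    warnings = []
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing ';'
        tag, sep, value = pair.partition("=")
        if not sep:
            warnings.append(f"attribute without '=': {pair!r}")
            continue
        if value == "":
            warnings.append(f"empty value for attribute {tag!r}")
            continue
        if tag == "Dbxref":
            # Multiple values are comma-separated, so an unescaped comma
            # inside a value shows up here as a piece without a DBTAG.
            for ref in value.split(","):
                dbtag, colon, ident = unquote(ref).partition(":")
                if not (dbtag and colon and ident):
                    warnings.append(f"Dbxref value not in DBTAG:ID form: {ref!r}")
    return warnings


print(attribute_warnings("ID=gene1;Note=;Dbxref=GeneID:123,foo"))
```

Because the function only returns messages, the caller can decide whether a given group wants to treat any of them as fatal.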
As part of the AgBioData consortium, we did a bit of a survey of the GFF parsing and validation tools available. Taking the comments on this ticket into account, this tool, https://github.com/NAL-i5K/gff3toolkit, does emit warnings etc. based on the best practices and specification for this flexible format.
https://gff3-py.readthedocs.io/en/latest/readme.html#features
http://genometools.org/cgi-bin/gff3validator.cgi
https://github.com/NAL-i5K/gff3toolkit - AgBioData parser/validator.
modENCODE validator
https://sourceforge.net/p/gmod/svn/HEAD/tree/gff-validator/trunk/validate_gff3.pl
http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread
BioFSharp
https://github.com/daler/gffutils
https://pypi.org/project/bcbio-gff/
One more for possible consideration is this: https://github.com/genometools/genometools/wiki/speck-User-manual
it's different than the genometools gff3-validator @sierra-moxon has in the list above, and kind of interesting in that it allows extensibility via a DSL. I've only explored it lightly, but their examples seem to work as advertised. Might be a nice approach if different "dialects" need to be supported.
I know this thread has trailed off, but I wanted to add to the discussion because I think there's still a need for some sort of "official" GFF validator, test suite, etc. -- at least something where if I'm developing a new bioinformatic tool that can output as GFF, I can have something to check my tool's outputs against, especially if they are structural annotations/gene models as opposed to some arbitrary feature/functional annotation. Hopefully some of these links help push the discussion forward!
https://github.com/BioJulia/BioFmtSpecimens - Collection of real-world bioinformatics file format specimens to test against
https://github.com/cmdcolin/oddgenes - Collection of "odd genes" and edge cases
https://gfacs.readthedocs.io/en/latest/index.html - gFACs: Gene Filtering, Analysis, and Conversion
https://agat.readthedocs.io/en/latest/agat_how_does_it_work.html - AGAT: Another GTF/GFF Analysis Toolkit
https://easy-import.readme.io/docs/repairing-gff - easy-import
https://github.com/TAMU-CPT/CPT_GffParser - a BioPython-compatible library for parsing and fixing GFF data
https://bioconvert.readthedocs.io/en/main/# - 'BioConvert': a collaborative project to facilitate the interconversion of life science data formats
https://github.com/jorvis/biocode - 'biocode': a collection of bioinformatics code libraries and scripts (see gff subdirectory)
Great to see there is still a lot of interest in this
I am planning to create a LinkML schema for GFF3. This would have a lot of advantages:
This could serve as a reference against which different validators could indicate conformance, and also directly as a validator
That sounds nice! Glad to see that it is YAML-based and not something complex or particularly domain-specific.
However, can you clarify the concept of "profiles"? To my ear, it sounds like it would involve different rule-sets for GFFs from different organisms. If that's the case, (IMO) that seems like something that would be a minor step in the wrong direction (though that discussion may be for another thread).
The spec points here: https://github.com/modENCODE-DCC/validator/blob/master/new_gff_validator.pl
This is 7-year-old Perl code.
The SO wiki has: http://www.sequenceontology.org/so_wiki/index.php/GFF3_Validation_Tools
which has GFFO (not in use?), FALDO (not really a validator) and the modENCODE validator. The modENCODE validator link doesn't work. But it seems to be this code: https://github.com/genometools/genometools which is in C
Reciprocal ticket: https://github.com/genometools/genometools/issues/910
A question here: https://www.biostars.org/p/177319/ points to another validator, this one in Python: http://www.raetschlab.org/suppl/gff-tools
Which of these is supported? Is the behavior identical? What expectations does each have on the SO obo file?
I don't think the spec should link to specific validators. However, the spec should indicate the expected behavior of the validator. This could be modularized into different checks, and we could group checks into profiles. E.g. some validators may only validate a basic syntactic profile. Others could validate a sofa profile, where we check that the type column maps to a SO ID.
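A rough Python sketch of what modular checks grouped into profiles could look like. This is only an illustration: the check functions, the `PROFILES` table and the tiny `SOFA_TERMS` subset are placeholders (a real validator would load the terms from the SO release), and none of it is taken from an existing validator.

```python
# Placeholder subset of SOFA terms (name -> accession); a real tool would
# read the full list from the SO OBO file.
SOFA_TERMS = {
    "gene": "SO:0000704",
    "mRNA": "SO:0000234",
    "exon": "SO:0000147",
    "CDS": "SO:0000316",
}


def check_nine_columns(cols):
    if len(cols) != 9:
        return f"expected 9 tab-separated columns, found {len(cols)}"


def check_coordinates(cols):
    start, end = cols[3], cols[4]
    if not (start.isdecimal() and end.isdecimal()):
        return "start and end must be positive integers"
    if int(start) < 1 or int(start) > int(end):
        return "start must be >= 1 and <= end"


def check_type_in_sofa(cols):
    # The type column may hold either a term name or an accession.
    so_type = cols[2]
    if so_type not in SOFA_TERMS and so_type not in SOFA_TERMS.values():
        return f"type {so_type!r} is not a known SOFA term"


# Each profile is an ordered list of checks; "sofa" extends "syntax".
PROFILES = {
    "syntax": [check_nine_columns, check_coordinates],
    "sofa": [check_nine_columns, check_coordinates, check_type_in_sofa],
}


def validate(lines, profile):
    """Yield (line_number, message) for every failed check in the profile."""
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        for check in PROFILES[profile]:
            message = check(cols)
            if message:
                yield number, message
                if check is check_nine_columns:
                    break  # later checks index into the columns


EXAMPLE = [
    "##gff-version 3",
    "chr1\t.\tgene\t100\t200\t.\t+\t.\tID=g1",
    "chr1\t.\tthingy\t300\t250\t.\t+\t.\tID=x",
]
print(list(validate(EXAMPLE, "syntax")))
print(list(validate(EXAMPLE, "sofa")))
```

With this layout a validator could declare which profiles it implements, and the same input can pass one profile while failing a stricter one.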
Understanding how validators use relationships is important for maintenance of SO: https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/465
There could be a validator registry separate from the spec, and defined conformance tests for the validators