The-Sequence-Ontology / Specifications

GFF and GVF specification documents

Is there a single canonical validator, or multiple implementations? #18

Open cmungall opened 5 years ago

cmungall commented 5 years ago

The spec points here: https://github.com/modENCODE-DCC/validator/blob/master/new_gff_validator.pl

This is 7-year-old Perl code.

The SO wiki has: http://www.sequenceontology.org/so_wiki/index.php/GFF3_Validation_Tools

which has GFFO (not in use?), FALDO (not really a validator) and the modENCODE validator. The modENCODE validator link doesn't work. But it seems to be this code: https://github.com/genometools/genometools which is in C

Reciprocal ticket: https://github.com/genometools/genometools/issues/910

A question here: https://www.biostars.org/p/177319/ points to another validator, this one in Python: http://www.raetschlab.org/suppl/gff-tools

Which of these is supported? Is the behavior identical? What expectations does each have on the SO obo file?

I don't think the spec should link to specific validators. However, the spec should indicate the expected behavior of the validator. This could be modularized into different checks, and we could group checks into profiles. E.g. some validators may only validate a basic syntactic profile. Others could validate a sofa profile, where we check that the type column maps to a SO ID.
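To make the idea of modular checks grouped into profiles concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the check names, the profile names `syntax` and `sofa`, and the tiny `SOFA_TERMS` set, which stands in for terms a real validator would load from the SO OBO file.

```python
import re
from typing import Callable, Dict, List

# A check receives the tab-separated columns of one feature line and
# returns a list of problems (an empty list means the check passed).
Check = Callable[[List[str]], List[str]]

# Hypothetical stand-in for the SOFA term names; a real validator
# would read these from the SO OBO file.
SOFA_TERMS = {"gene", "mRNA", "exon", "CDS"}


def check_column_count(cols: List[str]) -> List[str]:
    return [] if len(cols) == 9 else [f"expected 9 columns, found {len(cols)}"]


def check_coordinates(cols: List[str]) -> List[str]:
    if len(cols) < 5:
        return []  # already reported by check_column_count
    if not all(re.fullmatch(r"[0-9]+", c) for c in cols[3:5]):
        return ["start and end must be integers"]
    start, end = int(cols[3]), int(cols[4])
    if start < 1 or start > end:
        return [f"invalid range {start}..{end}"]
    return []


def check_type_is_so_term(cols: List[str]) -> List[str]:
    if len(cols) < 3:
        return []
    feature_type = cols[2]
    # Accept either a known term name or something shaped like an SO accession.
    if feature_type in SOFA_TERMS or re.fullmatch(r"SO:[0-9]{7}", feature_type):
        return []
    return [f"type {feature_type!r} is not a known SOFA term or SO accession"]


# A profile is just a named, ordered group of checks.
PROFILES: Dict[str, List[Check]] = {
    "syntax": [check_column_count, check_coordinates],
    "sofa": [check_column_count, check_coordinates, check_type_is_so_term],
}


def validate_line(line: str, profile: str) -> List[str]:
    cols = line.rstrip("\n").split("\t")
    problems: List[str] = []
    for check in PROFILES[profile]:
        problems.extend(check(cols))
    return problems
```

With this layout, a validator could advertise which profiles it implements, and the same line can pass `syntax` while failing `sofa`.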

Understanding how validators use relationships is important for maintenance of SO: https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/465

There could be a validator registry separate from the spec, and defined conformance tests for the validators

cmungall commented 5 years ago

@barrymoore has a validator in https://github.com/The-Sequence-Ontology/GAL/ - is this used in production?

Code for traversing SO graph:

https://github.com/The-Sequence-Ontology/GAL/blob/0efe59c5db26e46c7ca4d1d472e54c5f5dd8fc4f/bin/gff3_validator#L632-L655

barrymoore commented 5 years ago

@cmungall at one point the RefSeq group was using the GAL based GFF3 validator for their production GFF3 validation. I'm not sure if there are others using it or if RefSeq is still using it.

cmungall commented 5 years ago

Thanks! Do you have a contact?

barrymoore commented 5 years ago

Hi Chris,

Terrence Murphy (murphyte@ncbi.nlm.nih.gov) was the person at NCBI that I interacted with, but that was probably 5+ years ago. He provided several suggestions for updates that were incorporated into the validator. He's still at NCBI as far as I know, but I'm not sure if he's still involved with RefSeq GFF3 validation.

Barry

murphyte commented 5 years ago

I haven't found any of the GFF3 validators I know of to be definitive or complete. I haven't kept track of which validators have which problems, but I've observed:

  1. missing SO terms
  2. imposing requirements that do commonly show up in code consuming GFF3, but aren't actually part of the spec. In particular, single features spanning multiple rows with the same ID, other than CDS, are allowed by the spec, but some validators report them as errors.
  3. Not testing as much as we'd like. Part of this is from an overly flexible spec which makes it hard to define what is 'valid'
  4. unacceptable performance (the old modENCODE validator in particular couldn't handle anything larger than a bacterial genome's worth of annotation)

For using GFF3 for annotation submission, we primarily rely on converting it to ASN.1 as we see fit (including allowing deviations from the spec, like CDS rows with no or different IDs) and running our standard ASN.1 validation code on the result, which is much more extensive than possible with GFF3 alone (in part because it can analyze a feature vs. its sequence, which no GFF3 validator I know of can do). We do point users to a couple of validators to do a preliminary check: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/

For GFF3 that we output, I've occasionally done some bulk analyses with different validators, but we don't routinely run any of the validators on everything that we produce.

Defining a set of tests expected for any validator beyond the most rudimentary would likely re-expose some of the issues with the spec that were never resolved (trans-splicing, anyone?). We'd need to define a better approach for those issues before we could make much progress.

cmungall commented 5 years ago

Thanks Terrence, this is really useful. Looks like some more work in this area could be beneficial, in refining the spec and creating a fully complete scalable reference validator. This could perhaps be abstracted above format specifics of GFF3 and include the NCBI ASN.1 representation as well as FALDO GFF (cc @JervenBolleman). But not sure who has resources to work on this!

In the interim it leaves SO in a bit of an odd state, hard to evolve it without knowing if changes will result in false positive or false negative changes in any given validator (some kind of containerized workflow / integration test with a large bank of sample GFF3s would be super-useful here)
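As a rough illustration of that kind of integration test, the sketch below runs a validator built against an "old" and a "new" set of SO terms over a bank of sample files and reports which verdicts flipped. The names `type_validator` and `verdict_changes` are hypothetical, and the toy validator only checks column 3; a real setup would wrap the actual validators (for example in containers) behind the same callable interface.

```python
from typing import Callable, Dict, Set, Tuple

# A validator is modelled as a callable: GFF3 text in, True ("valid") or False out.
Validator = Callable[[str], bool]


def type_validator(so_terms: Set[str]) -> Validator:
    """Build a toy validator that accepts a file only if every feature
    line has 9 columns and a type (column 3) found in so_terms."""
    def validate(gff3_text: str) -> bool:
        for line in gff3_text.splitlines():
            if not line or line.startswith("#"):
                continue  # skip blank lines, directives and comments
            cols = line.split("\t")
            if len(cols) != 9 or cols[2] not in so_terms:
                return False
        return True
    return validate


def verdict_changes(
    samples: Dict[str, str], old: Validator, new: Validator
) -> Dict[str, Tuple[bool, bool]]:
    """Return {sample name: (old verdict, new verdict)} for every sample
    whose verdict differs between the two validators."""
    changes = {}
    for name, text in samples.items():
        before, after = old(text), new(text)
        if before != after:
            changes[name] = (before, after)
    return changes
```

An empty result would mean the ontology change is invisible to that validator on the sample bank; any entry is a file that became newly rejected or newly accepted and deserves a look.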

cjfields commented 5 years ago

@cmungall there was a GFF3 working group started at one point (apart from me @barrymoore and @murphyte were also on this I believe?), but it has been several years. Maybe this or something similar is needed again?

I do think it would be very useful to have a repo with example common cases from the spec as well as more problematic 'edge' cases, then build out tests from that as Terrence mentioned. This should help point to problem areas in the current specification, maybe home in on an 'official' validation tool, and lead to improvements. Similar approaches seem to have worked with other formats with specifications, e.g. SAM/BAM, VCF, CWL, etc.

nathandunn commented 5 years ago

@cmungall Apollo does some basic "validation" when it tries to upload GFF3 by trying to import the structure into a reasonable internal model (which is SO compliant) versus one that is using the SO explicitly for adherence. I think that is what most validation (e.g., Tripal) does as well, though I'm not familiar with how Chado enforces structure, but I'm going to guess it doesn't do validation either. That being said, I like the idea, but it definitely works better for an RDF structure.

Other validator / parsers:

nathandunn commented 5 years ago

To clarify what I was saying: currently, groups run exported GFF3 through several validation and merging scripts.

I've been asked to write validators within Apollo, but those validate against the model (which we do a lot of already), which in turn should validate the GFF3 on export (we have a test that re-imports to validate round-trip, as well).
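For readers unfamiliar with the round-trip idea, here is a minimal sketch of such a test in Python. It is not Apollo's code: the "model" is just a list of tuples, and `parse_gff3`, `export_gff3` and `round_trips` are hypothetical names. It assumes well-formed 9-column feature lines and only percent-encodes the characters reserved in column 9.

```python
import re
from urllib.parse import unquote

# Characters that must be percent-encoded inside attribute tags and values.
RESERVED = re.compile(r"[;=&,%\x00-\x1f\x7f]")


def encode(text):
    return RESERVED.sub(lambda m: f"%{ord(m.group()):02X}", text)


def parse_attributes(column9):
    attrs = {}
    for field in column9.split(";"):
        if not field:
            continue  # tolerate ';;' and trailing ';'
        tag, _, value = field.partition("=")
        # Split on commas first, then decode, so encoded commas survive.
        attrs[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attrs


def format_attributes(attrs):
    return ";".join(
        encode(tag) + "=" + ",".join(encode(v) for v in values)
        for tag, values in attrs.items()
    )


def parse_gff3(text):
    """Import: GFF3 text -> list of (columns 1-8, attribute dict)."""
    features = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        features.append((tuple(cols[:8]), parse_attributes(cols[8])))
    return features


def export_gff3(features):
    """Export: model -> GFF3 text."""
    lines = ["##gff-version 3"]
    for cols, attrs in features:
        lines.append("\t".join(cols + (format_attributes(attrs),)))
    return "\n".join(lines) + "\n"


def round_trips(text):
    """True if import -> export -> import yields the same model."""
    model = parse_gff3(text)
    return parse_gff3(export_gff3(model)) == model
```

Comparing models rather than raw text means cosmetic differences (a stray `;;`, attribute spacing) do not fail the test, while lost or altered values do.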

The issue is that a lot of validation steps are very group-dependent (checking status fields, export tags, etc.), though there are some general ones we could add.

Reference to the NCBI valid GFF3 is here: https://github.com/GMOD/Apollo/issues/565

murphyte commented 5 years ago

Some of the issues that we've seen are:

  1. CDS features incompatible with exon features of the same parent mRNA. A validation check would need to allow for ribosomal slippage
  2. child features outside the range of their parent. This isn't invalid per se by the GFF3 specs, and there are definitely examples within INSDC annotations where it doesn't hold true. But I think it should hold true for certain types of features (e.g. a CDS should fall within the range of its parent mRNA, and same for mRNA within its parent gene).
  3. IDs with discontinuous features. Features spanning multiple rows with the same ID (like how multi-exon CDSes are shown in the current specs) are technically allowed for any feature type. But some code (e.g. Cufflinks) expects IDs to be unique. IIRC, one of the existing validators checks for uniqueness, and I think another one does allow the same ID on multiple rows, but I believe (a) they must have the same feature type (which I think should be part of the spec), and (b) I think it checks that they're on the same seqid. The latter also isn't formally part of the specs, although I suspect it's expected by most code. There are some old notes at: http://www.sequenceontology.org/so_wiki/index.php/Discontinuous_features
  4. the GFF3 spec doesn't explicitly say how to specify a NULL value. The only sensible way within the spec is using <attribute>=;. But I wouldn't be surprised if some of the existing validators object to that.
  5. files with ;;. This is more of a test for reader implementations to be sure they tolerate it, although I could see having it as a warning in a validator.
  6. validating usage of commas in attributes. We've seen cases where commas in attribute values aren't properly encoded, raising the question of whether they're delimiting multiple values or just an encoding error. A validator could report the set of attributes where commas are observed in the values as a sanity check.
  7. Dbxrefs values. These can be validated to the expected DBTAG:ID format (which allows some checking for unexpected usage of commas). I don't know if all the validators check for that.
  8. Are SO terms in column 3 case sensitive? This is another area where the GFF3 spec and SO are ambiguous.
  9. CDS IDs. As discussed elsewhere, there are two styles of CDS IDs in GFF3: a) those like in the spec, where multiple CDS rows have the same ID, and b) those where multiple CDS rows have distinct IDs (somewhat like exon). We've resorted to taking an approach where CDS rows with the same mRNA parent are all considered to be part of the same CDS in order to handle (b), which does prevent truly annotating multiple CDS features on the same mRNA, but that's incredibly rare and not compatible with lots of code so it's best to discourage anyway. This could be reported as a warning.
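Items 4 to 7 above all concern column 9, so they can be sketched as one attribute-level pass. The Python below is only an illustration: `attribute_warnings` is a hypothetical name, the `DBXREF` pattern is one possible reading of "DBTAG:ID", and every finding is reported as a warning rather than an error.

```python
import re
from urllib.parse import unquote

# Assumed shape of a Dbxref value: a tag without colons or whitespace,
# a colon, then a non-empty ID without whitespace.
DBXREF = re.compile(r"[^:\s]+:\S+")


def attribute_warnings(column9):
    """Return (warnings, tags_with_commas) for one attributes column."""
    warnings = []
    tags_with_commas = set()

    if ";;" in column9:
        warnings.append("empty attribute field (';;')")

    for field in column9.split(";"):
        if not field:
            continue  # readers should tolerate empty fields
        tag, sep, value = field.partition("=")
        if not sep:
            warnings.append(f"attribute {tag!r} has no '='")
            continue
        if value == "":
            warnings.append(f"attribute {tag!r} has an empty value")
            continue

        # Unencoded commas delimit multiple values; record where they occur
        # so a human can judge whether that was intended.
        values = [unquote(v) for v in value.split(",")]
        if len(values) > 1:
            tags_with_commas.add(tag)

        if tag == "Dbxref":
            for v in values:
                if not DBXREF.fullmatch(v):
                    warnings.append(f"Dbxref value {v!r} is not DBTAG:ID")

    return warnings, tags_with_commas
```

Returning the set of comma-bearing tags, instead of flagging each one, matches the "sanity check" framing: a comma in `Dbxref` is probably fine, while one in a free-text attribute such as a product name more likely signals a missing `%2C`.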

Bottom line: the flexibility of the GFF3 format means there aren't many absolute validation rules. But there are a set of best practices and other issues that can be reported as warnings.
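A sketch of how two of the structural points above (children outside their parent's range, and rows sharing an ID) might be reported as warnings. It assumes features have already been parsed into dicts with the keys `seqid`, `type`, `start`, `end`, `id` and `parents`; that layout and the name `structural_warnings` are illustrative, not taken from any existing validator.

```python
from collections import defaultdict


def structural_warnings(features):
    """features: list of dicts with keys seqid, type, start, end,
    id (str or None) and parents (list of parent IDs)."""
    warnings = []

    # Group rows by ID; several rows with one ID form a discontinuous feature.
    rows_by_id = defaultdict(list)
    for f in features:
        if f["id"] is not None:
            rows_by_id[f["id"]].append(f)

    for fid, rows in rows_by_id.items():
        if len({r["type"] for r in rows}) > 1:
            warnings.append(f"ID {fid}: rows have different types")
        if len({r["seqid"] for r in rows}) > 1:
            warnings.append(f"ID {fid}: rows are on different seqids")

    # Overall extent of each feature, taken across all of its rows.
    extent = {
        fid: (min(r["start"] for r in rows), max(r["end"] for r in rows))
        for fid, rows in rows_by_id.items()
    }

    for f in features:
        for parent in f["parents"]:
            if parent not in extent:
                continue  # unresolved parents are not checked in this sketch
            lo, hi = extent[parent]
            if f["start"] < lo or f["end"] > hi:
                warnings.append(
                    f"{f['type']} {f['start']}..{f['end']} lies outside "
                    f"parent {parent} ({lo}..{hi})"
                )
    return warnings
```

Because none of these conditions is strictly invalid under the spec, the function only collects messages; whether a given profile treats them as warnings or errors is left to the caller.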

sierra-moxon commented 3 years ago

As part of the AgBioData consortium, we did a bit of a survey of the GFF parsing and validation tools available. Taking comments on this ticket into account, this tool https://github.com/NAL-i5K/gff3toolkit does create warnings, etc. based on this flexible format's best practices/specification.

https://gff3-py.readthedocs.io/en/latest/readme.html#features

http://genometools.org/cgi-bin/gff3validator.cgi

https://github.com/NAL-i5K/gff3toolkit - AgBioData parser/validator.

modENCODE validator

https://sourceforge.net/p/gmod/svn/HEAD/tree/gff-validator/trunk/validate_gff3.pl

http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread

BioFSharp

https://github.com/daler/gffutils

https://pypi.org/project/bcbio-gff/

adf-ncgr commented 3 years ago

One more for possible consideration is this: https://github.com/genometools/genometools/wiki/speck-User-manual

it's different than the genometools gff3-validator @sierra-moxon has in the list above, and kind of interesting in that it allows extensibility via a DSL. I've only explored it lightly, but their examples seem to work as advertised. Might be a nice approach if different "dialects" need to be supported.

dtdoering commented 6 months ago

I know this thread has trailed off, but I wanted to add to the discussion because I think there's still a need for some sort of "official" GFF validator, test suite, etc. -- at least something where if I'm developing a new bioinformatic tool that can output as GFF, I can have something to check my tool's outputs against, especially if they are structural annotations/gene models as opposed to some arbitrary feature/functional annotation. Hopefully some of these links help push the discussion forward!

Some collections of "real-world" cases, useful for building a testing suite

https://github.com/BioJulia/BioFmtSpecimens - Collection of real-world bioinformatics file format specimens to test against

https://github.com/cmdcolin/oddgenes - Collection of "odd genes" and edge cases

Some additional GFF parsers / validators that haven't been mentioned yet

https://gfacs.readthedocs.io/en/latest/index.html - gFACs: Gene Filtering, Analysis, and Conversion

https://agat.readthedocs.io/en/latest/agat_how_does_it_work.html - AGAT: Another GTF/GFF Analysis Toolkit

https://easy-import.readme.io/docs/repairing-gff - easy-import

https://github.com/TAMU-CPT/CPT_GffParser - a BioPython-compatible library for parsing and fixing GFF data

Some format conversion-specific tools

https://bioconvert.readthedocs.io/en/main/# - 'BioConvert': a collaborative project to facilitate the interconversion of life science data formats

https://github.com/jorvis/biocode - 'biocode': a collection of bioinformatics code libraries and scripts (see gff subdirectory)

cmungall commented 6 months ago

Great to see there is still a lot of interest in this

I am planning to create a LinkML schema for GFF3. This would have a lot of advantages:

This could serve as a reference against which different validators could indicate conformance, and also directly as a validator

dtdoering commented 4 months ago

That sounds nice! Glad to see that it is YAML-based and not something complex or particularly domain-specific.

However, can you clarify the concept of "profiles"? To my ear, it sounds like it would involve different rule-sets for GFFs from different organisms. If that's the case, (IMO) that seems like something that would be a minor step in the wrong direction (though that discussion may be for another thread).