Open vpbrendel opened 5 years ago
I think this represents a discrepancy between how the CDS is encoded in the Araport file and how CDSs are typically encoded in GFF3. In this test file, there are 4-5 CDS records associated with each mRNA. These of course don't represent distinct coding sequences, but a single discontinuous CDS. In GFF3, this is typically represented as a multifeature, where multiple records/lines are required to fully represent a single feature. The key to defining multifeatures is that each record associated with the multifeature must have the same ID (see the canonical gene example from the GFF3 spec).
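To make the multifeature convention concrete, here is a minimal Python sketch that lists, for each mRNA, the IDs carried by its CDS records. More than one distinct ID under a single parent is the pattern described above. The function names and the sample records are made up for illustration and are not taken from the Araport file; the attribute parsing is simplified (no URL-decoding).

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse a GFF3 attribute column into a dict (simplified: no URL-decoding)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def cds_ids_by_parent(gff3_text):
    """Map each Parent ID to the list of IDs found on its CDS records."""
    ids = defaultdict(list)
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                ids[parent].append(attrs.get("ID"))
    return dict(ids)


def parents_with_distinct_cds_ids(gff3_text):
    """Return parents whose CDS records do not all share one ID."""
    grouped = cds_ids_by_parent(gff3_text)
    return sorted(p for p, ids in grouped.items() if len(set(ids)) > 1)


# Hypothetical records: mrna1 uses one shared CDS ID (a multifeature),
# mrna2 gives every CDS segment its own ID.
SAMPLE = (
    "##gff3-version 3\n"
    "chr1\tsrc\tCDS\t10\t50\t.\t+\t0\tID=cds1;Parent=mrna1\n"
    "chr1\tsrc\tCDS\t80\t120\t.\t+\t1\tID=cds1;Parent=mrna1\n"
    "chr1\tsrc\tCDS\t300\t350\t.\t+\t0\tID=cdsA;Parent=mrna2\n"
    "chr1\tsrc\tCDS\t400\t450\t.\t+\t0\tID=cdsB;Parent=mrna2\n"
)

if __name__ == "__main__":
    print(parents_with_distinct_cds_ids(SAMPLE))
```

Note that distinct CDS IDs under one mRNA are not necessarily an error, since the spec also permits multiple CDSs per mRNA, so a report like this is only a hint that the file deserves a closer look.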
The Araport GFF3 doesn't follow this convention. Frankly, this isn't all that uncommon. The canon-gff3 program uses the GenomeTools GFF3 parser, and it looks like the GenomeTools core devs designed the parser to detect cases like this and interpret them correctly as multifeatures. The result is a multifeature in the output, where the ID now equals the ID of the first record encountered. Since GenomeTools doesn't automatically update the Name attribute, this is the source of the discrepancy.
Hmm. This is still a problem. See:
    gt gff3validator test.gff3
    input is valid GFF3
    gt gff3validator test.canon.gff3
    gt gff3validator: error: the multi-feature with ID "AT5G01280:CDS:7" on line 9 in file "test.canon.gff3" has a different attribute 'Name' than its counterpart on line 8 ('AT5G01280:CDS:5' vs. 'AT5G01280:CDS:7')
That is, canon-gff3 turns valid GFF3 into something that is not.
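A rough way to spot this condition without the validator is to group records by ID and flag any ID whose records carry more than one Name. The Python sketch below does that; it only approximates the check reported above and is not how GenomeTools implements it. The function name and the sample records are hypothetical.

```python
from collections import defaultdict


def multifeature_name_conflicts(gff3_text):
    """Return {ID: sorted Names} for IDs whose records disagree on Name."""
    names_by_id = defaultdict(set)
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        # Simplified attribute parsing: no URL-decoding.
        attrs = dict(
            field.split("=", 1)
            for field in cols[8].strip().split(";")
            if "=" in field
        )
        if "ID" in attrs and "Name" in attrs:
            names_by_id[attrs["ID"]].add(attrs["Name"])
    return {
        feature_id: sorted(names)
        for feature_id, names in names_by_id.items()
        if len(names) > 1
    }


# Hypothetical output resembling the case above: both records share one ID,
# but each kept the Name it had in the input.
SAMPLE = (
    "##gff3-version 3\n"
    "chr1\tsrc\tCDS\t10\t50\t.\t+\t0\tID=cds1;Name=cds1;Parent=mrna1\n"
    "chr1\tsrc\tCDS\t80\t120\t.\t+\t1\tID=cds1;Name=cds2;Parent=mrna1\n"
)

if __name__ == "__main__":
    print(multifeature_name_conflicts(SAMPLE))
```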
In a syntactic sense, yes, the input is a valid GFF3 file. But in a semantic sense it's not: it declares 4+ CDSs per mRNA. This doesn't trigger a warning or error message with the GenomeTools validator, but it does trigger corrective measures, probably in the GenomeTools GFF3 writer if I had to guess.
There's an argument that this particular input should cause a warning message in the GFF3 validator. But the link to the GFF3 spec in my last comment shows a valid, though less commonly used, encoding of multiple CDSs for a single mRNA. Distinguishing between these two scenarios is a bit involved, and presumably determined to be out-of-scope for the validator by the GenomeTools folks.
Updating the GenomeTools validator, GFF3 parser, or GFF3 writer are all possibilities, though they would be pretty labor intensive. Perhaps the solution is to clarify that AEGeAn Toolkit programs, and the GenomeTools library on which they rely, will work correctly when CDS features are encoded as per the spec, but may have unexpected behavior when input data deviates from the spec.
Note: the documentation already discusses GFF3 and some common pitfalls: https://aegean.readthedocs.io/en/stable/gff3.html. Perhaps I can add another point for this common multifeature ID issue here.
Ok. I think one solution in our case would be to remove the Name attribute from the CDS lines in the canon-gff3 produced file. Thanks for the clarification.
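For anyone wanting to apply that workaround, a minimal Python sketch of a post-processing step follows. It drops the Name attribute from CDS records only and leaves every other line untouched. The function name and sample records are hypothetical, and this is not part of canon-gff3 or GenomeTools.

```python
def strip_cds_names(gff3_text):
    """Remove the Name attribute from CDS records; pass other lines through."""
    out = []
    for line in gff3_text.splitlines():
        cols = line.split("\t")
        if not line.startswith("#") and len(cols) == 9 and cols[2] == "CDS":
            fields = [
                field
                for field in cols[8].split(";")
                if field and not field.startswith("Name=")
            ]
            # GFF3 uses "." for an empty attribute column.
            cols[8] = ";".join(fields) if fields else "."
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical records for illustration.
SAMPLE = (
    "##gff3-version 3\n"
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Name=m1\n"
    "chr1\tsrc\tCDS\t1\t40\t.\t+\t0\tID=c1;Name=c1;Parent=m1\n"
)

if __name__ == "__main__":
    print(strip_cds_names(SAMPLE), end="")
```

Filtering on the feature type keeps the Name on genes and mRNAs, where it is still consistent with the ID.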
Input:
Run:
This shows that canon-gff3 creates a mismatch between Name and ID (see CDS:5).