canon-gff3 problem with IDs

vpbrendel commented 5 years ago

Input:

##gff-version 3
##sequence-region   Chr5 113921 116411
Chr5    Araport11       gene    113921  116411  .       -       .       ID=AT5G01280;Name=AT5G01280;Note=GPI-anchored protein;curator_summary=Encodes a microtubule-associated protein.;Alias=BPP3,BASIC PROLINE-RICH PROTEIN3;computational_description=BEST Arabidopsis thaliana protein match is: proline-rich family protein (TAIR:AT3G09000.1)%3B Has 1807 Blast hits to 1807 proteins in 277 species: Archae - 0%3B Bacteria - 0%3B Metazoa - 736%3B Fungi - 347%3B Plants - 385%3B Viruses - 0%3B Other Eukaryotes - 339 (source: NCBI BLink).;Dbxref=PMID:24134884,locus:2150144;locus_type=protein_coding
Chr5    Araport11       mRNA    113921  116306  .       -       .       ID=AT5G01280.3;Parent=AT5G01280;Name=AT5G01280.3;Note=GPI-anchored protein;Alias=BPP3,BASIC PROLINE-RICH PROTEIN3;Dbxref=UniProt:Q9LFA8
Chr5    Araport11       three_prime_UTR 113921  114184  .       -       .       ID=AT5G01280:three_prime_UTR:3;Parent=AT5G01280.3;Name=AT5G01280:three_prime_UTR:3
Chr5    Araport11       exon    113921  114373  .       -       .       ID=AT5G01280:exon:8;Parent=AT5G01280.3;Name=AT5G01280:exon:8
Chr5    Araport11       CDS     114185  114373  .       -       0       ID=AT5G01280:CDS:7;Parent=AT5G01280.3;Name=AT5G01280:CDS:7
Chr5    Araport11       CDS     114465  115487  .       -       0       ID=AT5G01280:CDS:5;Parent=AT5G01280.3;Name=AT5G01280:CDS:5
Chr5    Araport11       CDS     115604  115691  .       -       1       ID=AT5G01280:CDS:4;Parent=AT5G01280.3;Name=AT5G01280:CDS:4
Chr5    Araport11       CDS     116155  116237  .       -       0       ID=AT5G01280:CDS:3;Parent=AT5G01280.3;Name=AT5G01280:CDS:3
Chr5    Araport11       mRNA    113921  116306  .       -       .       ID=AT5G01280.2;Parent=AT5G01280;Name=AT5G01280.2;Note=GPI-anchored protein;Alias=BPP3,BASIC PROLINE-RICH PROTEIN3
Chr5    Araport11       three_prime_UTR 113921  114455  .       -       .       ID=AT5G01280:three_prime_UTR:2;Parent=AT5G01280.2;Name=AT5G01280:three_prime_UTR:2
Chr5    Araport11       exon    113921  115487  .       -       .       ID=AT5G01280:exon:7;Parent=AT5G01280.2;Name=AT5G01280:exon:7
Chr5    Araport11       CDS     114456  115487  .       -       0       ID=AT5G01280:CDS:6;Parent=AT5G01280.2;Name=AT5G01280:CDS:6
Chr5    Araport11       CDS     115604  115691  .       -       1       ID=AT5G01280:CDS:4.1;Parent=AT5G01280.2;Name=AT5G01280:CDS:4.1
Chr5    Araport11       CDS     116155  116237  .       -       0       ID=AT5G01280:CDS:3.1;Parent=AT5G01280.2;Name=AT5G01280:CDS:3.1
Chr5    Araport11       exon    116155  116306  .       -       .       ID=AT5G01280:exon:2;Parent=AT5G01280.3,AT5G01280.2;Name=AT5G01280:exon:2
Chr5    Araport11       five_prime_UTR  116238  116306  .       -       .       ID=AT5G01280:five_prime_UTR:2;Parent=AT5G01280.3,AT5G01280.2;Name=AT5G01280:five_prime_UTR:2
Chr5    Araport11       mRNA    113999  116411  .       -       .       ID=AT5G01280.1;Parent=AT5G01280;Name=AT5G01280.1;Note=GPI-anchored protein;curator_summary=Encodes a microtubule-associated protein.;conf_class=4;Alias=BPP3,BASIC PROLINE-RICH PROTEIN3;computational_description=BEST Arabidopsis thaliana protein match is: proline-rich family protein (TAIR:AT3G09000.1)%3B Has 1807 Blast hits to 1807 proteins in 277 species: Archae - 0%3B Bacteria - 0%3B Metazoa - 736%3B Fungi - 347%3B Plants - 385%3B Viruses - 0%3B Other Eukaryotes - 339 (source: NCBI BLink).;conf_rating=***;Dbxref=gene:3442612
Chr5    Araport11       three_prime_UTR 113999  114184  .       -       .       ID=AT5G01280:three_prime_UTR:1;Parent=AT5G01280.1;Name=AT5G01280:three_prime_UTR:1
Chr5    Araport11       exon    113999  114373  .       -       .       ID=AT5G01280:exon:6;Parent=AT5G01280.1;Name=AT5G01280:exon:6
Chr5    Araport11       CDS     114185  114373  .       -       0       ID=AT5G01280:CDS:7.1;Parent=AT5G01280.1;Name=AT5G01280:CDS:7.1
Chr5    Araport11       CDS     114465  115487  .       -       0       ID=AT5G01280:CDS:5.1;Parent=AT5G01280.1;Name=AT5G01280:CDS:5.1
Chr5    Araport11       exon    114465  115487  .       -       .       ID=AT5G01280:exon:5;Parent=AT5G01280.3,AT5G01280.1;Name=AT5G01280:exon:5
Chr5    Araport11       CDS     115604  115691  .       -       1       ID=AT5G01280:CDS:4.2;Parent=AT5G01280.1;Name=AT5G01280:CDS:4.2
Chr5    Araport11       exon    115604  115691  .       -       .       ID=AT5G01280:exon:4;Parent=AT5G01280.3,AT5G01280.2,AT5G01280.1;Name=AT5G01280:exon:4
Chr5    Araport11       exon    116155  116266  .       -       .       ID=AT5G01280:exon:3;Parent=AT5G01280.1;Name=AT5G01280:exon:3
Chr5    Araport11       CDS     116155  116266  .       -       2       ID=AT5G01280:CDS:2;Parent=AT5G01280.1;Name=AT5G01280:CDS:2
Chr5    Araport11       CDS     116344  116386  .       -       0       ID=AT5G01280:CDS:1;Parent=AT5G01280.1;Name=AT5G01280:CDS:1
Chr5    Araport11       exon    116344  116411  .       -       .       ID=AT5G01280:exon:1;Parent=AT5G01280.1;Name=AT5G01280:exon:1
Chr5    Araport11       five_prime_UTR  116387  116411  .       -       .       ID=AT5G01280:five_prime_UTR:1;Parent=AT5G01280.1;Name=AT5G01280:five_prime_UTR:1
###
Chr5    Araport11       protein 114185  116237  .       -       .       ID=AT5G01280.3-Protein;Name=AT5G01280.3;Derives_from=AT5G01280.3
###
Chr5    Araport11       protein 114185  116386  .       -       .       ID=AT5G01280.1-Protein;Name=AT5G01280.1;Derives_from=AT5G01280.1
###
Chr5    Araport11       protein 114456  116237  .       -       .       ID=AT5G01280.2-Protein;Name=AT5G01280.2;Derives_from=AT5G01280.2
###

Run:

cat test.gff3 | canon-gff3 > test.canon.gff3

echo "in test.gff3:"
echo ""
egrep "114185|114465" test.gff3 | egrep CDS
echo ""
echo ""
echo "in test.canon.gff3:"
echo ""
egrep "114185|114465" test.canon.gff3 | egrep CDS

This shows that canon.gff3 creates a mismatch between Name and ID (see CDS:5).

standage commented 5 years ago

I think this represents a discrepancy between how the CDS is encoded in the Araport file and how CDSs are typically encoded in GFF3. In this test file, there are 4-5 CDS records associated with each mRNA. These of course don't represent distinct coding sequences, but a single discontinuous CDS. In GFF3, this is typically represented as a multifeature, where multiple records/lines are required to fully represent a single feature. The key to defining multifeatures is that each record associated with the multifeature must have the same ID (see the canonical gene example from the GFF3 spec).

The Araport GFF3 doesn't follow this convention. Frankly, this isn't all that uncommon. The canon-gff3 program uses the GenomeTools GFF3 parser, and it looks like the GenomeTools core devs designed the parser to detect cases like this and interpret them correctly as multifeatures. The result is a multifeature in the output, where the ID now equals the ID of the first record encountered. Since GenomeTools doesn't automatically update the Name attribute, this is the source of the discrepancy.

vpbrendel commented 5 years ago

Hmm. This is still a problem. See [1]gt gff3validator test.gff3 input is valid GFF3 []gt gff3validator test.canon.gff3 gt gff3validator: error: the multi-feature with ID "AT5G01280:CDS:7" on line 9 in file "test.canon.gff3" has a different attribute 'Name' than its counterpart on line 8 ('AT5G01280:CDS:5' vs. 'AT5G01280:CDS:7')

That is, canon-gff3 turns valid GFF3 into something that is not.

standage commented 5 years ago

In a syntactic sense, yes the input is a valid GFF3 file. But in a semantic sense it's not—4+ CDSs per mRNA. This doesn't trigger a warning or error message with the GenomeTools validator, but it does trigger corrective measures, probably in the GenomeTools GFF3 writer if I had to guess.

There's an argument that this particular input should cause a warning message in the GFF3 validator. But the link to the GFF3 spec in my last comment shows a valid, though less commonly used, encoding of multiple CDSs for a single mRNA. Distinguishing between these two scenarios is a bit involved, and presumably determined to be out-of-scope for the validator by the GenomeTools folks.

Updating the GenomeTools validator, GFF3 parser, or GFF3 writer are all possibilities, though they would be pretty labor intensive. Perhaps the solution is to clarify that AEGeAn Toolkit programs, and the GenomeTools library on which they rely, will work correctly when CDS features are encoded as per the spec, but may have unexpected behavior when input data deviates from the spec.

Note: the documentation already discusses GFF3 and some common pitfalls: https://aegean.readthedocs.io/en/stable/gff3.html. Perhaps I can add another point for this common multifeature ID issue here.

vpbrendel commented 5 years ago

Ok. I think one solution in our case would be to remove the Name attribute from the CDS lines in the canon-gff3 produced file. Thanks for the clarification.

BrendelGroup / AEGeAn

canon-gff3 problem with IDs #206