NAL-i5K / AgBioData_GFF3_recommendation

The AgBioData GFF3 working group has developed recommendations to solve common problems in the GFF3 format. We suggest improvements for each of the GFF3 fields, as well as for the special cases of modeling functional annotations and standard protein-coding genes. We welcome discussion of these recommendations from the larger community.
Creative Commons Zero v1.0 Universal

Modeling non-gene features as top-level `type` fields #8

Open ifiddes opened 2 years ago

ifiddes commented 2 years ago

Under the recommendations for the type field, you say:

Best practice: Top-level feature types can include gene and pseudogene. Optionally, include a so_term_name attribute in column 9 to specify the child (type) of gene - e.g. protein_coding_gene, ncRNA_gene, miRNA_gene and snoRNA_gene (http://purl.obolibrary.org/obo/SO_0000704). Transcript features should include the appropriate SO term in column 3 (e.g. mRNA, snoRNA, etc).

I agree with all of this, but I think that the recommendation should be extended further to regularize non-transcribed features.

Right now non-transcribed features can be all over the map, and as a result become hard to parse. In the NCBI annotation of GRCh38, a wide array of top-level non-gene features are used. Additionally, I have not seen any spec define a collection of non-transcribed features (analogous to isoforms of a gene).

In the specification I built under the BioCantor repo, I attempted to regularize top-level features by calling any grouping of non-transcribed features a biological_region (which I chose based on SO:0001411), and then deviated from SO by calling any interval in that grouping a feature_interval. I also chose to call a "joined" interval of a non-transcribed feature (analogous to an exon) a subregion.
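To make the three-level grouping concrete, here is a minimal Python sketch of how such a hierarchy could be written out as GFF3 rows. It is an illustration only, not BioCantor's actual code or output: the class names, the `to_gff3_lines` helper, the `example` source value, and the ID scheme are all hypothetical, and only the three type names (biological_region, feature_interval, subregion) come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FeatureInterval:
    """One non-transcribed feature; analogous to a transcript (hypothetical model)."""
    feature_id: str
    subregions: List[Tuple[int, int]]  # 1-based inclusive (start, end) blocks, analogous to exons


@dataclass
class BiologicalRegion:
    """A grouping of feature intervals; analogous to a gene (hypothetical model)."""
    region_id: str
    seqid: str
    strand: str
    intervals: List[FeatureInterval] = field(default_factory=list)


def to_gff3_lines(region: BiologicalRegion, source: str = "example") -> List[str]:
    """Render the region as parent/child GFF3 rows linked by ID and Parent."""

    def row(ftype: str, start: int, end: int, attrs: str) -> str:
        return "\t".join(
            [region.seqid, source, ftype, str(start), str(end), ".", region.strand, ".", attrs]
        )

    # The top-level feature spans every block of every child interval.
    blocks = [b for iv in region.intervals for b in iv.subregions]
    lines = [
        row(
            "biological_region",
            min(s for s, _ in blocks),
            max(e for _, e in blocks),
            f"ID={region.region_id}",
        )
    ]
    for iv in region.intervals:
        lines.append(
            row(
                "feature_interval",
                min(s for s, _ in iv.subregions),
                max(e for _, e in iv.subregions),
                f"ID={iv.feature_id};Parent={region.region_id}",
            )
        )
        for i, (start, end) in enumerate(iv.subregions, 1):
            lines.append(
                row("subregion", start, end, f"ID={iv.feature_id}.sub{i};Parent={iv.feature_id}")
            )
    return lines


region = BiologicalRegion(
    region_id="region1",
    seqid="chr1",
    strand="+",
    intervals=[FeatureInterval("feat1", [(100, 200), (300, 400)])],
)
for line in to_gff3_lines(region):
    print(line)
```

The sketch assumes every region has at least one interval with at least one block; an empty region would need separate handling.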

vkkodali commented 2 years ago

Hi @ifiddes, thank you for your comment. Currently, the focus of these recommendations is on protein-coding genes. The point here is a general recommendation to just use “gene” and “pseudogene” in column 3 for genes, and to provide additional granularity of gene types in column 9, as opposed to saying protein_coding_gene in column 3. Properly parsing the broader scope of SO types that can be represented in GFF3 requires using the SO hierarchy. While I understand the challenges posed by using a wide range of terms in column 3, I believe calling everything “biological_region” would be a huge generalization, and would force ad hoc processing of non-standard attributes in column 9 to make use of the rich annotation.
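As a rough illustration of the column 3 / column 9 split described here, the following Python sketch reads one GFF3 feature line, takes the coarse type from column 3, and looks for the optional so_term_name attribute in column 9. The function names and the example line are hypothetical; only the so_term_name attribute and the gene / pseudogene types come from the recommendation text quoted in this thread.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns and parse column 9."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # GFF3 percent-encodes reserved characters in attribute keys and values.
        attributes[unquote(key)] = unquote(value)
    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "attributes": attributes,
    }


def gene_subtype(feature):
    """Return the finer-grained SO term from column 9, else fall back to column 3."""
    return feature["attributes"].get("so_term_name", feature["type"])


# Hypothetical example line: coarse type in column 3, granular type in column 9.
example = "chr1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene1;so_term_name=protein_coding_gene"
feature = parse_gff3_line(example)
print(feature["type"], gene_subtype(feature))  # prints: gene protein_coding_gene
```

A parser written this way can group on the small set of column 3 values and consult so_term_name only when the finer distinction matters. Handling the broader range of SO types would still require the SO hierarchy, which this sketch does not attempt.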