The-Sequence-Ontology / SO-Ontologies

Collection of SO Ontologies
Creative Commons Attribution 4.0 International

adding synonyms for INSDC terms #378

murphyte closed this issue 6 years ago

murphyte commented 8 years ago

Hi SO -- we're continuing to revise our mappings of INSDC to SO terms used in some code at NCBI, and I think it would make sense to add many of the INSDC terms as synonyms in SO. I've reviewed the list, and please consider adding the terms in the table below. For this list, I've included underscores because those are part of the INSDC term, but I did not list anything where the INSDC term is already present if you treat underscores and spaces as equivalent. That is, for the SO term "minus_10_signal", the INSDC term is "-10_signal" (with an underscore), which I didn't include here because "-10 signal" is already present. Let me know if you think there would be some value in including the INSDC synonym with the underscore, and I'd be happy to add those.
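The underscore/space equivalence described above can be sketched as a small normalization step. This is only an illustration; the function names `normalize` and `is_new_synonym` are hypothetical and not part of any SO or NCBI tooling.

```python
def normalize(term: str) -> str:
    """Collapse underscores and runs of whitespace into single spaces."""
    return " ".join(term.replace("_", " ").split())


def is_new_synonym(candidate: str, existing: list[str]) -> bool:
    """Return True if no existing synonym matches after normalization."""
    known = {normalize(s) for s in existing}
    return normalize(candidate) not in known


# Example from the post: "-10 signal" is already a synonym of minus_10_signal,
# so the INSDC term "-10_signal" does not count as new.
existing_synonyms = ["-10 signal"]
print(is_new_synonym("-10_signal", existing_synonyms))   # False
print(is_new_synonym("mat_peptide", existing_synonyms))  # True
```

Case is deliberately left untouched here, since terms such as "UTR" are case-sensitive in practice.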

There might be some value in taking this one step further and adding a separate "INSDC term" field in SO rather than just having these mixed in as synonyms. If that's of interest, I can send along our full list. It's a little tricky because in some cases the mapping is directly to an INSDC feature type, whereas others are mapped to a "class" qualifier on a more general INSDC feature. For example, the SO term "minus_35_signal" corresponds to an INSDC "regulatory" feature with the qualifier /regulatory_class="minus_35_signal". This is true for the INSDC feature types "regulatory", "ncRNA", and "misc_recomb", which each have a "class" qualifier providing more specificity (often synonymous with SO). We can discuss it more if you think going this more formal route could be worthwhile.
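To illustrate the two kinds of mapping (feature type alone versus feature plus "class" qualifier), here is a minimal sketch. The dictionary layout, the use of `None` for "no qualifier", and the name `lookup_so_term` are assumptions made for the example; the three entries are taken from this issue.

```python
from typing import Optional

# Keys are (feature, qualifier, value). Qualifier and value are None when the
# INSDC feature type alone identifies the SO term.
INSDC_TO_SO = {
    ("regulatory", "regulatory_class", "minus_35_signal"): "minus_35_signal",
    ("mat_peptide", None, None): "mature_peptide",
    ("misc_recomb", None, None): "recombination_feature",
}


def lookup_so_term(feature: str,
                   qualifier: Optional[str] = None,
                   value: Optional[str] = None) -> Optional[str]:
    """Prefer a feature+qualifier match, then fall back to the feature alone."""
    specific = INSDC_TO_SO.get((feature, qualifier, value))
    if specific is not None:
        return specific
    return INSDC_TO_SO.get((feature, None, None))


print(lookup_so_term("regulatory", "regulatory_class", "minus_35_signal"))
print(lookup_so_term("mat_peptide"))
```

A feature with an unrecognized class value falls back to the general term for that feature type if one is listed, and to `None` otherwise.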

For review, the INSDC feature table specs are at: http://www.insdc.org/files/feature_table.html

But for starters, here is my list of proposed synonyms to add:

| SO term | proposed synonym(s) | notes |
|---|---|---|
| three_prime_UTR | 3'UTR, 3' UTR | - |
| five_prime_UTR | 5'UTR | - |
| C_gene_segment | C_region | - |
| D_gene_segment | D_segment | - |
| J_gene_segment | J_segment | - |
| GC_rich_promoter_region | GC_signal | - |
| ribosome_entry_site | RBS | - |
| TATA_box | TATA_signal | now deprecated as a separate INSDC feature type and replaced with "regulatory", and regulatory_class="TATA_box", but for historical purposes this may be a useful synonym |
| V_gene_segment | V_segment | - |
| gap | assembly gap | - |
| remark | comment | can exist as a feature type in NCBI ASN.1, but not a formal INSDC term so NCBI flatfile displays as misc_feature |
| mature_peptide | mat_peptide | - |
| binding_site | misc_binding | - |
| sequence_difference | misc_difference | - |
| sequence_feature | misc_feature | - |
| recombination_feature | misc_recomb | - |
| regulatory_region | misc_signal | INSDC term now deprecated, so could skip this one |
| sequence_secondary_structure | misc_structure | - |
| mobile_genetic_element | mobile_element | - |
| modified_DNA_base | modified_base | - |
| transcript | misc_RNA | - |
| primary_transcript | pre_RNA | used in NCBI ASN.1. INSDC term is precursor_RNA and "precursor RNA" is already a synonym, so feel free to skip this one |
| propeptide | proprotein | - |
| primary_transcript | prim_transcript | - |
| primer_binding_site | primer_bind | - |
| protein_binding_site | protein_bind | - |
| sequence_secondary_structure | SecStr | not INSDC per se because it's a protein feature, but it's used in GenBank GenPept format, e.g. https://www.ncbi.nlm.nih.gov/protein/5T6O_A |
| origin_of_replication | rep_origin | - |
| restriction_enzyme_cut_site | Rsite | not an INSDC term, but used in NCBI's ASN.1 format so it could be a useful synonym. Or feel free to skip this one |
| satellite_DNA | satellite | this isn't an INSDC feature but rather a qualifier used on a repeat_region feature, but it would be helpful to have the synonym in SO |
| signal_peptide | sig_peptide | - |
| sequence_alteration | variation | - |
| boundary_element | insulator | regulatory_class qualifier |
| ribosome_entry_site | ribosome_binding_site | regulatory_class qualifier |

Thanks!

-Terence Murphy

nicoleruiz commented 8 years ago

You can send us all the INSDC terms that you would like to map even if you see your term is already added as a synonym. We can specify that the synonym is from INSDC (ex. INSDC:-10_signal). We have done this for terms we have mapped that come from variant annotation tools. Your synonym would appear like this in the browser:

[Screenshot of the synonym display in the SO browser, 2016-11-03]
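As a rough text-only illustration of a source-prefixed synonym, the sketch below builds an OBO-style `synonym:` line. The helper name `insdc_synonym_line`, the `EXACT` scope, and the empty dbxref list are assumptions for the example, not a statement of how SO actually records these synonyms.

```python
def insdc_synonym_line(insdc_term: str, scope: str = "EXACT") -> str:
    """Build an OBO-style synonym line with an 'INSDC:' source prefix."""
    # Escape backslashes and double quotes so the quoted string stays valid.
    escaped = insdc_term.replace("\\", "\\\\").replace('"', '\\"')
    return f'synonym: "INSDC:{escaped}" {scope} []'


print(insdc_synonym_line("-10_signal"))
# synonym: "INSDC:-10_signal" EXACT []
```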

I couldn't find the term proprotein in the INSDC feature table spec. Is this term from a different controlled vocabulary?

@keilbeck Will it be a problem to map terms that are qualifiers and not feature types?

murphyte commented 8 years ago

> I couldn't find the term proprotein in the INSDC feature table spec. Is this term from a different controlled vocabulary?

The INSDC spec is specific for nucleotides. It includes some protein-related features, like signal peptides and mature peptides, but in the context of annotating a range on a nucleotide sequence overlapping a CDS feature. Beyond the INSDC spec and "GenBank" format, NCBI has the GenPept flatfile format for protein records, which includes some additional features that can be annotated on proteins. If they don't have a nucleotide equivalent, then they're converted to another feature type like "misc_feature" if projected from the protein onto a nucleotide record in GenBank flatfile format. Does that make sense?

Here's a protein record in GenPept format with a 'proprotein' example: https://www.ncbi.nlm.nih.gov/protein/NP_001165705.1

On top of that, there are a few additions to the INSDC specs that are coming soon (approved but not yet added to documentation). propeptide is one of those, meaning a 'proprotein' feature annotated on a protein sequence will be displayed as a 'propeptide' feature when projected onto a corresponding nucleotide sequence.
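The projection behaviour described here can be sketched as a lookup with a fallback. This is a simplified illustration, not NCBI's conversion code: the table and function names are hypothetical, and only the proprotein to propeptide pair comes from this thread.

```python
# Hypothetical table of protein-only feature types and their nucleotide
# equivalents. Only the proprotein -> propeptide entry is from the thread.
PROTEIN_TO_NUCLEOTIDE = {
    "proprotein": "propeptide",
}


def project_feature_type(protein_feature: str) -> str:
    """Return the feature type shown on the nucleotide record.

    Feature types with no nucleotide equivalent fall back to misc_feature.
    """
    return PROTEIN_TO_NUCLEOTIDE.get(protein_feature, "misc_feature")


print(project_feature_type("proprotein"))
print(project_feature_type("some_protein_only_feature"))  # placeholder name
```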

I'm not aware of public documentation with the full list of extra feature types supported in GenPept on top of what's in the INSDC feature spec. I could identify them in the full conversion table, and they could be added as "NCBI:" or "GenPept:" synonyms, instead of "INSDC:"

I should also say that this isn't a formally-endorsed INSDC:SO mapping table, and I'm thinking I should check into that before formally labeling these as "INSDC:" synonyms in the SO specs. I'll do that before providing the full table.

WRT the issue of mapping to feature type vs. qualifier, I'm looking at this as having two use cases:

  1. making it SO-legal to use an INSDC *class qualifier term verbatim, even if it's not an exact match to the equivalent SO term
  2. providing information to help others map INSDC feature or feature+qualifier values to SO equivalents.

For the first use case, I'd want to have just the regulatory_class qualifier value "INSDC:ribosome_binding_site" map to the SO term ribosome_entry_site (SO:0000139). For the 2nd use case, it's helpful to know both the feature and qualifier values, like "INSDC:regulatory-ribosome_binding_site". I'm not sure what the best way to express that in the SO records would be.
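The two synonym forms can be put side by side in a short sketch. The helper names are hypothetical, and the hyphen-joined form simply mirrors the example string above; nothing here reflects a settled SO convention.

```python
def qualifier_value_synonym(value: str) -> str:
    """Use case 1: the class qualifier value alone."""
    return f"INSDC:{value}"


def feature_qualified_synonym(feature: str, value: str) -> str:
    """Use case 2: the feature type joined to the qualifier value."""
    return f"INSDC:{feature}-{value}"


# Both forms point at ribosome_entry_site (SO:0000139).
synonyms_for = {
    "SO:0000139": [
        qualifier_value_synonym("ribosome_binding_site"),
        feature_qualified_synonym("regulatory", "ribosome_binding_site"),
    ],
}
print(synonyms_for["SO:0000139"])
```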

Best regards,

-Terence

murphyte commented 7 years ago

Noting the existence of this ancient INSDC:SO correspondence table: http://sequenceontology.org/resources/mapping/FT_SO.html

There's a broken link to that page at: http://www.sequenceontology.org/resources/faq.html#map

I'm working on formalizing my mapping table with the INSD collaborators, but I think it's about done. Hopefully I'll be able to get the file to you next week, and we can discuss further how to fit it into your official synonym data model.

-Terence

murphyte commented 7 years ago

I won't call this an official INSDC:SO mapping table, but I'm attaching the mapping that we're currently using for conversion between INSDC (or more specifically NCBI ASN.1) and SO terms. A few comments about the mapping:

  1. In many cases it's a combination of INSDC feature + qualifier + qualifier value that is needed to map to a specific SO term, which I've included as the first three columns of the table. This will require establishing some formatting rules to fit into the existing SO synonym mechanism. Perhaps "INSDC:ncRNA:ncRNA_class:miRNA" to indicate the INSDC feature/qualifier/value triplet to use for SO:0000276 (miRNA)?
  2. We're internally using a few non-INSDC qualifiers like "/feat_class" on misc_feature for RefSeq to aid in conversion from ASN.1 to SO. For converting SO terms to ASN.1, like when submitting to GenBank, /feat_class will be ignored, so we also provide the value in the INSDC /note qualifier. For the purpose of SO, it may make the most sense to add these as always using the note qualifier, like "INSDC:misc_feature:note:conserved_region" mapping to SO:0000330 (conserved_region)
  3. There are a few places where there are multiple INSDC terms that I've mapped to a single SO term, like INSDC assembly_gap and gap. For SO, that will necessitate having two INSDC synonyms, and indicating which one is preferred.
  4. There are some gaps in the table, most of which have open SO tickets to add terms (#379, #386).
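The proposed colon-separated triplet format could be built and parsed as in the sketch below. The `InsdcKey` type and both function names are hypothetical; the two example strings and SO IDs are the ones given in the list above.

```python
from typing import NamedTuple, Optional


class InsdcKey(NamedTuple):
    feature: str
    qualifier: Optional[str] = None
    value: Optional[str] = None


def format_synonym(key: InsdcKey) -> str:
    """Render 'INSDC:feature' or 'INSDC:feature:qualifier:value'."""
    parts = ["INSDC", key.feature]
    if key.qualifier is not None and key.value is not None:
        parts += [key.qualifier, key.value]
    return ":".join(parts)


def parse_synonym(text: str) -> InsdcKey:
    """Parse a synonym string back into its feature/qualifier/value parts."""
    # maxsplit=3 keeps any colons inside the qualifier value intact.
    parts = text.split(":", 3)
    if parts[0] != "INSDC" or len(parts) not in (2, 4):
        raise ValueError(f"not an INSDC synonym: {text!r}")
    return InsdcKey(*parts[1:])


SO_SYNONYMS = {
    "SO:0000276": format_synonym(InsdcKey("ncRNA", "ncRNA_class", "miRNA")),
    "SO:0000330": format_synonym(InsdcKey("misc_feature", "note", "conserved_region")),
}
print(SO_SYNONYMS["SO:0000276"])
print(parse_synonym(SO_SYNONYMS["SO:0000330"]))
```

Where several INSDC terms map to one SO term (item 3), each would get its own string, and a "preferred" marker would need an extra field that this sketch does not model.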

Please take a look and see what you think of my proposed solution for adding INSDC triplets to synonyms, or if there's a better way. But it seems like getting the mapping into SO would save others the effort of repeating the exercise.

-Terence

INSDC_SO_mapping.xlsx

nicoleruiz commented 6 years ago

I have added all of the INSDC mappings. Please let me know if there are any additional changes or mappings that need to be added.