geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

MEROPS entries are classified as proteins, but this isn't true for all MEROPS entries #1875

Open cmungall opened 2 years ago

cmungall commented 2 years ago

Current entry:

- database: MEROPS
  name: MEROPS peptidase database
  generic_urls:
    - https://www.ebi.ac.uk/merops/
  entity_types:
    - type_name: protein
      type_id: PR:000000001
      url_syntax: https://www.ebi.ac.uk/merops/cgi-bin/pepsum?id=[example_id]
      example_id: MEROPS:A08.001
      example_url: https://www.ebi.ac.uk/merops/cgi-bin/pepsum?id=A08.001
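
For reference, a minimal sketch of how the `url_syntax` template in this entry appears to relate to `example_id` and `example_url`: the `MEROPS:` prefix is dropped and the remaining local ID is substituted for `[example_id]`. The function name `expand_url` is hypothetical and is not part of go-site; the substitution rule is inferred only from the three fields shown above.

```python
def expand_url(url_syntax: str, curie: str) -> str:
    """Fill the [example_id] placeholder with the local part of a PREFIX:ID string.

    Hypothetical helper for illustration; not taken from the go-site tooling.
    """
    # Split off the database prefix ("MEROPS") so only the local ID remains.
    _prefix, _sep, local_id = curie.partition(":")
    if not local_id:
        raise ValueError(f"expected PREFIX:ID, got {curie!r}")
    return url_syntax.replace("[example_id]", local_id)


url_syntax = "https://www.ebi.ac.uk/merops/cgi-bin/pepsum?id=[example_id]"
print(expand_url(url_syntax, "MEROPS:A08.001"))
```

Run against the values in the entry, this reproduces the `example_url` field, so the URL side of the entry looks consistent; the open question is only the `type_name`.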

This is currently causing problems for groups that want to use MEROPS in conjunction with IKR, as IKR is restricted to families.

There is an issue for better documenting this:

I also made a separate issue in bioregistry for MEROPS; it seems we are all treating MEROPS differently, and each incorrectly in its own way:

sjm41 commented 2 years ago

MEROPS entries of the given form (e.g. MEROPS:A08.001) are called "MEROPS IDs" at their database. A "MEROPS ID" groups equivalent peptidases (though it can sometimes represent a single peptidase?) and is defined here as:

"Each peptidase is given a unique identifier known as a MEROPS ID. The identifier consists of the family identifier (padded to three characters), a dot, and a three-digit number, e.g. S01.001. Peptidases from different organisms are assigned to a single ID when the available data indicate that they are equivalent. Special forms of MEROPS ID are used for uncharacterized peptidases from model organisms, unassigned peptidases, non-peptidase homologues, pseudogenes and unsequenced peptidases."

An index of MEROPS IDs is here: https://www.ebi.ac.uk/merops/cgi-bin/id_index?type=peptidase;action=A
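
In case it helps anyone validating these xrefs, here is a rough sketch of a check for the standard form described in that definition (family identifier padded to three characters, a dot, a three-digit number). It assumes the family identifier is one uppercase letter followed by two digits, which is only inferred from the examples S01.001 and A08.001. It deliberately does not match the "special forms" mentioned in the definition, since their syntax is not given here. The names `MEROPS_ID` and `split_merops_id` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: family identifier = one uppercase letter + two digits (e.g. S01, A08),
# followed by a dot and a three-digit number. Special forms are not covered.
MEROPS_ID = re.compile(r"(?P<family>[A-Z]\d{2})\.(?P<number>\d{3})")


def split_merops_id(merops_id: str) -> Optional[Tuple[str, str]]:
    """Return (family, number) for a standard-form MEROPS ID, else None."""
    match = MEROPS_ID.fullmatch(merops_id)
    if match is None:
        return None
    return match.group("family"), match.group("number")


print(split_merops_id("S01.001"))
print(split_merops_id("A08.001"))
```

Note that the function expects the bare local ID, so a prefixed value such as `MEROPS:A08.001` would need the prefix stripped first.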

AFAIK, these entries always correspond to 'families'. E.g., the example given in the GO entry above (A08.001) is for "signal peptidase II" and has the "holotype" of "signal peptidase II (Escherichia coli), Uniprot accession P00804 (peptidase unit: 1-164), MERNUM MER0001313". But this ID encompasses hundreds of individual sequences from different species: https://www.ebi.ac.uk/merops/cgi-bin/sequence_data?mid=A08.001

So I think the GO xrefs metadata file should have "type_name: gene/protein family" for MEROPS.

cmungall commented 2 years ago

@pgaudet made an alternate fix for #1882.

I think this is probably good enough for now?