biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

WormBase regex is too restrictive #1186

Closed by cmungall 2 days ago

cmungall commented 4 days ago

Bioregistry complains about this: https://bioregistry.io/reference/wormbase:/CE00274

but this is a valid WormBase protein ID.

The GO registry has info that could be used here, but IMO Bioregistry could just not bother having a regex for many local IDs; it's hard to keep them up to date.

- database: WB
  name: WormBase database of nematode biology
  synonyms:
    - WormBase
  rdf_uri_prefix: https://identifiers.org/wb
  generic_urls:
    - http://www.wormbase.org/
  entity_types:
    - type_name: gene
      type_id: SO:0000704
      id_syntax: (CE[0-9]{5})|(WB(Gene|Var|RNAi|Transgene)[0-9]{8})
      url_syntax: http://www.wormbase.org/get?name=[example_id]
      example_id: WB:WBGene00003001
      example_url: http://www.wormbase.org/get?name=WBGene00003001
    - type_name: variation
      type_id: VariO:0001
      id_syntax: (CE[0-9]{5})|(WB(Gene|Var|RNAi|Transgene)[0-9]{8})
      url_syntax: http://www.wormbase.org/get?name=[example_id]
      example_id: WB:WBVar00089843
      example_url: http://www.wormbase.org/get?name=WBVar00089843
    - type_name: RNAi
      type_id: topic:3523
      id_syntax: (CE[0-9]{5})|(WB(Gene|Var|RNAi|Transgene)[0-9]{8})
      url_syntax: http://www.wormbase.org/get?name=[example_id]
      example_id: WB:WBRNAi00110325
      example_url: http://www.wormbase.org/get?name=WBRNAi00110325
    - type_name: Transgene
      type_id: SO:0000902
      id_syntax: (CE[0-9]{5})|(WB(Gene|Var|RNAi|Transgene)[0-9]{8})
      url_syntax: http://www.wormbase.org/get?name=[example_id]
      example_id: WB:WBTransgene00000059
      example_url: http://www.wormbase.org/get?name=WBTransgene00000059
    - type_name: protein
      type_id: PR:000000001
      id_syntax: (CE[0-9]{5})|(WB(Gene|Var|RNAi|Transgene)[0-9]{8})
      url_syntax: http://www.wormbase.org/get?name=[example_id]
      example_id: WB:CE00274
      example_url: http://www.wormbase.org/get?name=CE00274
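
As a quick sanity check of the entry above, the following sketch matches the local part of each example_id against the shared id_syntax. It assumes that the `WB:` prefix is stripped before matching and that the pattern is meant to cover the whole identifier (hence `fullmatch`); neither is stated in the registry excerpt, and the helper names are hypothetical.

```python
import re

# id_syntax copied verbatim from the GO registry entry above
GO_ID_SYNTAX = re.compile(r"(CE[0-9]{5})|(WB(Gene|Var|RNAi|Transgene)[0-9]{8})")

# example_id values copied from the entry above
EXAMPLE_IDS = [
    "WB:WBGene00003001",
    "WB:WBVar00089843",
    "WB:WBRNAi00110325",
    "WB:WBTransgene00000059",
    "WB:CE00274",
]


def local_id(curie: str) -> str:
    """Drop the database prefix (assumed: everything up to the first colon)."""
    return curie.split(":", 1)[1]


def check_examples() -> dict[str, bool]:
    """Map each example CURIE to whether its local ID fully matches id_syntax."""
    return {
        curie: GO_ID_SYNTAX.fullmatch(local_id(curie)) is not None
        for curie in EXAMPLE_IDS
    }


if __name__ == "__main__":
    for curie, ok in check_examples().items():
        print(f"{curie}: {'ok' if ok else 'no match'}")
```

Under those assumptions, every example in the entry, including the protein ID CE00274, satisfies the GO registry's pattern.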

bgyori commented 4 days ago

Thanks @cmungall! The current pattern in the Bioregistry

^WB[A-Z][a-z]+\d+$

implicitly covers this part

WB(Gene|Var|RNAi|Transgene)[0-9]{8}

of the GO registry's pattern, so it looks like adding CE[0-9]{5} is the only thing that would be needed here.

The patterns do have some added value for validation (is the ID valid?) and discovery (what kind of ID is this?) use cases, which I think Bioregistry users often face, so I would rather extend the pattern than revert it to something completely generic.
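
To make the comparison concrete, here is a small sketch that runs the current Bioregistry pattern and one possible extension over the example IDs from the GO registry entry. The `EXTENDED` pattern is an assumption for illustration only, not the fix that was actually applied. One wrinkle shows up when the check is run as written: `[A-Z][a-z]+` requires a lowercase letter right after the first capital, so `WBRNAi00110325` does not match the current pattern either. The sketch therefore also relaxes that part to `[A-Za-z]+`.

```python
import re

# Current Bioregistry pattern, as quoted in this thread
CURRENT = re.compile(r"^WB[A-Z][a-z]+\d+$")

# Hypothetical extension: adds the CE protein IDs and allows mixed-case
# type names such as "RNAi" (illustrative assumption, not the merged fix)
EXTENDED = re.compile(r"^(CE\d{5}|WB[A-Z][A-Za-z]+\d+)$")

# Local IDs taken from the GO registry examples quoted in this issue
IDS = [
    "WBGene00003001",
    "WBVar00089843",
    "WBRNAi00110325",
    "WBTransgene00000059",
    "CE00274",
]


def report() -> dict[str, tuple[bool, bool]]:
    """Map each ID to (matches CURRENT, matches EXTENDED)."""
    return {
        local: (bool(CURRENT.match(local)), bool(EXTENDED.match(local)))
        for local in IDS
    }


if __name__ == "__main__":
    for local, (cur, ext) in report().items():
        print(f"{local:22} current={cur!s:5} extended={ext}")
```

Keeping the `WB` branch loose while pinning the `CE` branch to five digits mirrors the trade-off discussed above: the pattern still supports validation and discovery without enumerating every WormBase type name.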