bridgedb / BridgeDb

The BridgeDb Library source code
https://bridgedb.org/
Apache License 2.0

hgnc identifiers - support with and without HGNC: #15

Open stain opened 9 years ago

stain commented 9 years ago

In http://identifiers.org/hgnc/ we see Identifier pattern ^((HGNC|hgnc):)?\d{1,5}$ which is (somewhat) reflected in the HGNC Accession number entry https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.ttl#L2762

bridgeDB:hasRegexPattern "^(HGNC:)?\\d{1,5}$" ;

This means that identifiers http://identifiers.org/hgnc/47710 and http://identifiers.org/hgnc/HGNC:47710 and http://identifiers.org/hgnc/hgnc:47710 are all valid - and indeed all resolve to RNU6-747P.

The IdentityMappingService is, however, unable to know these are the same thing unless we move HGNC: out of the regular expression and add alternative URI prefixes. Currently these will be tracked as three separate identifiers (47710, HGNC:47710 and hgnc:47710) in the same dataset.
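As a rough illustration of the mismatch (a sketch only, not BridgeDb code): the two patterns quoted above can be compared directly, and a small helper can reduce all three spellings to the bare accession number. The helper name `normalise_hgnc` is hypothetical and does not exist in BridgeDb.

```python
import re

# Pattern published by identifiers.org for the hgnc namespace (quoted above).
IDENTIFIERS_ORG = re.compile(r"^((HGNC|hgnc):)?\d{1,5}$")
# Pattern currently in IdentifiersOrgDataSource.ttl (quoted above).
BRIDGEDB = re.compile(r"^(HGNC:)?\d{1,5}$")


def normalise_hgnc(local_id):
    """Hypothetical helper: strip an optional HGNC:/hgnc: prefix.

    Returns the bare accession number, or None if the input does not
    look like an HGNC identifier at all.
    """
    match = re.fullmatch(r"(?:(?:HGNC|hgnc):)?(\d{1,5})", local_id)
    return match.group(1) if match else None


if __name__ == "__main__":
    for form in ["47710", "HGNC:47710", "hgnc:47710"]:
        print(
            form,
            "identifiers.org:", bool(IDENTIFIERS_ORG.match(form)),
            "bridgedb:", bool(BRIDGEDB.match(form)),
            "normalised:", normalise_hgnc(form),
        )
```

With these patterns the lowercase hgnc:47710 form is accepted by identifiers.org but not by the current BridgeDb entry, while all three forms normalise to the same 47710.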

HGNC itself consistently identifies an "HGNC ID" with the prefix, e.g. HGNC:47710 - which is in accordance with rule 2 (to use CURIEs) of the 10 Simple rules for design, provision, and reuse of persistent identifiers for life science data (http://dx.doi.org/10.5281/zenodo.18003).

In Open PHACTS, earlier linksets used the style http://identifiers.org/hgnc/47710 - however @JonathanMELIUS's latest [Ensembl-to-HGNC linkset](http://bridgedb.org/data/linksets/HomoSapiens/Ensembl_Hs_hgnc.direct.LS.ttl) uses the style http://identifiers.org/hgnc/HGNC:47710, which adds the CURIE to the alternative base. Perhaps this is not ideal (and can probably be changed upstream) - anyway, as both patterns are accepted, the org.bridgedb.rdf entry should be updated to support both.

Christian-B commented 9 years ago

This is not the first time the ID part of a BridgeDb Xref includes text which should not be in the ID.

This also happens in, for example, ChEBI (bridgeDB:systemCode "Ce").

The OPS solution was to store the ID without the standard prefix: for ChEBI without the "CHEBI:", and here that would be without the "HGNC:".

OPS saved these as DataSource/ID pairs rather than Xrefs.

There is then a tool to convert pairs to Xrefs: https://github.com/bridgedb/BridgeDb/blob/OpenPHACTS/master/org.bridgedb.utils/src/org/bridgedb/pairs/CodeMapper.java

Special cases are then declared in the DataSource RDF file https://github.com/bridgedb/BridgeDb/blob/OpenPHACTS/master/org.bridgedb.rdf/resources/DataSource.ttl (see ChEBI).

This way OPS could use all the URLs discussed in this issue, yet still return the same Xref as in the past.
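The idea can be sketched roughly as follows. This is not the CodeMapper implementation; the table contents, the names `URI_PREFIXES`, `XREF_ID_PREFIX`, `uri_to_pair` and `pair_to_xref`, and the choice of which ID form the Xref carries are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Pair(NamedTuple):
    source: str    # data source name, e.g. "HGNC"
    local_id: str  # ID stored WITHOUT the standard prefix


# Hypothetical table: every URI prefix seen in use, mapped to its data source.
URI_PREFIXES = {
    "http://identifiers.org/hgnc/HGNC:": "HGNC",
    "http://identifiers.org/hgnc/hgnc:": "HGNC",
    "http://identifiers.org/hgnc/": "HGNC",
    "http://identifiers.org/chebi/CHEBI:": "ChEBI",
}

# Hypothetical "special cases": text put back in front of the ID when a
# pair is converted to an Xref, so callers see the same Xref as before.
XREF_ID_PREFIX = {"HGNC": "HGNC:", "ChEBI": "CHEBI:"}


def uri_to_pair(uri: str) -> Optional[Pair]:
    """Map a used URI to a (source, bare ID) pair, longest prefix first."""
    for prefix in sorted(URI_PREFIXES, key=len, reverse=True):
        if uri.startswith(prefix):
            return Pair(URI_PREFIXES[prefix], uri[len(prefix):])
    return None


def pair_to_xref(pair: Pair) -> Tuple[str, str]:
    """Convert a pair back to an Xref-like (source, ID) tuple."""
    return (pair.source, XREF_ID_PREFIX.get(pair.source, "") + pair.local_id)


if __name__ == "__main__":
    for uri in (
        "http://identifiers.org/hgnc/47710",
        "http://identifiers.org/hgnc/HGNC:47710",
        "http://identifiers.org/hgnc/hgnc:47710",
    ):
        pair = uri_to_pair(uri)
        print(uri, "->", pair, "->", pair_to_xref(pair))
```

Because the longest prefix is tried first, all three HGNC URI styles collapse to the same stored pair, and the prefix is only reattached at the point where an Xref is handed back.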

Christian



Christian-B commented 9 years ago

I think the URL http://identifiers.org/hgnc/HGNC:47710 is a mistake on the part of identifiers.org, especially as they also have http://identifiers.org/hgnc/47710

This is another way they break their own rule that there should be a single URL for each item.

Christian



Christian-B commented 9 years ago

In the OPS BridgeDb branch we did not consider what was a correct URI, only what was a USED URI. We then did what was required to support these USED URIs. The only known URI pattern we did not support when I left the project was one where the ID was split into two parts within the URI, as we had no use for it.

If the IMS had only wanted to support standard URIs it would have been a lot easier to write, but it would have missed many URIs that users were using.

Christian



stain commented 9 years ago

@JonathanMELIUS, while I think @Christian-B is right that we need to support what has been used, I also agree that http://identifiers.org/hgnc/47710 without HGNC: is "more correct" - would you be OK to change your Ensembl linkset to this, or is the style of http://identifiers.org/hgnc/HGNC:47710 also used upstream in Ensembl-RDF?

JonathanMELIUS commented 9 years ago

Yes sure.

egonw commented 7 years ago

@JonathanMELIUS, so, do I understand correctly that you have it without the HGNC: in the current Ensembl Derby files and link sets?