Open stain opened 9 years ago
This is not the first time the ID part of a BridgeDb Xref has included text which should not be in the ID.
This also happens with, for example, ChEBI (bridgeDB:systemCode "Ce").
The OPS solution was to store the ID without the standard prefix: for ChEBI without the "CHEBI:", and here that would be without the "HGNC:".
OPS saved these as DataSource/ID pairs rather than Xrefs.
There is then a tool to convert pairs to Xrefs: https://github.com/bridgedb/BridgeDb/blob/OpenPHACTS/master/org.bridgedb.utils/src/org/bridgedb/pairs/CodeMapper.java
Special cases are then declared in the DataSource RDF file https://github.com/bridgedb/BridgeDb/blob/OpenPHACTS/master/org.bridgedb.rdf/resources/DataSource.ttl (see ChEBI).
This way OPS could use all the URLs discussed in this issue, yet still return the same Xref as in the past.
Christian
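To make the pair-versus-Xref idea above concrete, here is a minimal sketch in Python. It is not the actual CodeMapper implementation: the function names and the prefix table are hypothetical, and the "Hac" system code for HGNC accession numbers is an assumption (only "Ce" for ChEBI is taken from the comment above).

```python
# Hypothetical sketch, NOT the real CodeMapper: store bare IDs, re-add the
# prefix only when a legacy-style Xref ID has to be returned.
# "Ce" is the ChEBI system code mentioned above; "Hac" for HGNC is assumed.
XREF_PREFIX = {"Ce": "CHEBI:", "Hac": "HGNC:"}


def to_pair(system_code, xref_id):
    """Strip the known prefix (case-insensitively) so only the bare ID is stored."""
    prefix = XREF_PREFIX.get(system_code, "")
    if prefix and xref_id.upper().startswith(prefix.upper()):
        xref_id = xref_id[len(prefix):]
    return (system_code, xref_id)


def to_xref_id(pair):
    """Re-attach the prefix so callers see the same Xref ID as in the past."""
    system_code, bare_id = pair
    return XREF_PREFIX.get(system_code, "") + bare_id
```

With this shape, an ID arriving with or without its prefix ends up as the same stored pair, and the prefixed form is only produced on the way out.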
From: Stian Soiland-Reyes [notifications@github.com] Sent: Wednesday, September 09, 2015 12:47 PM To: bridgedb/BridgeDb Subject: [BridgeDb] hgnc identifiers - support with and without HGNC: (#15)
In http://identifiers.org/hgnc/ we see Identifier pattern ^((HGNC|hgnc):)?\d{1,5}$ which is (somewhat) reflected in the HGNC Accession number entry https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.ttl#L2762
bridgeDB:hasRegexPattern "^(HGNC:)?\d{1,5}$" ;
This means that identifiers http://identifiers.org/hgnc/47710 and http://identifiers.org/hgnc/HGNC:47710 and http://identifiers.org/hgnc/hgnc:47710 are all valid - and indeed all resolve to RNU6-747P.
The IdentityMappingService is however unable to know these are the same thing, unless we move HGNC: out of the regular expression and add alternative URI prefixes. Currently this will be tracked as three identifiers, 47710, HGNC:47710 and hgnc:47710, in the same dataset.
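As a quick illustration of why the entry only "somewhat" reflects the identifiers.org pattern, the two regular expressions quoted above can be compared directly. This is a sketch using Python's `re` module; the normalization at the end is a hypothetical stand-in, not what the IdentityMappingService does.

```python
import re

# Pattern published by identifiers.org, as quoted above.
IDENTIFIERS_ORG = re.compile(r"^((HGNC|hgnc):)?\d{1,5}$")
# Pattern from the bridgeDB:hasRegexPattern line quoted above.
BRIDGEDB = re.compile(r"^(HGNC:)?\d{1,5}$")

ids = ["47710", "HGNC:47710", "hgnc:47710"]

accepted_by_identifiers_org = [i for i in ids if IDENTIFIERS_ORG.match(i)]
accepted_by_bridgedb = [i for i in ids if BRIDGEDB.match(i)]

# Without normalization, the accepted strings stay distinct identifiers.
distinct_raw = set(ids)
# Hypothetical normalization: keep only the part after the optional prefix.
distinct_normalized = {i.split(":")[-1] for i in ids}
```

The identifiers.org pattern accepts all three forms, while the BridgeDb pattern rejects the lowercase `hgnc:` variant; and unless the prefix is stripped, the accepted strings remain separate identifiers for the same record.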
HGNC itself consistently identifies a "HGNC ID" with the prefix, e.g. HGNC:47710 - which is in accordance with the [10 Simple rules for design, provision, and reuse of persistent identifiers for life science data](http://dx.doi.org/10.5281/zenodo.18003), rule 2, to use CURIEs.
In Open PHACTS, earlier linksets used the style http://identifiers.org/hgnc/47710 - however @JonathanMELIUS's latest [Ensembl-to-HGNC linkset](http://bridgedb.org/data/linksets/HomoSapiens/Ensembl_Hs_hgnc.direct.LS.ttl) uses the style http://identifiers.org/hgnc/HGNC:47710 which adds the CURIE to the alternative base - perhaps this is not ideal (and can probably be changed upstream) - anyway, as both patterns are accepted, the org.bridgedb.rdf entry should be updated to support both.
I think the URL http://identifiers.org/hgnc/HGNC:47710 is a mistake on the part of identifiers.org, especially as they also have http://identifiers.org/hgnc/47710,
which is another way they break their own rule that there should be a single URL for each item.
Christian
In the OPS BridgeDb branch we did not consider what was a correct URI, only what was a USED URI. We then did what was required to support these USED URIs. The only known URI pattern we did not support when I left the project was one where the ID was split into two parts within the URI, as we had no use
If the IMS had only wanted to support standard URIs, it would have been a lot easier to write, but it would have missed many URIs that users were using.
Christian
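A rough sketch of what "supporting the USED URIs" can look like: several URI prefixes are registered for one data source, and every matching URI is reduced to the same pair. The prefix table, the `"hgnc"` label and the function name are hypothetical and only illustrate the idea; this is not the IMS code.

```python
# Hypothetical table of URI prefixes seen in use. Longer prefixes come first
# so that ".../hgnc/HGNC:" is tried before the plain ".../hgnc/" prefix.
URI_PREFIXES = [
    ("http://identifiers.org/hgnc/HGNC:", "hgnc"),
    ("http://identifiers.org/hgnc/hgnc:", "hgnc"),
    ("http://identifiers.org/hgnc/", "hgnc"),
]


def uri_to_pair(uri):
    """Return (datasource, bare_id) for a known URI, or None if unsupported."""
    for prefix, datasource in URI_PREFIXES:
        if uri.startswith(prefix):
            return (datasource, uri[len(prefix):])
    return None
```

Under these assumptions the three HGNC URI styles from this issue all resolve to the same pair, so a mapping service would treat them as one identifier.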
@JonathanMELIUS, while I think @Christian-B is right that we need to support what has been used, I also agree that http://identifiers.org/hgnc/47710 without HGNC: is "more correct" - would you be OK to change your Ensembl linkset for this, or is the style of http://identifiers.org/hgnc/HGNC:47710 also used upstream in Ensembl-RDF?
Yes sure.
@JonathanMELIUS, so, do I understand correctly that you have it without the HGNC: in the current Ensembl Derby files and link sets?
In http://identifiers.org/hgnc/ we see Identifier pattern ^((HGNC|hgnc):)?\d{1,5}$ which is (somewhat) reflected in the HGNC Accession number entry https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.ttl#L2762
bridgeDB:hasRegexPattern "^(HGNC:)?\d{1,5}$" ;
This means that identifiers http://identifiers.org/hgnc/47710 and http://identifiers.org/hgnc/HGNC:47710 and http://identifiers.org/hgnc/hgnc:47710 are all valid - and indeed all resolve to RNU6-747P.
The IdentityMappingService is however unable to know these are the same thing, unless we move HGNC: out of the regular expression and add alternative URI prefixes. Currently this will be tracked as three identifiers, 47710, HGNC:47710 and hgnc:47710, in the same dataset.
HGNC itself consistently identifies a "HGNC ID" with the prefix, e.g. HGNC:47710 - which is in accordance with the 10 Simple rules for design, provision, and reuse of persistent identifiers for life science data, rule 2, to use CURIEs.
In Open PHACTS, earlier linksets used the style http://identifiers.org/hgnc/47710 - however @JonathanMELIUS's latest [Ensembl-to-HGNC linkset](http://bridgedb.org/data/linksets/HomoSapiens/Ensembl_Hs_hgnc.direct.LS.ttl) uses the style http://identifiers.org/hgnc/HGNC:47710 which adds the CURIE to the alternative base - perhaps this is not ideal (and can probably be changed upstream) - anyway, as both patterns are accepted, the org.bridgedb.rdf entry should be updated to support both.