Open amoeba opened 9 years ago
I've made a lot of good progress on this but could use (1) some more work to track down information on some of the remaining identifiers and (2) some input on my current set of recommendations.
See https://github.com/ec-geolink/design/blob/master/data/dataone/canonical-identifiers.md
My current recommended form for each identifier is preceded by the text 'Recommend:'
Thanks @amoeba!
Regarding issue #61, do you think it would be appropriate to create an IdentifierScheme class in the base ontology and generate instances for all those identifier schemes? That is, we would create a separate OWL file containing something like the following:
<http://schema.geolink.org/dev/voc/identifier/scheme#doi> rdf:type gl:IdentifierScheme .
<http://schema.geolink.org/dev/voc/identifier/scheme#doi> rdfs:seeAlso <http://doi.org> .
Remind me: since DataCite has already published URIs for these terms (e.g. http://purl.org/spar/datacite/isni, http://purl.org/spar/datacite/ark, http://purl.org/spar/datacite/doi), why can't we use these?
(Apologies if this has already been answered.)
@robertarko You responded just as I was typing this up:
Interesting idea. Could you maybe explain what doing that would add to our efforts? I think I like the idea.
The identifiers I'm researching the serializations for are all in the DataCite ontology as NamedIndividuals. For identifiers in our graphs that have schemes that already exist as NI's in DataCite, I would prefer to use DataCite. But I expect we will have identifiers not in DataCite and possibly not in another ontology so we will need to do something like you've suggested to accommodate them.
Right, I see your point. I was assuming the DataCite vocabulary is comprehensive.
Out of curiosity, what are some ID types you need that DataCite doesn't have?
I don't think there are any in the DataONE network. @mbjones might know otherwise, but I think identifiers in DataONE will always map directly to DataCite. Many of ours are DataCite local-resource-identifiers.
Okay. So maybe we can just adopt DataCite's vocabulary outright? (i.e. no need to create a new set of classes in schema.geolink.org.) Since DataCite has generic fallback options for ID types like "local-resource-identifier-scheme" and "url", that's probably everything we need.
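A minimal sketch of what adopting DataCite's vocabulary outright could look like on the producer side. The base URI pattern comes from the three DataCite URIs quoted earlier in this thread. Two things are assumptions here and have not been checked against the published ontology: that the fallback terms ("local-resource-identifier-scheme", "url") follow the same URI pattern, and the helper name `datacite_scheme_uri`, which is purely illustrative.

```python
# Base URI inferred from the DataCite URIs quoted in this thread.
DATACITE_BASE = "http://purl.org/spar/datacite/"

# Schemes named in this thread; assumed (not verified) to all share the
# base URI pattern above. Extend from the DataCite ontology as needed.
KNOWN_SCHEMES = {"doi", "ark", "isni", "url"}

# Generic fallback mentioned above for identifiers with no dedicated term.
FALLBACK_SCHEME = "local-resource-identifier-scheme"


def datacite_scheme_uri(scheme: str) -> str:
    """Return the DataCite URI for a scheme name, or the generic fallback."""
    key = scheme.strip().lower()
    if key in KNOWN_SCHEMES:
        return DATACITE_BASE + key
    return DATACITE_BASE + FALLBACK_SCHEME
```

With this shape, a scheme that DataCite does not name would still get a DataCite URI through the fallback, so no new classes would be needed in schema.geolink.org.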
I have yet to find one that we've needed in DataONE that isn't already in the DataCite vocabulary (especially since they support url and urn identifiers as types).
I didn't know the extent of DataCite vocabulary for identifier schemes, hence my earlier comment. If an appropriate one to use is available from DataCite already, I am also in favor of using it, instead of inventing our own URI.
@robertarko, are IMA identifier schemes covered by anything from DataCite other than the generic fallback options?
Referring to #61, hasIdentifierScheme is proposed to be changed to an object property. For this purpose, I suggest adding an IdentifierScheme class, which would be aligned to datacite:IdentifierScheme. The identifier scheme URIs like http://purl.org/spar/datacite/ark would be instances of this class.
@krisnadhi Sounds good to me.
@amoeba and I just discussed being careful about the definitions of our properties. For example, we should clarify that there are two use cases for hasIdentifierValue: one to get the machine-readable URI for the Identifier, and one to get the display form. The URI version of an identifier can and should be used as the LOD URI for the Identifier instance itself, except when an anonymous Identifier node is to be used. In that case, does the hasIdentifierValue property contain a literal showing the properly formatted syntax for displaying the identifier (e.g., "doi:10.xxxx/foo42") or the machine-readable URI for the identifier (e.g., http://doi.org/10.xxxx/foo42)? And, in the case of the latter, where does a client application find the display form for the Identifier? Maybe we need another property such as hasIdentifierDisplayValue.
Why can't we use the value pointed to by the hasIdentifierValue property for the display form of the identifier?
@krisnadhi We could, and that's how I originally thought of it. But, will everyone be using it that way? Would we be happy with the definition of the hasIdentifierValue property as "Provides a string literal value that represents the proper form of the identifier for display purposes."? So, for a DOI, this would be of the form "doi:10.xxxx/foo42".
Hi Bryce,
Could you also add 'GVP', 'SCAR', 'InterRidge', 'IMA' and 'IGSN' to the canonical-identifiers list?
GVP Smithsonian's Global Volcanism Program (GVP) announces new and permanent unique identifiers (Volcano Numbers, or VNums) for volcanoes documented in the Volcanoes of the World (VOTW) database maintained by GVP and accessible at www.volcano.si.edu.
Source:
http://volcano.si.edu/list_volcano_holocene.cfm
Examples:
GVP:210010 http://volcano.si.edu/volcano.cfm?vn=210010
SCAR The Scientific Committee on Antarctic Research (SCAR), through its recommendations, expresses the hope that the present effort will contribute to the adoption in Antarctica of the general principle of 'one name per feature' by all Antarctic place naming authorities.
Source:
https://www1.data.antarctica.gov.au/aadc/gaz/scar/information.cfm
Examples:
SCAR:883
Notes:
It does not publish URIs that speak RDF
InterRidge The InterRidge Global Database of Active Submarine Hydrothermal Vent Fields, hereafter referred to as the “InterRidge Vents Database,” provides a comprehensive list of active and inferred active (unconfirmed) submarine hydrothermal vent fields for use in academic research and education.
Source:
http://vents-data.interridge.org/about_the_database
Examples:
InterRidge:13-n-ridge-site http://vents-data.interridge.org/ventfield/13-n-ridge-site
Notes:
It has spoken RDF since version 3 and provides a SPARQL endpoint: http://vents-data.interridge.org/sparql
IMA The International Mineralogical Association (IMA) publishes a list containing names and data for minerals that have been approved, discredited, redefined, and renamed; it is the new revised master list of all IMA-approved and grandfathered (i.e. inherited from before 1960) minerals.
Source:
http://www.ima-mineralogy.org/Minlist.htm
Examples:
IMA:2014-028
Notes:
It does not publish URIs that speak RDF
IGSN IGSN stands for International Geo Sample Number. The IGSN is a 9-digit alphanumeric code that uniquely identifies samples from our natural environment and related sampling features. You can get an IGSN for your sample by registering it in the System for Earth Sample Registration (SESAR).
Source:
http://www.geosamples.org/igsnabout
Examples:
IGSN:HRV003M16
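To make the prefixed examples above concrete, here is a rough sketch of how a consumer might split a "SCHEME:value" string and, where this thread gives an example URL, build the resolvable form. The URL patterns are copied from the GVP and InterRidge examples only; SCAR, IMA and IGSN are left unmapped because no URL pattern is given for them here. The function names are hypothetical.

```python
from typing import Optional, Tuple

# URL patterns copied from the examples above; schemes without an example
# URL in this thread (SCAR, IMA, IGSN) are deliberately left unmapped.
URL_TEMPLATES = {
    "GVP": "http://volcano.si.edu/volcano.cfm?vn={value}",
    "InterRidge": "http://vents-data.interridge.org/ventfield/{value}",
}


def split_identifier(text: str) -> Tuple[str, str]:
    """Split 'SCHEME:value' at the first colon."""
    scheme, sep, value = text.partition(":")
    if not sep or not scheme or not value:
        raise ValueError(f"not a prefixed identifier: {text!r}")
    return scheme, value


def resolvable_url(text: str) -> Optional[str]:
    """Return the web-resolvable form, or None if no pattern is known."""
    scheme, value = split_identifier(text)
    template = URL_TEMPLATES.get(scheme)
    return template.format(value=value) if template else None
```

Returning None rather than guessing a URL keeps the schemes that do not publish resolvable URIs clearly separate from those that do.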
@sparkji I'll add those to the list today. Thanks for providing all that information too -- it helps a lot!
@krisnadhi We could, and that's how I originally thought of it. But, will everyone be using it that way? Would we be happy with the definition of the hasIdentifierValue property as "Provides a string literal value that represents the proper form of the identifier for display purposes."? So, for a DOI, this would be of the form "doi:10.xxxx/foo42".
Would that be the only purpose of the value pointed to by the hasIdentifierValue property? If that were the case, then it would be better to use rdfs:label and simply drop the hasIdentifierValue property. IMHO, hasIdentifierValue implicitly captures our intention that the value it points to is really the identifier value and hence, consumers can assume that typical characteristics of identifiers hold, e.g., uniqueness of the value in the context of the identifier scheme. Obviously, the same value can still be used for display purposes.
One way to avoid confusion regarding how to display the identifier value is to augment the corresponding instance of the Identifier class with an rdfs:label annotation whose literal value is copied from the value pointed to by the hasIdentifierValue property.
If we represent DOIs as doi:10.xxxx/foo, then will we follow that style consistently? i.e. ISNIs (for organizations) would be isni:xyz, ORCIDs (for persons) would be orcid:xyz, IGSNs (for samples) would be igsn:xyz, etc.?
PS. Coincidentally we're discussing similar issues in the EarthCube workshop this week.
One thing that worries me is how publishers will implement identifiers in journal articles. If they follow the DataCite approach (scheme and value), then they may implement business logic that always/automatically prepends the scheme to the value. So we'll end up with DOIs that look like
doi:doi:10.xxxx/foo42
If we represent DOIs as doi:10.xxxx/foo, then will we follow that style consistently? ie. ISNIs (for organizations) would be isni:xyz, ORCIDs (for persons) would be orcid:xyz, IGSNs (for samples) would be igsn:xyz ,etc ?
Actually, @amoeba's note already indicates that this style is not necessarily appropriate for some identifier schemes.
One thing that worries me, is how Publishers will implement identifiers in journal articles. If they follow the DataCite approach (scheme and value), then they may implement business logic that always/automatically prepends the scheme to the value. So we'll end up with DOIs that look like
doi:doi:10.xxxx/foo42
Is this business logic more on the data publishing side or the data consumption side? If it is on the publishing side, wouldn't it be reasonable to assume that data publishers would ensure their data are nicely formatted, e.g., they wouldn't publish a DOI literal with two doi prefixes? So we are okay as long as we have a set of recommended canonical forms that data publishers should use when publishing within the GeoLink framework. If it is more on the consumption side, then I think we shouldn't worry too much about how the business logic in the consuming application is implemented, as long as we use consistent styles when pushing out the data via the GeoLink public endpoint.
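For what it's worth, the doi:doi: case is cheap to guard against on either side. The sketch below is only an illustration of an idempotent prefixing step, not anything GeoLink or DataCite prescribes; the function name and the choice of a lowercase "scheme:" prefix are assumptions.

```python
def with_scheme_prefix(scheme: str, value: str) -> str:
    """Prepend 'scheme:' to a value exactly once, however it arrives."""
    prefix = scheme.lower() + ":"
    bare = value.strip()
    # Strip any number of existing prefixes, case-insensitively, so that
    # applying this function repeatedly never stacks prefixes.
    while bare.lower().startswith(prefix):
        bare = bare[len(prefix):]
    return prefix + bare
```

Because the function gives the same result whether or not the prefix is already present, a publisher could apply it blindly before output and a consumer could apply it blindly on input.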
I agree with @krisnadhi that consumers need to intelligently consume the identifiers because there are so many ways of representing things, and the recommended best practice for how to reference identifiers is a moving target. Plus, some groups like the DOI foundation make both a display recommendation (DOI:10.xxxx/foo) and a machine-readable link recommendation, and the latter has changed over time (e.g., http://dx.doi.org/10.xxxx/foo, http://doi.org/10.xxxx/foo). I think the issue here is that we need to know where these two types of information (display and link) will be recorded in glbase, and which is which. I'm not enamored of rdfs:label because it is used so loosely, and sometimes contains garbage text. I would prefer targeted properties for identifierDisplayForm and identifierLinkForm, regardless of the naming we end up with.
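As a rough illustration of keeping the two forms separate, the sketch below derives both from a bare DOI. The field names loosely mirror identifierDisplayForm and identifierLinkForm; the class and function names are hypothetical, and the resolver is a parameter precisely because the recommended base URL has shifted over time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DoiForms:
    display: str  # e.g. "doi:10.xxxx/foo42"
    link: str     # e.g. "http://doi.org/10.xxxx/foo42"


def doi_forms(bare_doi: str, resolver: str = "http://doi.org/") -> DoiForms:
    """Build both forms from a bare DOI such as '10.xxxx/foo42'."""
    return DoiForms(display="doi:" + bare_doi, link=resolver + bare_doi)
```

A client would then read the display form and the link form from separate, clearly named slots rather than guessing which one a single value holds.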
Just checking in on this issue as I don't think we've resolved it just yet.
From the comments, it looks like we need to decide whether we want the display form, machine-readable form, or a web-resolvable form (or some combination of the three) to be stored in our graphs and how we want to do that. We could use rdfs:label for the display form, and glview:hasIdentifierValue for the machine-readable form, but we might want to create a new property like glview:hasIdentifierDisplayForm as @mbjones suggested. I'm pretty stuck on what to recommend.
Identifiers can have (1) a value, (2) a display form, and/or (3) an HTTP resolvable form, with some of these forms being the same for some identifiers. What should we be storing in our graphs?
As per our 2015/11/18 telecon, I will complete a first draft of the identifier recommendations for the group to review. I'll have this done for the next telecon on 2015/12/2.
This issue stems from discussion on the Sep 2 2015 teleconference.
The literal representation of identifiers can come into our graphs in multiple forms, e.g.
We would like to have a canonical form to simplify lives for both producers and consumers.