ec-geolink / design

Design information about the EarthCube Geolink project.

Create list of identifiers and their canonical forms #51

Open amoeba opened 9 years ago

amoeba commented 9 years ago

This issue stems from discussion on the Sep 2 2015 teleconference.

The literal representation of identifiers can come into our graphs in multiple forms, e.g.

We would like to have a canonical form to simplify lives for both producers and consumers.

amoeba commented 9 years ago

I've made a lot of good progress on this but could use (1) some more work to track down information on some of the remaining identifiers and (2) some input on my current set of recommendations.

See https://github.com/ec-geolink/design/blob/master/data/dataone/canonical-identifiers.md

My current recommended form for each identifier is preceded by the text 'Recommend:'

krisnadhi commented 9 years ago

Thanks @amoeba!

Regarding issue #61, do you think it is appropriate if, in the base ontology, we create an IdentifierScheme class and generate instances for all those identifier schemes? So, we would create a separate OWL file containing something like the following:

<http://schema.geolink.org/dev/voc/identifier/scheme#doi> rdf:type gl:IdentifierScheme .
<http://schema.geolink.org/dev/voc/identifier/scheme#doi> rdfs:seeAlso <http://doi.org> .
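As a rough illustration of the proposal above, the scheme instances could be generated from a small table rather than written by hand. This is only a sketch: the helper names are hypothetical, only the doi entry comes from this thread, and the prefix declarations for gl:, rdf: and rdfs: are left out because the gl: namespace is not spelled out here.

```python
# Hypothetical sketch: emit Turtle statements for identifier scheme instances.
# SCHEME_BASE is taken from the example in the comment above.
SCHEME_BASE = "http://schema.geolink.org/dev/voc/identifier/scheme#"

# Scheme name -> rdfs:seeAlso target. Only "doi" appears in the thread;
# further entries would have to be added by whoever maintains the list.
SCHEMES = {"doi": "http://doi.org"}


def scheme_triples(name: str, see_also: str) -> list[str]:
    """Return the two Turtle statements that describe one scheme instance."""
    subject = f"<{SCHEME_BASE}{name}>"
    return [
        f"{subject} rdf:type gl:IdentifierScheme .",
        f"{subject} rdfs:seeAlso <{see_also}> .",
    ]


def scheme_document(schemes: dict[str, str]) -> str:
    """Join the statements for every scheme, sorted by scheme name."""
    lines: list[str] = []
    for name in sorted(schemes):
        lines.extend(scheme_triples(name, schemes[name]))
    return "\n".join(lines)


print(scheme_document(SCHEMES))
```

Keeping the schemes in one table would make it easy to swap the subject URIs for existing ones (for example DataCite's) if the group decides not to mint its own.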

bob-arko commented 9 years ago

Remind me .. since DataCite has already published URIs for these terms eg. http://purl.org/spar/datacite/isni http://purl.org/spar/datacite/ark http://purl.org/spar/datacite/doi : why can't we use these ?

(Apologies if this has already been answered.)

amoeba commented 9 years ago

@robertarko You responded just as I was typing this up:

Interesting idea. Could you maybe explain what doing that would add to our efforts? I think I like the idea.

The identifiers I'm researching the serializations for are all in the DataCite ontology as NamedIndividuals. For identifiers in our graphs that have schemes that already exist as NI's in DataCite, I would prefer to use DataCite. But I expect we will have identifiers not in DataCite and possibly not in another ontology so we will need to do something like you've suggested to accommodate them.

bob-arko commented 9 years ago

Right, I see your point. I was assuming the DataCite vocabulary is comprehensive.

Out of curiosity, what are some ID types you need that DataCite doesn't have?

amoeba commented 9 years ago

I don't think there are any in the DataONE network. @mbjones might know otherwise, but I think identifiers in DataONE will always map directly to DataCite. Many of ours are DataCite local-resource-identifiers.

bob-arko commented 9 years ago

Okay. So maybe we can just adopt DataCite's vocabulary outright (i.e., no need to create a new set of classes in schema.geolink.org)? Since DataCite has generic fallback options for ID types like "local-resource-identifier-scheme" and "url", that's probably everything we need.

mbjones commented 9 years ago

I have yet to find one that we've needed in DataONE that isn't already in the DataCite vocabulary (especially since they support url and urn identifiers as types).

krisnadhi commented 9 years ago

I didn't know the extent of DataCite vocabulary for identifier schemes, hence my earlier comment. If an appropriate one to use is available from DataCite already, I am also in favor of using it, instead of inventing our own URI.

@robertarko, are IMA identifier schemes covered by anything from DataCite other than the generic fallback options?

Referring to #61, hasIdentifierScheme is proposed to be changed to an object property. For this purpose, I suggest adding IdentifierScheme class, which would be aligned to datacite:IdentifierScheme. The identifier scheme URIs like http://purl.org/spar/datacite/ark would be an instance of this class.
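To make the object-property idea concrete, here is a minimal sketch of what an Identifier node might look like when its scheme points at a DataCite URI. The gl: prefix on Identifier, hasIdentifierScheme and hasIdentifierValue is an assumption, as are the function names; only the DataCite URI pattern and the property names come from this thread.

```python
# Hypothetical sketch: describe one Identifier node whose scheme is an
# object (a DataCite scheme URI) rather than a string literal.
DATACITE = "http://purl.org/spar/datacite/"


def turtle_literal(text: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def identifier_triples(node: str, scheme: str, value: str) -> list[str]:
    """Return Turtle statements for one Identifier.

    node   -- a URI, or a blank node label such as "_:id1"
    scheme -- a DataCite scheme name such as "doi" or "ark"
    value  -- the identifier value as it should appear in the graph
    """
    subject = node if node.startswith("_:") else f"<{node}>"
    return [
        f"{subject} rdf:type gl:Identifier .",
        f"{subject} gl:hasIdentifierScheme <{DATACITE}{scheme}> .",
        f"{subject} gl:hasIdentifierValue {turtle_literal(value)} .",
    ]


for line in identifier_triples("_:id1", "doi", "10.xxxx/foo42"):
    print(line)
```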

mbjones commented 9 years ago

@krisnadhi Sounds good to me.

@amoeba and I just discussed being careful about the definitions of our properties. For example, we should clarify that there are two use cases for hasIdentifierValue: one to get the machine-readable URI for the Identifier, and one to get the display form. The URI version of an identifier can and should be used as the LOD URI for the Identifier instance itself, except when an anonymous Identifier node is to be used. In that case, does the hasIdentifierValue property contain a literal showing the properly formatted syntax for displaying the identifier (e.g., "doi:10.xxxx/foo42") or the machine-readable URI for the identifier (e.g., http://doi.org/10.xxxx/foo42)? And, in the case of the latter, where does a client application find the display form for the Identifier? Maybe we need another property such as hasIdentifierDisplayValue.
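The two use cases can be pictured as two views of one underlying value. The sketch below is illustrative only: the DoiForms class and its attribute names are hypothetical, and it covers just the DOI example used in this comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DoiForms:
    """One DOI, with the display form and the link form derived from it."""

    bare: str  # the DOI name without any prefix, e.g. "10.xxxx/foo42"

    @property
    def display(self) -> str:
        # Form intended for showing to a person.
        return f"doi:{self.bare}"

    @property
    def link(self) -> str:
        # Machine-readable URI form.
        return f"http://doi.org/{self.bare}"

    def node(self, anonymous: bool = False) -> str:
        """Turtle subject for the Identifier instance.

        Uses the link form as the node URI unless an anonymous
        (blank) node is requested.
        """
        return "_:identifier" if anonymous else f"<{self.link}>"


doi = DoiForms("10.xxxx/foo42")
print(doi.display)
print(doi.link)
print(doi.node(anonymous=True))
```

When the node is anonymous, the link form no longer appears as the subject, which is exactly the case where a consumer needs to know which property carries which form.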

krisnadhi commented 9 years ago

Why can't we use the value pointed to by the hasIdentifierValue property for the display form of the identifier?

mbjones commented 9 years ago

@krisnadhi We could, and that's how I originally thought of it. But, will everyone be using it that way? Would we be happy with the definition of the hasIdentifierValue property as "Provides a string literal value that represents the proper form of the identifier for display purposes."? So, for a DOI, this would be of the form "doi:10.xxxx/foo42".

sparkji commented 9 years ago

Hi Bryce,

Could you also add 'GVP', 'SCAR', 'InterRidge', 'IMA' and 'IGSN' to the canonical-identifiers list?

GVP

Smithsonian's Global Volcanism Program (GVP) announces new and permanent unique identifiers (Volcano Numbers, or VNums) for volcanoes documented in the Volcanoes of the World (VOTW) database maintained by GVP and accessible at www.volcano.si.edu.

Source:

http://volcano.si.edu/list_volcano_holocene.cfm

Examples:

GVP:210010 http://volcano.si.edu/volcano.cfm?vn=210010

SCAR

The Scientific Committee on Antarctic Research (SCAR), through its recommendations, expresses the hope that the present effort will contribute to the adoption in Antarctica of the general principle of 'one name per feature' by all Antarctic place naming authorities.

Source:

https://www1.data.antarctica.gov.au/aadc/gaz/scar/information.cfm

Examples:

SCAR:883

Notes:

It does not publish URIs that speak RDF

InterRidge

The InterRidge Global Database of Active Submarine Hydrothermal Vent Fields, hereafter referred to as the “InterRidge Vents Database,” aims to provide a comprehensive list of active and inferred active (unconfirmed) submarine hydrothermal vent fields for use in academic research and education.

Source:

http://vents-data.interridge.org/about_the_database

Examples:

InterRidge:13-n-ridge-site http://vents-data.interridge.org/ventfield/13-n-ridge-site

Notes:

It speaks RDF from version 3, and provides a SPARQL endpoint: http://vents-data.interridge.org/sparql

IMA

The International Mineralogical Association (IMA) publishes a list containing names and data for minerals that have been approved, discredited, redefined and renamed; it is the new revised master list of all IMA-approved and grandfathered (i.e., inherited from before 1960) minerals.

Source:

http://www.ima-mineralogy.org/Minlist.htm

Examples:

IMA:2014-028

Notes:

It does not publish the URIs that speak RDF

IGSN

IGSN stands for International Geo Sample Number. The IGSN is a 9-digit alphanumeric code that uniquely identifies samples from our natural environment and related sampling features. You can get an IGSN for your sample by registering it in the System for Earth Sample Registration (SESAR).

Source:

http://www.geosamples.org/igsnabout

Examples:

IGSN:HRV003M16
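For the schemes above that come with a resolvable URL, the mapping from the compact form to the URL can be sketched as a lookup of URL templates. The two templates are inferred from the GVP and InterRidge examples in this comment; the function names are hypothetical, and schemes listed without a URL example (SCAR, IMA, IGSN) are deliberately left unresolved rather than guessed.

```python
from typing import Optional

# URL templates inferred from the example pairs given in this comment.
RESOLVERS = {
    "GVP": "http://volcano.si.edu/volcano.cfm?vn={value}",
    "InterRidge": "http://vents-data.interridge.org/ventfield/{value}",
}


def split_compact(identifier: str) -> tuple[str, str]:
    """Split 'SCHEME:value' at the first colon into (scheme, value)."""
    scheme, sep, value = identifier.partition(":")
    if not sep or not scheme or not value:
        raise ValueError(f"not in SCHEME:value form: {identifier!r}")
    return scheme, value


def resolve(identifier: str) -> Optional[str]:
    """Return a resolvable URL, or None when no template is known."""
    scheme, value = split_compact(identifier)
    template = RESOLVERS.get(scheme)
    return template.format(value=value) if template else None


for example in ("GVP:210010", "InterRidge:13-n-ridge-site", "SCAR:883"):
    print(example, "->", resolve(example))
```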

amoeba commented 9 years ago

@sparkji I'll add those to the list today. Thanks for providing all that information too -- it helps a lot!

krisnadhi commented 9 years ago

@krisnadhi We could, and that's how I originally thought of it. But, will everyone be using it that way? Would we be happy with the definition of the hasIdentifierValue property as "Provides a string literal value that represents the proper form of the identifier for display purposes."? So, for a DOI, this would be of the form "doi:10.xxxx/foo42".

Would that be the only purpose of the value pointed to by hasIdentifierValue property? If that were the case, then it would be better to use rdfs:label and simply drop hasIdentifierValue property. IMHO, hasIdentifierValue implicitly captures our intention that the value it points to is really the identifier value and hence, consumers can assume that typical characteristics of identifiers hold, e.g., uniqueness of the value in the context of the identifier scheme. Obviously, the same value can still be used for display purposes.

One way to avoid confusion regarding how to display the identifier value is to augment the corresponding instance of Identifier class with an rdfs:label annotation whereby the label literal value is copied from the value pointed to by the hasIdentifierValue property.

bob-arko commented 9 years ago

If we represent DOIs as doi:10.xxxx/foo, then will we follow that style consistently? I.e., ISNIs (for organizations) would be isni:xyz, ORCIDs (for persons) would be orcid:xyz, IGSNs (for samples) would be igsn:xyz, etc.?

bob-arko commented 9 years ago

PS. Coincidentally we're discussing similar issues in the EarthCube workshop this week.

One thing that worries me is how publishers will implement identifiers in journal articles. If they follow the DataCite approach (scheme and value), then they may implement business logic that always/automatically prepends the scheme to the value. So we'll end up with DOIs that look like

doi:doi:10.xxxx/foo42
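One defensive option, sketched here purely as an illustration, is to make the prepending step idempotent so that a value which already carries the scheme is not prefixed again. The function name and the choice of a lowercase prefix are assumptions, not something the group has agreed on.

```python
def with_single_prefix(value: str, scheme: str) -> str:
    """Return 'scheme:value' with exactly one scheme prefix.

    Any number of leading 'scheme:' prefixes already on the value
    (compared case-insensitively) is removed before one is added.
    """
    prefix = scheme.lower() + ":"
    rest = value.strip()
    while rest.lower().startswith(prefix):
        rest = rest[len(prefix):]
    return prefix + rest


for raw in ("10.xxxx/foo42", "doi:10.xxxx/foo42", "doi:doi:10.xxxx/foo42"):
    print(raw, "->", with_single_prefix(raw, "doi"))
```

All three inputs in the loop come out as doi:10.xxxx/foo42, so applying the function twice is harmless.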

krisnadhi commented 9 years ago

If we represent DOIs as doi:10.xxxx/foo, then will we follow that style consistently? I.e., ISNIs (for organizations) would be isni:xyz, ORCIDs (for persons) would be orcid:xyz, IGSNs (for samples) would be igsn:xyz, etc.?

Actually, @amoeba's note already indicates that this style is not necessarily appropriate for some identifier schemes.

One thing that worries me is how publishers will implement identifiers in journal articles. If they follow the DataCite approach (scheme and value), then they may implement business logic that always/automatically prepends the scheme to the value. So we'll end up with DOIs that look like

doi:doi:10.xxxx/foo42

Is this business logic more on the data publishing side or the data consumption side? If this is about the data publishing side, wouldn't it be a reasonable assumption that data publishers would ensure that their data are nicely formatted, e.g., they wouldn't publish a DOI literal that has two doi prefixes? So, we are okay as long as we have a set of recommended canonical forms that data publishers should use when publishing within the GeoLink framework. If this is more about the data consumption side, then I think we shouldn't worry too much about how the business logic in the data-consuming application is implemented, as long as we use consistent styles when pushing out the data via the GeoLink public endpoint.

mbjones commented 9 years ago

I agree with @krisnadhi that consumers need to intelligently consume the identifiers because there are so many ways of representing things, and the recommended best practice for how to reference identifiers is a moving target. Plus, some groups like the DOI foundation make both a display recommendation (DOI:10.xxxx/foo) and a machine-readable link recommendation that has changed over time (e.g., http://dx.doi.org/10.xxxx/foo, then http://doi.org/10.xxxx/foo). I think the issue here is that we need to know where these two types of information (display and link) will be recorded in glbase, and which is which. I'm not enamored of rdfs:label because it is used so loosely, and sometimes contains garbage text. I would prefer targeted properties for identifierDisplayForm and identifierLinkForm, regardless of the naming we end up with.
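On the consuming side, "intelligently consume" could mean reducing each of the forms mentioned above to one bare value before comparing identifiers. The sketch below handles only DOIs and only the representations that appear in this thread; the function name is hypothetical, and the check that a DOI name starts with "10." is a simple sanity test, not full validation.

```python
import re
from typing import Optional

# Prefixes seen in this thread: "doi:" / "DOI:" and the two resolver hosts.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def bare_doi(text: str) -> Optional[str]:
    """Return the DOI name without display or link prefix, or None.

    Accepts 'doi:10.x/y', 'DOI:10.x/y', 'http://dx.doi.org/10.x/y',
    'http://doi.org/10.x/y' and an already bare '10.x/y'.
    """
    candidate = _DOI_PREFIX.sub("", text.strip(), count=1)
    # DOI names begin with the directory indicator "10."
    return candidate if candidate.startswith("10.") else None


for form in (
    "DOI:10.xxxx/foo",
    "http://dx.doi.org/10.xxxx/foo",
    "http://doi.org/10.xxxx/foo",
    "SCAR:883",
):
    print(form, "->", bare_doi(form))
```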

amoeba commented 8 years ago

Just checking in on this issue as I don't think we've resolved it just yet.

From the comments, it looks like we need to decide whether we want the display form, the machine-readable form, or a web-resolvable form (or some combination of the three) to be stored in our graphs, and how we want to do that. We could use rdfs:label for the display form and glview:hasIdentifierValue for the machine-readable form, but we might want to create a new property like glview:hasIdentifierDisplayForm, as @mbjones suggested. I'm pretty stuck on what to recommend.

Identifiers can have (1) a value, (2) a display form, and/or (3) an HTTP resolvable form, with some of these forms being the same for some identifiers. What should we be storing in our graphs?

amoeba commented 8 years ago

As per our 2015/11/18 telecon, I will complete a first draft of the identifier recommendations for the group to review. I'll have this done for the next telecon on 2015/12/2.