cf-convention / cf-conventions

AsciiDoc Source
http://cfconventions.org/cf-conventions/cf-conventions
Creative Commons Zero v1.0 Universal

Add attribute citation_id #160

Closed castelao closed 1 year ago

castelao commented 5 years ago

THIS IS OUTDATED. I'm editing this proposal to reflect the discussions so far, but I'll save a copy of this original proposal.

Title: DOI attribute

Moderator: to be defined

Requirement Summary: Optional DOI attribute in section Description of file contents (2.6.2).

Technical Proposal Summary: Add a new optional attribute to designate the Digital Object Identifier (DOI) of the data contained in the CF data object.

Benefits: DOIs allow easy automation for tracking the scientific impact of the data on the exact same fashion that scientific publications are tracked with DOIs. Anyone involved in the resulted data can be recognized, including funding agencies.

Status Quo: An increasing number of scientific journals start to require a DOI for the dataset used in the publication. Many groups already include DOI as an attribute in its NetCDF-CF datasets but without a standard, thus hard to automate.

Detailed Proposal: The only modification required would be in section 2.6.2: Description of file contents. In the bottom, after item comment, it would be added:

doi: Digital object identifier (DOI) of the dataset. For simplicity, the proxy part
       of the DOI is dropped, so it is composed by the suffix plus the prefix only,
       e.g. “10.21238/S8SPRAY1618”.

As mentioned in the 2.6.2 section, all attributes are optional, and the doi would follow the same rule.
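
For machine reading, a tool could first check that the attribute value has the general shape of a bare DOI (prefix, slash, suffix) before using it. The following is only a minimal sketch: the function name `looks_like_bare_doi` and the regular expression are assumptions made for illustration and are not part of the proposal. The pattern checks the `10.<registrant>/<suffix>` shape only; it does not prove that the DOI is registered.

```python
import re

# Assumed pattern: "10." + registrant code (digits, optional dotted
# sub-parts) + "/" + a non-empty suffix without whitespace.
_BARE_DOI = re.compile(r"^10\.\d{4,9}(\.\d+)*/\S+$")


def looks_like_bare_doi(value):
    """Return True if value looks like a DOI without the proxy part."""
    return bool(_BARE_DOI.match(value.strip()))


# The example value from the proposal, then the same DOI with the proxy part.
print(looks_like_bare_doi("10.21238/S8SPRAY1618"))
print(looks_like_bare_doi("https://doi.org/10.21238/S8SPRAY1618"))
```

A stricter check would have to resolve the DOI, which requires network access and fails for reserved DOIs that are not yet published.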

This proposal was developed with the help of @kenkehoe

Reasoning:

DOI is a de facto standard for tracking academic publications, thus providing the foundation for some measurement of scientific impact. There is a clear intention by the scientific community to also track the scientific impact of data and software, thus giving proper credit to those who make them available. The strategy adopted by AMS journals, and more recently by AGU, was to require citation of the dataset DOI used in any publication in the references list (https://www.ametsoc.org/ams/index.cfm/publications/authors/journal-and-bams-authors/formatting-and-manuscript-components/references/dataset-references/). The use of DOIs for datasets will increase. A few groups already include the dataset DOI in their NetCDF-CF data files, but without a standard, it is hard for a machine to keep track of that.

Justification:

Tiny background on DOI:


Example

// global attributes:
    :Conventions = "CF-1.7, ACDD-1.3" ;
    :title = "California Underwater Glider Network" ;
    :featureType = "trajectoryProfile" ;
    :id = "CUGN_90" ;
    :standard_name_vocabulary = "CF Standard Name Table v62" ;
    :doi = "10.21238/S8SPRAY1618" ;

taylor13 commented 5 years ago

Is it really o.k. to assign a doi to a document or a file (or dataset) and subsequently change the metadata? If it's not, I think we need to make that clear, and it would seem impossible to include the doi in a file to which it was assigned because that would involve altering the file.

castelao commented 5 years ago

Just to be clear, here I'll refer to metadata as the one in the doi database, which is not explicitly shown in the data file.

Yes, it is ok to modify the metadata associated with a doi later. An easy example is to include a new description in another language, or to add a new reference that cites the dataset in question.

A doi can be used to identify one very specific version of a dataset, maybe for reproducibility purposes, but that is not the only possible use. A doi can also be assigned to all data collected from one project or a group of numerical simulations, even if that dataset is split into multiple files. Note that a doi is not a checksum. If one finds a typo in the creator email, in the file or in the doi database, I don't see that as a reason to generate a new doi to reflect that change.

jhausman commented 5 years ago

Having DOI for the dataset is great. It provides an easier way to capture metrics of the data usage, provided that the data producer or repository has a way of minting and registering DOIs. At NASA ESDIS they register the DOI, but do not create it. If we work with the provider early enough we can "hold" a DOI for them so they can populate the field before they produce the whole dataset. As long as you don't make the field required it should be added in. I would also suggest that a citation field be added as well, so a user can see how the data should be cited.

martinjuckes commented 5 years ago

Zenodo.org also allows you to reserve a DOI before uploading your data, but this may not always be possible. At the CEDA archive, for instance, we like to verify the data before issuing a DOI: it is really up to the publisher.

Would you consider an alternative, slightly more flexible approach such as:

:resource_identifier = "doi:10.21238/S8SPRAY1618" ;

This would also support other forms of permanent identifiers. CMIP, for instance, is using the Handle System (which is also the system used to maintain the DOIs, but has wider use). For massive numbers of files used by CMIP, DOIs for each file are not appropriate, but we do have a closely related identifier in each file. At the moment this uses a global attribute defined by CMIP .. it would be nice to have it brought into the CF standard (for use in CMIP7). The CMIP version would then be something like:

:resource_identifier = "hdl:21.14100/d9a7225a-49c3-4470-b7ab-a8180926f839" ;

(to resolve this, paste it into the text box at http://proxy.handle.net/ ).

PS: .. or we could use tracking_id instead of resource_identifier, which would make this nicely compatible with what is being done in the CMIP6 archive.

TobiasWeigel commented 5 years ago

I second Martin's suggestion to broaden the scope a bit and allow both "doi:" and "hdl:" as prefixes, and also think it would be very wise in the long term to name the field ":resource_identifier" instead of just ":doi" to keep flexibility.

There would be little technical drawback in doing this: The DOI System is technically based on the Handle System, so both are compatible. Putting a "doi:" or "hdl:" in front of the string will not cause problems with resolution of either, as both DOI and Handle resolvers understand these well. But recording the difference can make it easier to implement tools that need to treat these cases differently.
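
As an illustration of how a tool might use the prefix, the sketch below splits a prefixed identifier and builds a URL for the matching public resolver. The helper names `split_identifier` and `resolver_url` and the `RESOLVERS` table are hypothetical. The sketch assumes the public proxies at https://doi.org/ and https://hdl.handle.net/, and it does not percent-encode special characters in the identifier.

```python
# Assumed resolver base URLs for the two namespaces discussed in this thread.
RESOLVERS = {
    "doi": "https://doi.org/",
    "hdl": "https://hdl.handle.net/",
}


def split_identifier(value):
    """Split 'doi:10.x/y' into ('doi', '10.x/y'); the namespace is lower-cased."""
    namespace, sep, name = value.strip().partition(":")
    if not sep or not namespace or not name:
        raise ValueError("expected '<namespace>:<id>', got %r" % value)
    return namespace.lower(), name


def resolver_url(value):
    """Return a resolver URL for the identifier, or None for an unknown namespace."""
    namespace, name = split_identifier(value)
    base = RESOLVERS.get(namespace)
    return base + name if base else None


print(resolver_url("doi:10.21238/S8SPRAY1618"))
print(resolver_url("hdl:21.14100/d9a7225a-49c3-4470-b7ab-a8180926f839"))
```

Keeping the namespace explicit means a tool can dispatch on it directly instead of guessing the scheme from the shape of the string.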

martinjuckes commented 5 years ago

There is, for the DOI, a question as to whether the DOI should be verifiable. This is a problem if you want to use reserved DOIs: the CF checker would not be able to validate a reserved DOI until after the file is published and the DOI is finally released. This creates a validation loophole which would not be serious if you are dealing with one or two files, but if you are processing hundreds, let alone millions as in CMIP6, this would not be acceptable.

This could be avoided by using a collection DOI, as described by @castelao.

If you want a file to include a resolvable string that references the file itself, the Handle is really a much better approach than the DOI, because you can build a robust system supporting validation before publication (whether people actually implement that is another question, but I believe the standard should at least support validation).

@castelao : would your use-case be supported if the use of a DOI was recommended to be only for collection DOIs which can be validated before the file is published?

castelao commented 5 years ago

Thanks @jhausman! I agree with the importance of a citation instruction. Citing data is a new thing, and there is much confusion on the best way to do that. At the moment I only suggest how to cite the Spray data (my work) in the landing page, but you're right, I should include it somehow in the netCDF.

I would be more inclined to have that information added in one of the text fields like summary or comment, or maybe in the field references if it is clearly stated that it is the reference for the dataset itself. The main reason that I'm suggesting a field for the doi is to make it easier for machine reading, while the 'how to cite' would certainly be for human reading, and a human could easily understand the embedded text. If the idea is to automate the actual citation text, doi.org has an API that returns it in different standards. In Python I use that like:

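    import requests

    doi = "10.21238/S8SPRAY1618"  # example DOI from the proposal above
    # Ask doi.org for a formatted citation (APA style) via the Accept header
    headers = {
            'Accept': 'text/x-bibliography; style=apa',
            }
    response = requests.get('https://doi.org/{0}'.format(doi),
                                timeout=2,
                                headers=headers)
    print(response.text)  # the citation text returned by the service

castelao commented 5 years ago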

Thanks @martinjuckes , I wasn't aware of that standard for CMIP7. I like very much the idea of generalizing it. In that case, would it make sense for a single file to carry both a doi and an hdl at the same time? If so, should the resource_identifier allow a list of identifiers?

About the checker, I think the solution is to have different levels of alerts for the checker. The production level checker would require a valid doi and/or hdl, while a development level would only create a warning if it couldn't resolve that. I've been using only a collection level DOI, but I think we should not restrict others to that.

Yes @TobiasWeigel, I agree. In this case we should include the "doi:" or "hdl:".

@jhausman, does NASA use hdl? If so, how do you include them in the files?

martinjuckes commented 5 years ago

Allowing a list of identifiers looks like a good idea to me. There may be cases where people wish to record a collection level identifier and a file level identifier.

I'm not keen on the idea of different outputs from the checker for different stages of production: I feel that this would be difficult to implement without getting into a discussion of the many different workflows which may be used to generate datasets with embedded identifiers, which could be a rather open-ended discussion. It may be better to point out that users may need to filter the checker output if they are running it on data with unpublished identifiers .. the nature of the filter would depend on their workflow.

castelao commented 5 years ago

I think that CMIP7 is argument enough to change to resource_identifier, but I would like to wait a few days in case someone has a good argument against that or any other idea.

@kenkehoe, what is your opinion about using resource_identifier instead?

JimBiardCics commented 5 years ago

I think the checker shouldn't be expected to verify the contents of a resource_identifier attribute. That is a significant increase in scope for the checker.

JimBiardCics commented 5 years ago

The proposed resource_identifier attribute is, in fact, functionally identical to the ACDD id attribute. I think we should leave it to ACDD to manage, rather than add it to CF. The same goes for cite metadata. If you feel that there is a need to add to, expand on, or improve the existing ACDD attributes, you should take it up with ESIP.

castelao commented 5 years ago

Thanks for your input @JimBiardCics, but that is not correct. While ACDD defined the id as "An identifier for the data set, provided by and unique within its naming authority.", I wrote above:

Although other standards allow the use of DOI, for example, the id attribute recommended by ACDD, it conflicts with possible uses of DOIs. For instance, while id is stated to be unique, the same dataset DOI could be used in multiple files with chunks of the dataset assigned by the DOI.

I don't think that the attribute id should be changed, there is value for that as it is.

I would rather use something that already exists, but I can't find any adequate one to assign DOIs.

graybeal commented 5 years ago

I agree that the id attribute is written in a way that precludes the use case of "using the same dataset DOI in multiple files". However, I believe Jim is correct when he says the proposed resource_identifier attribute is identical to the id attribute of ACDD.

The difficulty is that the text defining the doi, "Digital object identifier (DOI) of the dataset." is easily misinterpreted to mean "the DOI of the entire dataset represented by this netCDF file." In which case many would expect the DOI to be unique to this dataset. (Though I don't know that there is any requirement that a DOI has a 1-to-1 relationship with a dataset, maybe it's OK to mint as many as you want for a particular dataset?)

(Guilherme, was this also proposed elsewhere? I was thinking I responded to it previously, but can't find that.)

Originally I also wanted a more general solution (sometimes other things are used for identifiers, or even citations), but I think ACDD's 'id' is that more general solution. If the DOI is also the id, it can and should appear in both attributes. Given that, 'doi' is exactly the right name for this attribute.

While I have minor issues with some of the justification and background arguments, overall I agree with the principle that it is helpful to have a specific place where the doi can be found. The two things I'd like to see changed: 1) Make clear in the attribute's description that this attribute is describing the DOI citation for the data in the netCDF file, and may also appear in other netCDF files if these data are part of a larger file or collection. 2) Make clear in this proposal and the associated description that netCDF files are meant to be self-describing, and that the DOI-associated metadata must not be a substitute for the metadata in the netCDF file itself. If that is understood, there can be plenty of room for engineering judgment about extending and modifying DOI-associated metadata vs matching the file-contained metadata. I don't think we should engage in the unresolvable debate of whether or when it is OK to have the DOI-associated metadata diverge from what's in the file.

castelao commented 5 years ago

@graybeal, those are all good points, thank you very much!

I'm not familiar with the hdl identifier, and couldn't find much documentation on it. If hdl is required to be unique in the sense that it can be used only once, to identify a single file, it would be a legitimate use for ACDD:id, and in that case I would be wrong in suggesting anything different for hdl than using the ACDD:id. I don't know hdl.

I'm not aware of any other proposal for an attribute to address the DOI of the dataset that would allow convenient machine reading.

In theory, one could mint several DOIs for the same object (N->1), but I can think of only a couple of situations that would justify it. More common is the possibility of one single DOI used for a collection of datasets (1->N), or subsets, which violates the uniqueness required by the ACDD:id.

I strongly agree with your last two points, and I'll modify my proposal to include/clarify those issues. Based on #150 I guess I should edit my very first post?! If you have time, I would be interested in learning what your minor issues with some of the justification and background are, so I could also address them in my proposal.

JimBiardCics commented 5 years ago

@castelao I've got to disagree with your interpretation of the uniqueness concept in the ACDD id attribute. I also believe that naming an attribute doi is overly limiting.

Regarding the uniqueness of the ACDD id, it seems to me that you are interpreting a dataset as being equal to a file. I don't think that is a valid assumption, and I know that the ACDD id attribute is being used with the assumption that a dataset is composed of many files.

Regarding the name, there are numerous competing schemes for providing persistent identifiers, doi being just one of them. There is an existing, widely used scheme for differentiating them: prefixing with <namespace>:, where <namespace> is "doi", "ark", "purl", "urn", etc., as @TobiasWeigel pointed out, so it seems to me that it is overly limiting to create a specific doi attribute. It is not difficult for humans or machines to scan an attribute for a string like "doi:10.2345/A5B3038" and know that it is a doi.

Again, I don't think CF is the right place for this attribute. I think we should leave this sort of metadata to ACDD. I'm sure the ESIP ACDD people would be happy to update or modify the id attribute or come up with a new attribute as needed. They are more knowledgeable, as a group, about the persistent identifier topic than we are.

graybeal commented 5 years ago

Jim, I think these are good points, I think long ago I wrote the same points but didn't post them. I landed at a slightly different answer by assuming that the intent was to provide a limited service, that is, a service just for DOI entries, and that specifying how that should be done was a useful proposition.

While I agree that ACDD ID explicitly must be unique for every file, I do not think the intent is to make the doi attribute unique in the same way. (See @castelao's previous post and a narrow reading of the original description.)

I think someone who wants to can use any of those ID types, including DOIs, in the ACDD ID—but they have to be unique to each file.

The thing that arguably makes DOI deserving of its own field is that it is explicitly a citation mechanism, which is different than an identification mechanism. I'm not in love with DOIs as a perfect citation mechanism, and it is not the only citation mechanism (IRIs are accepted in many journals), but I think the science community has spoken to the value of DOIs, through their wide adoption. (And most of my long-ago objections to DOIs have been addressed.)

Should it be in CF? To be clear, I do not want to modify the ACDD id attribute, it seems exactly correct as specified. Adding this as a new attribute to ACDD would be fine in my view. But I think the ACDD is less widely used than CF and less well-known than CF, and so from that standpoint there's a benefit in including it in CF.

Of course, the id should be in CF too, even more so. It's a bit of a puzzle to me why it is not; I think that is a significant weakness in CF. (I could say it is not FAIR, but I can see the hackles going up all over the community at the F-word....) I would go so far as to say I'm uncomfortable putting the doi in CF, without putting the id in CF. Doing so would confuse the purpose of the doi attribute.

John



JimBiardCics commented 5 years ago

@castelao @graybeal It seems to me that CF is focused much more on usage metadata than discovery metadata, whereas ACDD is focused much more on discovery. As such, I think it is a much better fit, even if it is less used. CF could direct people to ACDD rather than duplicate the effort.

If the majority view is that we need an attribute like this in CF, then at the very minimum, let's not call the attribute doi. I see no benefit in having an attribute that is so narrow that someone using a different persistent identifier will find it necessary to ask for yet another attribute. We could call it pid or persistent_id. We could follow the <namespace>:<id> convention and allow multiple elements to be stored in the attribute, either as space-separated or comma-separated elements in an attribute string or as separate elements of a string attribute array.

graybeal commented 5 years ago

@castelao Here are a few minor points.

Technical Proposal Summary: Add a new optional attribute to designate the Digital Object Identifier (DOI) of the data contained in the CF data object.

This implies there can only be one DOI for the data in the data object. But imagine 3 layers of nested data -- at each layer a different DOI may apply, so the lowest-level data has 3 DOIs, in some sense. I might say instead 'designate a Digital Object Identifier (DOI) for the CF data object'. (I think the data object is a less ambiguous concept than the 'data' or 'data set'?) If there can be multiple DOIs, as suggested at the very end of the post, that needs to be part of the description, e.g., 'designate one or more Digital Object Identifiers...'.

I think you should also explicitly declare the purpose of the DOI, namely it is for citation. Which suggests it should be unique (only point to one file, or else how do you know what data exactly are being cited, and which or how many data sets have that DOI?), but use cases may vary.

Benefits: DOIs allow easy automation for tracking the scientific impact of the data on the exact same fashion that scientific publications are tracked with DOIs. Anyone involved in the resulted data can be recognized, including funding agencies.

'on the exact' => 'in the exact'

Status Quo: An increasing number of scientific journals start to require a DOI for the dataset used in the publication. Many groups already include DOI as an attribute in its NetCDF-CF datasets but without a standard, thus hard to automate.

delete 'start to', make 'in its' => 'in their'

doi: Digital object identifier (DOI) of the dataset. For simplicity, the proxy part of the DOI is dropped, so it is composed by the suffix plus the prefix only, e.g. “10.21238/S8SPRAY1618”.

The proxy part should not be dropped. I know it is not explicitly required for tracking, but however the original DOI is created and distributed should be supported for this attribute. It communicates information to the reviewer, and it allows exact comparison with all the other uses of the exact same DOI (many of which will be syntactical comparisons of the entire string, is my bet).

The strategy adopted by AMS journals, and more recently by AGU, was to require citation of the dataset DOI used in any publication in the references list (https://www.ametsoc.org/ams/index.cfm/publications/authors/journal-and-bams-authors/formatting-and-manuscript-components/references/dataset-references/).

The cited policy does not require just DOIs, it explicitly says 'doi or URL' (and see the examples at the end of the document, that include a URL).

Whenever possible, datasets should be referenced directly via a listing in the references in the following style: Dataset authors/producers, data release year: Dataset title, version. Data archive/distributor, access date (DD Month YYYY), data locator/identifier (doi or URL).

With this in mind, you may wish to alter your request to make it "Add citation attribute", and allow the content to be either DOI or IRI (which can be easily distinguished by people and computers, especially if they use the full string instead of clipping off the doi:// part).

Although other standards allow the use of DOI, for example the id attribute recommended by ACDD, it conflicts with possible uses of DOIs. For instance, while id is stated to be unique, the same dataset DOI could be used in multiple files with chunks of the dataset assigned by the DOI.

'chunks of the dataset assigned by the DOI' is a bit confusing, I think you mean here 'different chunks of the dataset assigned the same DOI'? Which is redundant but clearer.

It is not being argued that DOI is the best option to link data nor it is a perfect solution, but this is the current standard to scientific impact tracking for articles, and this will not change soon.

Again, not quite the standard and not the only practice.

attribute name would be “doi”; multiple DOIs are allowed with space delimiter separation in a character array

If multiple DOIs are allowed, this should be specified throughout the documentation.

graybeal commented 5 years ago

@JimBiardCics I'm good with the principle. But since it isn't a unique ID, I would strongly encourage a different name. (In fact the relationship is many-to-many, in the request: One ID can attach to many data sets, and many IDs can attach to one data set.)

He already has separation by spaces in character arrays. Since a few weird identifiers can have commas (ick), let's stick with the space approach.
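
To make the space-separated approach concrete, here is a minimal sketch of how a reader might split such an attribute string into (namespace, id) pairs. The function name `parse_identifier_list` is hypothetical, and the sketch assumes every element carries a `<namespace>:` prefix and contains no whitespace, as suggested in this thread.

```python
def parse_identifier_list(attribute_value):
    """Split a space-separated attribute string into (namespace, id) pairs."""
    pairs = []
    for token in attribute_value.split():
        namespace, sep, name = token.partition(":")
        if not sep or not namespace or not name:
            raise ValueError("token %r is not '<namespace>:<id>'" % token)
        pairs.append((namespace.lower(), name))
    return pairs


# Example value combining the two identifiers used earlier in the thread.
value = "doi:10.21238/S8SPRAY1618 hdl:21.14100/d9a7225a-49c3-4470-b7ab-a8180926f839"
for namespace, name in parse_identifier_list(value):
    print(namespace, name)
```

Because the split is on whitespace only, an identifier that contains a comma passes through unchanged, which is the reason given above for preferring spaces.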

TobiasWeigel commented 5 years ago

Persistently identifying each digital object has benefits independent from the citation case. In CMIP6, each file has received an individual hdl: identifier, and this helps with tracking file versions, replication, and makes it possible to slice the whole CMIP6 data space differently for different purposes using collections. These are important benefits at the cyberinfrastructure level that also provide indirect benefits for users through new services. Citation is a specific use case for persistent identification, and there are implications from assigning DOIs such as having citation metadata and ensuring persistence of the objects themselves that would not fit the CMIP6 file identification case.

An underlying question is that of object granularity. In some cases, it may be totally fine to have a doi at the file level (identifying only that single file), but this is at least not the case for CMIP. In CMIP6 we now have the precedent of constructing collections of files as datasets, and those two levels each bear their own identifiers, and the identifiers are linked via metadata to describe the collection structure. This way, we can work with identifiers for (possibly not finalized, citable) objects, but can also assign DOIs at the level of granularity that is most appropriate for citation.

In consequence, I would again argue that we should: 1) give the field a more general name than 'doi', e.g., "persistent_identifier", "pid" or "tracking_id"; 2) mandate the use of identifier namespace prefixing (doi:, hdl:, ark:, urn: etc.); 3) solve the hierarchy issues by building collections, and not fix the citation granularity at the file level.

A narrower solution bears the risk that it would only apply to some cases where CF is relevant and this would certainly not be wise in the long term.

JimBiardCics commented 5 years ago

I heartily agree with @TobiasWeigel. If we are going to have this attribute in CF, let's take this route. It might be good to make the attribute name plural to reflect the possibility of multiple elements.

martinjuckes commented 5 years ago

There does appear to be some overlap with the CMIP6 Handle use case and the ACDD id specification, but I think they are distinct as it stands. From reading the ACDD web-pages it would not surprise me that id is being used as an identifier for datasets containing multiple files, though I think the intention is for it to identify a file. There is a distinction, however, in that ACDD is designed around identifiers of the form <naming authority>:<code>, where <code> is a unique identifier within the namespace managed by <naming_authority>. For DOI and handle.net the structure is, I believe, <namespace>:<authority>/<identifier>. The distinction is that DOI and handle.net provide a governance structure within which multiple institutions can act as naming authorities issuing identifiers.

It is clear that the ACDD id is not, in general, a resolvable identifier ("edu.ucar.unidata" is given as an example of a naming authority). DOI and handle.net identifiers support well defined APIs which provide access to additional structured metadata. I feel that there is a good case for a new global attribute which is specifically for resolvable identifiers ... perhaps restricting it to those that provide some guarantee of persistent structured metadata. I.e. more general than DOI, but still restricted to identifiers that can be queried by a simple bit of python code.
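
As a rough idea of what "a simple bit of python code" could look like for Handles, the sketch below queries a Handle record as JSON. It rests on assumptions: that the Handle proxy exposes a REST interface at `https://hdl.handle.net/api/handles/<handle>`, and that the response holds a `values` list whose entries have a `type` and a nested `data.value`. The function names and the sample record are made up for illustration; the sample is not real metadata.

```python
import json
import urllib.request


def handle_api_url(handle):
    """Build the (assumed) REST URL for a bare handle such as '21.14100/abc'."""
    return "https://hdl.handle.net/api/handles/" + handle


def values_by_type(record):
    """Map each value's type to its data value in a Handle JSON record."""
    result = {}
    for entry in record.get("values", []):
        result[entry["type"]] = entry["data"]["value"]
    return result


def fetch_handle_metadata(handle, timeout=5):
    """Query the proxy (needs network access) and return {type: value}."""
    with urllib.request.urlopen(handle_api_url(handle), timeout=timeout) as resp:
        record = json.load(resp)
    return values_by_type(record)


# Offline illustration with an invented record in the assumed layout.
sample = {
    "responseCode": 1,
    "handle": "21.14100/example",
    "values": [
        {"index": 1, "type": "URL",
         "data": {"format": "string", "value": "https://example.org/file.nc"}},
        {"index": 2, "type": "CHECKSUM",
         "data": {"format": "string", "value": "abc123"}},
    ],
}
print(values_by_type(sample))
```

Which value types are present is decided by the institution that publishes the identifiers, so a generic reader should treat every key as optional.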

graybeal commented 5 years ago

Martin, on what basis do you conclude the following? I have never thought the form of the ACDD id, at least, was constrained in any way. (And examples in ACDD should definitely not be considered exhaustive!)

On Jun 13, 2019, at 9:20 AM, Martin wrote:

There is a distinction, however, in that ACDD is designed around identifiers of the form <naming authority>:<code>, where <code> is a unique identifier within the namespace managed by <naming_authority>.

Re the most recent suggestions: I think the proposal was for a citation device, not an identifier. If that is true, then to be clear, arguing that we should change it to an identifier is not 'taking a route', it is opposing the original request.

If we want an identifier that is in some way more specific in its resolvability/parseability than ACDD's id, then let's put in a new proposal for that. But if I recall correctly, this was discussed at the time for the 'id' (I was hoping the id could be resolvable) and it did not receive support, perhaps for backwards compatibility reasons.

John

martinjuckes commented 5 years ago

Hello @graybeal , sorry if I have mis-understood the original purpose of the proposal. @castelao accepted the idea of generalising the proposed new attribute to support at least hdl and doi: I think this is a good idea because these two families of identifiers have a common framework from an informatics perspective, but the hdl version is not serving the same objectives as the doi. In particular, it is not an alternative approach to data citation. As @TobiasWeigel has mentioned, it is being used in CMIP7 to provide a means of tracking millions of files. Research papers may well be based on results derived from hundreds of thousands of files: it is definitely not expected or intended that the hdl identifiers of these files be used for citation. It is hoped that the identifiers will be used for tracking data used, and publications should include a list, or a reference to a list in a repository. The doi, issued for a dataset which may contain many files, remains the recommended option for data citation.

In other words, generalising to support an additional identifier is not opposing the original intent of supporting the use of doi identifiers for citation, but would make the attribute support other uses. I would like to see the scope of any generalisation defined .. and I believe it would be useful to restrict it to identifiers which are in some sense resolvable.

Unlike the doi, the content of the metadata that can be retrieved from the hdl service is largely at the discretion of the institution publishing the identifiers. What we can say, from the standards perspective, is that there is a robust way of obtaining a dictionary of additional metadata from the hdl service. In the CMIP7 case this will provide a checksum for the file, and dataset information which can be used to check whether the file has been superseded -- but I don't think we should go into that level of detail in CF (at least not in this proposal).

My interpretation of the ACDD id is based on my understanding that the naming_authority attribute should be in some sense responsible for issuing the ID. I agree that the definition is pretty open ended, and that is perhaps the more important distinction.

JimBiardCics commented 5 years ago

Here is an overview of the persistent identifiers in common use right now. From this page.

Schemes Since the problem of persistence of an identifier is created by humans, the solution of persistent identifiers also has to involve people and services not just technologies. There are several persistent identifier schemes and all require a human service element to maintain their resolution systems. The main persistent identifier schemes currently in use are detailed below.

Digital Object Identifier (DOI) DOIs are digital identifiers for objects (whether digital, physical or abstract) which can be assigned by organisations in membership of one of the DOI Registration Agencies; the two best known ones are CrossRef, for journal articles and some other scholarly publications, and DataCite for a wide range of data objects. As well as the object identifier, DOI has a system infrastructure to ensure a URL resolves to the correct location for that object.

Handle Handles are unique and persistent identifiers for Internet resources, with a central registry to resolve URLs to the current location. Each Handle identifies a single resource, and the organisation which created or now maintains the resource. The Handle system also underpins the technical infrastructure of DOIs, which are a special type of Handles.

Archival Resource Key (ARK) ARK is an identifier scheme conceived by the California Digital Library (CDL), aiming to identify objects in a persistent way. The scheme was designed on the basis that persistence "is purely a matter of service and is neither inherent in an object nor conferred on it by a particular naming syntax".

Persistent Uniform Resource Locator (PURL) PURLs are URLs which redirect to the location of the requested web resource using standard HTTP status codes. A PURL is thus a permanent web address which contains the command to redirect to another page, one which can change over time.

Here's another, more in-depth discussion. They mention other persistent identifiers that have narrower focus, such as LSIDs and XRIs.

It's worth noting that DOI is a profile of Handle.

A survey of web discussions shows that this whole domain is in a certain degree of flux, with some people particularly advocating for ARK over DOI for data. This is part of the reason I think we really need to go with a more generic name, such as persistent_ids. Let's not box ourselves in when we don't need to.

ngalbraith commented 5 years ago

I'm not sure what we (CF community) are adding by 'defining' this field. It is currently allowed in CF-compliant files, and unless I missed something, mentioning it in the CF docs doesn't make it any more useful.

The proposal specifies syntax details that may not be appropriate in some cases. "For simplicity, the proxy part of the DOI is dropped, so it is composed by the suffix plus the prefix only" - this is putting an unnecessary restriction on the use of DOIs in CF files, since other standards may call for something different.
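To make the "proxy part" question concrete, here is a minimal Python sketch of converting between the bare form (prefix plus suffix) and the resolvable URL form of a DOI. The helper names and the list of recognised proxies are assumptions made for illustration; nothing here is defined by CF or ACDD. The example DOI is the one from the original proposal.

```python
# Hypothetical helpers; the names and the proxy list are illustrative only.
DOI_PROXIES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def strip_proxy(value: str) -> str:
    """Return the bare 'prefix/suffix' form of a DOI string."""
    text = value.strip()
    lowered = text.lower()
    for proxy in DOI_PROXIES:
        if lowered.startswith(proxy):
            return text[len(proxy):]
    return text


def with_proxy(value: str, proxy: str = "https://doi.org/") -> str:
    """Return the URL form of a DOI, whatever form it was given in."""
    return proxy + strip_proxy(value)


bare = strip_proxy("https://doi.org/10.21238/S8SPRAY1618")
url = with_proxy("doi:10.21238/S8SPRAY1618")
```

Since either form can be derived from the other, a reader tolerant of both would not need the convention to forbid one of them.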

As stated in the original proposal, the use of a DOI is required by some publishers, and so it's being used when appropriate (or when required, even if inappropriate, I guess). I believe everyone knows what it means, and how it is used - why does CF need to be involved at all?

castelao commented 5 years ago

I'm sorry for the slow response. Thank you for all the inputs, I do appreciate them. As some of you already noted, I believe that we are mixing things here (id x citation) and I was probably responsible for starting that confusion with my initial proposal. Let me try to walk this through.

Do we agree on the importance of a unique persistent identifier, whatever your preferred solution is? That would be something that could designate one specific file. It does not necessarily mean that everyone must use it, nor that there is one best solution for all, but is it OK to assume that everyone agrees that we must have some good solution to assign an id to a file?

For that purpose, the ACDD:id seems to be generic enough to accept the several possibilities discussed here (hdl, ark, urn, ...) as defended by @JimBiardCics. I agree with Jim that if we can use something that already exists, let's just do it. @martinjuckes and @TobiasWeigel, your approach to managing CMIP data makes total sense to me. With so many variables, versions, members, etc., one must have a solid way to tag each file. I believe that hdl does not violate the ACDD:id definition, but ACDD:id brings the benefit of allowing less robust id systems for other smaller datasets that don't require a sophisticated identification. Some people will do a better job on that than others, but it is for the data provider to decide their path (and also pay the price of bad choices). I support Jim's suggestion of using the field ACDD:id for the file identifier.

Note that DOI could be used to track individual files. I'll limit myself to saying that I don't do that: I do not use DOI to track my data files.

Another interesting point was raised by @graybeal - why CF does not have an id equivalent attribute? My impression is that such id is an operational matter, more than just discovery metadata. Thus it would be a legitimate case for CF according to @JimBiardCics distinction between CF and ACDD. I'm not sure that I want to suggest that, but if this is the case, we do have a precedent. All the global variables in CF are duplicated in ACDD (title, institution, source, history ...). If CF adopted the global attribute id, that would require much caution to avoid conflicts and guarantee backward compatibility. Maybe by using the exact same definition, which allows a quite broad spectrum of possibilities? Or it might require a new attribute like 'persistent_ids', as suggested by @JimBiardCics. I don't have an opinion yet if that would be worth the redundancy, but I agree that CF lacks that.

Now we are finally getting to my point. The goal of my proposal was to support efficient citation and, consequently, tracking of scientific impact. The natural choice was to use what already exists and is well established in our scientific community, the DOI, so that's what I did. But I learned that there are other options for citation, so of course it should be a generic field that allows different standards instead of being restricted to doi. I believe that was the idea of @jhausman in her comment early in the discussion, but I didn't get it at that point. @graybeal suggested using a generic 'citation'; maybe 'citation_id' would be more explicit. As a generic solution, the proxy part would obviously be required back. @TobiasWeigel and @martinjuckes, do I understand it right that you track your files with hdl, and recommend that users cite the doi of the data collection? One thing is the file identification, which would be a 1-1 tag; another thing is the citation, which is 1-N (with some edge cases of N-N), and I don't think that we can resolve the two of them at the same time without a high price. So I'll change my proposal to a generic field that would contain the citation identification.

OK, now why add a citation identification? We could use the citation as text in one of the available attributes, like many already do, and that could even include the DOI (as text), and with a sufficiently long regexp one could find it. Also, CF indeed gives each one the freedom to do it as they want. That is already happening, as I mentioned in my proposal. I have already seen some variations: doi, DOI, digital_object_identifier, doi_url, ... Well, the same argument for why we use a standard table of names, or suggest using only 'comment' instead of 'Comment', 'COMMENT' or 'comments', is the reason why it would make a difference to define a standard that accepts the DOI. I cannot understand the argument of letting each one decide how to do it at the same time that we defend CF, ACDD, and so many other standards. I can see someone thinking that a citation id is not relevant enough to have an attribute, but I can't imagine a PI who wrote a proposal to produce some sort of data being against a better way to cite that data and receive credit for it. This matter is not about ego, but about acknowledging the funding and efforts to achieve that data, and seeking the chance to keep doing more. I think it is incoherent to approve the need to assign a DOI for the CF-Conventions document but neglect a DOI for the data. I believe that a citation field is as fundamental as knowing the title of that dataset. I think that CF should support the recognition of everyone, especially the funding agencies, responsible for producing each netCDF-CF dataset.
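As a rough illustration of the automation cost described above, the following Python sketch looks for a DOI in a set of global attributes, modelled here as a plain dict rather than a real netCDF file. The function name, the regular expression (a common heuristic, not an official DOI grammar), and the sample attribute values are all assumptions made for the example; the attribute-name variants are the ones listed in the comment.

```python
import re

# Heuristic DOI pattern; an assumption for this sketch, not an official grammar.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

# Attribute names seen in the wild, as listed in the comment above.
NAME_VARIANTS = ("doi", "DOI", "digital_object_identifier", "doi_url")


def _first_doi(text):
    """Return the first DOI-like substring of text, or None."""
    match = DOI_PATTERN.search(str(text))
    if match is None:
        return None
    # Drop sentence punctuation that the pattern may have swallowed.
    return match.group(0).rstrip(".,;)")


def find_doi(global_attrs):
    """Guess the DOI of a dataset from its global attributes (hypothetical helper)."""
    # First try the known name variants, then fall back to scanning all text.
    for name in NAME_VARIANTS:
        if name in global_attrs:
            found = _first_doi(global_attrs[name])
            if found:
                return found
    for value in global_attrs.values():
        found = _first_doi(value)
        if found:
            return found
    return None


from_variant = find_doi({"doi_url": "https://doi.org/10.21238/S8SPRAY1618"})
from_text = find_doi({"comment": "Please cite doi:10.21238/S8SPRAY1618."})
missing = find_doi({"title": "no identifier here"})
```

With a single agreed attribute name, all of the guessing above would reduce to one dictionary lookup.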

I'll change my proposal for a generic citation identifier, that should allow other options than DOI. There are many other good points and ideas that I didn't mention in this comment, but I intend to include in the new version of the proposal. I'm very interested in hearing your feedback.

martinjuckes commented 5 years ago

@castelao : thanks for an excellent review. The revised/clarified focus on citation_id certainly makes sense. This rules out the use of this attribute for the CMIP style handles, which are not designed for citation, but you have drawn sensible boundaries for the attribute. As you say, the CMIP style handles are designed for tracking individual files.

There is some overlap between the CMIP handle and the ACDD id, but the ACDD id appears to be closely identified with the THREDDS dataset@id (where dataset in the THREDDS context means a file) and CMIP, like many other communities, has a tradition of using human-friendly strings for this identifier. It appears on many pages of the THREDDS interface which are designed to be read through a browser. Having an opaque unique identifier such as a DOI or handle string in the ACDD id would conflict with customary usage.

I agree that traceability of data usage is becoming important, and the DOI is the best available mechanism, so there is a good case for having a specific place in a netCDF file and recommending that people use it.

regards, Martin

JimBiardCics commented 5 years ago

@castelao I agree with @martinjuckes. That was a great summarization. I believe I was mistaken before, and the ACDD id attribute is intended to be used only as a 'per file' identifier as @graybeal said. So I think that attribute should be off the table as far as this discussion is concerned.

I still feel that this attribute is more in the purview of ACDD, and I believe that we need to develop a culture where we embrace other conventions and stop trying to add everything to CF (and re-inventing wheels along the way). Saying that some people don't use ACDD is not actually an argument for adding an attribute to CF that more correctly belongs in ACDD. We should, instead, tell people to start using ACDD. While I quite like the proposed persistent ID attribute, I also agree with @ngalbraith.

graybeal commented 5 years ago

From ACDD 1.3 definition for id:

The combination of the "naming authority" and the "id" should be globally unique, but the id can be globally unique by itself also. IDs can be URLs, URNs, DOIs, meaningful text strings, a local key, or any other unique string of characters.

On Jun 17, 2019, at 3:05 AM, Martin wrote: Having an opaque unique identifier such as a DOI or handle string in the ACDD id would conflict with customary usage.

I may be misunderstanding what you're getting at here, Martin.

As a person closely involved in the extensive discussions that resulted in the last version of ACDD, I believe that this conclusion is not supported by those discussions, or by the normative definition of ACDD id. If a user wants to use a DOI or handle string in the ACDD id, I would consider it valid, and know of nothing in its description or design that would preclude it. The doi or hdl section of the ID is essentially the naming authority in these cases (the authority that has allocated the identifier per its specification, if not actually chosen the identifier).

(I'm also not aware of any close identification with the THREDDS id, that's the first time I've ever heard of that, but of course it could be historically linked. )

I must say, the naming authority/id dualism is not entirely helpful in this definition, but it is what it is/was, I guess.

Given the foibles of memory, it's always possible I'm not fully presenting the arc of the agreement.

John

castelao commented 5 years ago

@martinjuckes and @JimBiardCics, thanks!

I didn't hear anything against the concept of an attribute to hold the citation identification, so I'll assume that we agree on that and the question now is if that attribute should go in CF or proposed in some other standard like ACDD. I'll leave the details on how to implement it for later.

@JimBiardCics , in your opinion, what are the bounds between CF and ACDD? Do you believe that we should never more add any new attribute to CF but direct everything to ACDD instead? Otherwise, which type of attribute should be considered for CF? In the current CF documentation is there any directive on that subject?

JimBiardCics commented 5 years ago

@castelao, that's a great question, and one that we haven't really tackled. ACDD is focussed on providing discovery metadata. As quoted in the ESIP "What is ACDD?" page:

“These conventions identify and define a list of NetCDF global attributes recommended for describing a NetCDF dataset to discovery systems such as Digital Libraries. Software tools will use these attributes for extracting metadata from datasets, and exporting to Dublin Core, DIF, ADN, FGDC, ISO 19115 etc. metadata formats.”

CF, on the other hand, says:

The conventions define metadata that provide a definitive description of what the data in each variable represents, and of the spatial and temporal properties of the data. This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities.

My reading of these two definitions is that CF is primarily concerned with internal structural and usage metadata—describing what is contained in each variable and how the different variables relate to each other, while ACDD is primarily concerned with external discovery metadata—describing properties of the whole file (such as the temporal-spatial envelope) with an eye to making it easier for users to find files of interest. There is clearly overlap between the two, but when you consider that CF declares only 9 global attributes—of which only 5 are directly relevant for discovery—it seems to me that there is a relatively clean distinction. In addition, ACDD has, in effect, adopted 7 of the CF global variables (makes sense, since CF came first), but doesn't address the two CF global attributes that relate most directly to structure and usage (featureType and external_variables).

So, to answer your question directly, I think we should define attributes where they make the most sense. I think an attribute such as persistent_ids (or whatever we want to call it) fits best with ACDD with its primarily outward focus, whereas an attribute such as variable_vertex_count (as a hypothetical new attribute from a different discussion) fits best with CF with its primarily inward focus. It may not always be clear, but I think it will be in most cases.

castelao commented 5 years ago

@JimBiardCics, thanks. I like that distinction in concept, but as you pointed out, there is some overlap, and some cases are not so clear to me. I have that dual feeling about citation_id. Before I go further with this argument, I have one basic question: is ACDD active? I've been using ACDD-1.3. It looks like the last update was in early 2015, is that correct? I know that there are some people here who are closely involved with it. If it is active, what is the process for proposing changes?

JimBiardCics commented 5 years ago

@graybeal, can you address @castelao's question? I believe it is active. I know a number of people involved with it, and I think they are actually working on it.

graybeal commented 5 years ago

Great summary @JimBiardCics .

@castelao It is my sense, based on questions and personal exchanges, that ACDD remains heavily used. Several large data provider communities adopted ACDD long ago and I'm pretty sure they all continue to use it.

There are occasional questions on the list for managing ACDD, and those get discussed by multiple people and answered. There have only been a few requests for changes during that period, none of them are actively moving forward that I know of. (IIRC, one was going to be quite complicated and the author dropped it, a few others were considered either not appropriate or not likely to be adoptable by responders, and were likewise dropped I assume.) I like to imagine that there are not many requests because it's such a well-decided and well-structured specification…

So while you won't see a lot of traffic about it, I think that doesn't mean it's not 'active' as a standard. And I think if a request like this comes to ACDD, the responses would give you an idea whether it would go through easily and quickly, only after some discussion, or not for a good while. Feel free to contact me off-line if you want to discuss.

castelao commented 5 years ago

Just to clarify, I think that ACDD is widely used. At least I use it myself together with CF. My question is: how could one propose something new to ACDD?

bnlawrence commented 5 years ago

I've just caught up (I think) on this. There are a couple of comments early on that I don't agree with, around the (intended) use of DOIs and the persistence of the object to which a DOI points.

Leaving aside the syntax (who owns the ID), there are some semantic and policy issues to discuss:

Let's be clear, a DOI only points to a landing page (there has to be a service delivering the object at https://doistuff). Here's my take on the issues raised.

  1. DOIs point to landing pages, so by definition they are "Discovery metadata" (as defined here: https://doi.org/10.1098/rsta.2008.0237).
  2. The landing page tells you how to de-reference the object (or objects).
  3. There might be multiple DOIs pointing to the same object (nested datasets as discussed).
  4. What we put in the objects is "A" metadata (and sometimes "B"), but mostly it's telling software how to use the data.
  5. It is helpful to put D metadata in files, but not all of it. In particular it risks a category error - putting metadata in files that need to be updated when the management of the file changes, leading to version updates, multiple copies of files, and unnecessary copies of binary data.
  6. In general one does NOT allow objects with DOIs to change, since one cannot then be sure that a citation is pointing at the object as it was. This is a pretty hard and fast rule; if we start breaking it, then it is a slippery slope.
    1. If we put one DOI in a file, but it forms part of several "datasets" as it is nested, are we telling the user which DOI is more important? At data production time I don't think we know that.

I could write more, but you can see where I am going ...

My preferred solution is to ensure that there is a UUID in the file, and a well known service which maps the UUID back to one or more DOIs ...
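A minimal Python sketch of that idea, under stated assumptions: the in-memory dict stands in for the "well known service", the function names are hypothetical, and the second DOI is a made-up placeholder. The point is only that the file carries a UUID fixed at write time, while the mapping to DOIs lives outside the file and can grow later.

```python
import uuid

# Stand-in for an external registry service; an assumption for this sketch.
registry = {}


def new_file_id():
    """Create a random (version 4) UUID string at file write time."""
    return str(uuid.uuid4())


def register(file_id, doi):
    """Record that a file belongs to the dataset identified by a DOI."""
    dois = registry.setdefault(file_id, [])
    if doi not in dois:
        dois.append(doi)


def dois_for(file_id):
    """Look up every DOI a file has been associated with."""
    return list(registry.get(file_id, []))


file_id = new_file_id()
register(file_id, "10.21238/S8SPRAY1618")       # DOI from the original proposal
register(file_id, "10.1234/example-collection")  # made-up placeholder DOI
```

Because the file itself never changes when a new DOI is registered, nested or overlapping collections do not force new versions of the file.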

JimBiardCics commented 5 years ago

@bnlawrence Thanks for chiming in! I expect that the CF community (and the ACDD community) is not all that interested in standing up a persistent UUID mapping service. I think the ACDD id and naming_authority attributes provide the mechanism for implementing your preferred solution.

Do you see any problem with adding persistent identifiers to files if they are dereferencing to landing pages for collections? It seems to me that a file containing a persistent identifier and a creation date would provide sufficient information to allow a user to sort out the appropriate information about the collection/dataset at the web site when the DOI was dereferenced. In the nested / cross-cutting example you mentioned, it still seems that a well-rounded set of ACDD metadata with a DOI or DOIs wouldn't pose any real difficulty.

But this is yet another reason I'm in favor of leaving this to ACDD. They have more people involved who know more about this.

graybeal commented 5 years ago

@castelao To propose something new re ACDD, send email to the ESIP Documentation list at esip-documentation@lists.esipfed.org (list info at http://lists.deltaforce.net/mailman/listinfo/esip-documentation). The ESIP Documentation cluster has the role of managing this specification.

@bnlawrence A few nits below, but may I ask you to spell out where you are going? Are you saying you don't like the idea of a citation_id, or of allowing DOIs as the citation_id, or something else? Other thoughts:

  1. Re your points 4 and 5, can you spell out A, B, and D? Access, something, and Discovery? Ah, just spotted it in your referenced document, I'll paste below for others to see (feel free to update your post with it!).
  2. There is a large general subset of data in which the DOI-referenced object does change, and that's data streams. It is necessary to uniquely identify that content stream or other entity that is known to update over time, and DOIs and other identifiers are often used for that purpose (semantic identifiers are a common adopter of this approach, as well as many data streams and content streams).

Summary of A-B-C-D-E categories from https://doi.org/10.1098/rsta.2008.0237:

  • A-archive metadata describes the syntax and semantics (e.g. parameter descriptions) of the data objects themselves. The concept is further described below in §2b.
  • B-browse metadata supports understanding the context of data and choosing between similar datasets. This concept is further described below in §2a.
  • C-character metadata includes citations of the data itself, and post-fact assertions as to the quality of the data. Typically, such metadata does not always exist packaged with the data itself, but may exist in third party repositories (e.g. journal archives), etc. (Note that C-metadata itself may be discoverable by D-metadata.)
  • D-discovery metadata is a subset of the browse and archive metadata, which is selected to aid finding data for evaluation or visualization and/or other uses. Typically discovery metadata is harvested and/or submitted to other organizations to aid data discovery.
  • E-extra metadata is the core discipline- or instrument-specific metadata, which may be strongly typed (i.e. conforms to schema such as SensorML) or consist of arbitrary documents. Providing consistent interfaces from B-metadata to E-metadata was one of the main challenges identified for the NDG.

bnlawrence commented 5 years ago

@JimBiardCics Yes, I'm strongly in favour of adding persistent identifiers to files, but not C or D metadata. I think that should be dealt with by web pages, and/or services, that bundle identifiers together and make the necessary links. I do want data to be citable and publishable ...

@castelao I don't like the idea of putting something which carries "D" semantics in files, which means I don't like the name citation_id, and I don't like the idea of a DOI being in the file. I do however very much like the idea of putting in place the necessary information to construct that information post-fact. (What I like is of course not the end game here, so thanks for bringing it up. As always, I'll go for the consensus, even if I'm on the other side :-) --- in this context, I think the right community to discuss it might be ACDD ... )

My historical thinking on these issues (in the context of climate modelling) is on my blog:

Streaming data is interesting. I don't think streams should have a DOI, but they should certainly have identifiers. Over the years I may not be winning on this ... but this usage conflicts with the notion of a digital object (singular) identifier ... and the use of DOI as a publication entity, not just an identifier (a la some of the discussion on my blog).

It all comes down to what we think a DOI is for. Persistent identifiers in files - big yes, DOIs, no :-).

bnlawrence commented 5 years ago

@JimBiardCics said: Disclaimer: I still think ACDD is the best place to address adding any persistent identifier attributes.

As Bryan Lawrence points out in the blog posts he references in the github issue there is some conflation of purposes for persistent identifiers. I tend to see two top-level purposes for persistent identifiers within a netCDF file.

  • Identifying the file itself uniquely.
  • Identifying some other object that has a relationship to the file.

There are likely others, but these are the ones that occur to me.

Within the second purpose I see a few different, related uses (and there are probably more):

  • Identifying a collection that the file belongs to.
  • Identifying a published paper that describes the data contained in the file.
  • Identifying an organization that is associated with the file contents in some way.

It seems to me that it's worthwhile to provide a means to accomplish both top-level purposes within netCDF files.

So what about DOIs in relation to the more general topic of persistent identifiers in netCDF files?

The International DOI Foundation (https://www.doi.org/index.html) says this about DOIs in Section 1.6.1 (https://www.doi.org/doi_handbook/1_Introduction.html#1.6.1) of their Handbook:

DOI is an acronym for "digital object identifier", meaning a "digital identifier of an object". A DOI name is an identifier (not a location) of an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks. A DOI name can be assigned to any entity — physical, digital or abstract — primarily for sharing with an interested user community or managing as intellectual property. The DOI system is designed for interoperability; that is to use, or work with, existing identifier and metadata schemes. DOI names may also be expressed as URLs (URIs).

...

Unique identifiers (names) are essential for the management of information in any digital environment. Identifiers assigned in one context may be encountered, and may be re-used, in another place (or time) without consulting the assigner, who cannot guarantee that his assumptions will be known to someone else. Persistence of an identifier can be considered an extension of this concept: interoperability with the future. Further, since the services outside the direct control of the issuing assigner are by definition arbitrary, interoperability implies the requirement of extensibility. Hence the DOI system is designed as a generic framework applicable to any digital object, providing a structured, extensible means of identification, description and resolution. The entity assigned a DOI name can be a representation of any logical entity.

Based on this description of DOIs, it seems to me that a DOI is a valid, if poor choice for the first top-level purpose that I mentioned. It also seems to me that DOIs are well-suited for accomplishing the second purpose uses. They aren't the only way to accomplish these ends, but they certainly represent a way to do so.

Grace and peace,

Jim

bnlawrence commented 5 years ago

So I agree that you could use DOIs, but I think this is a lot of baggage at write time, and sends the wrong message about what a per file identifier is intended to achieve.

You can have a unique identifier at write time via the uuid mechanism, and for the data workflow, I think that is all you want and need. All these other use cases nearly always deal with aggregations of files, and attaching a DOI to aggregations is fine by me. The CMIP tracking id in each file gives us all we need. I can see a case for having a CF tracking_id which would accomplish the same result.

amilan17 commented 5 years ago

NOAA is including the DOI in the NetCDF files when the collection has a DOI. Each NetCDF file is a part of the associated collection and therefore is considered to be under that DOI's umbrella. I'll find a specific example for reference. Ideally, when users use a subset of this collection, they will cite the resource with the DOI and provide some context as to the subset used (fileIDs, extent...).

balaji-gfdl commented 5 years ago

Isn't there a recursion problem? If you issue the DOI for the file and then embed it, the file checksum will not match the one that got the DOI....

castelao commented 5 years ago

@balaji-gfdl, as I mentioned before, a DOI is not a checksum, so this is not a problem. It is possible to register a DOI even before creating the data file.
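
A small sketch of why the ordering matters here. The "file" is stood in for by a JSON serialization of its attributes, which is an assumption for illustration only (a real NetCDF file would be hashed from its bytes on disk), and the title value is made up. The point is that the DOI is an input to the file, not something derived from it.

```python
import hashlib
import json


def checksum(attrs: dict) -> str:
    """SHA-256 of a deterministic serialization standing in for the file bytes."""
    payload = json.dumps(attrs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


attrs = {"title": "illustrative glider dataset"}
before = checksum(attrs)

# The DOI is registered first, then written into the file (attribute name assumed).
attrs["doi"] = "10.21238/S8SPRAY1618"
after = checksum(attrs)

# Embedding the DOI does change the checksum, but the DOI does not depend on
# the checksum, so the checksum is simply computed once, after embedding.
changed = before != after
```

If a checksum is published alongside the DOI, it would be the one computed after the attribute is written, so there is no circular dependency.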

castelao commented 5 years ago

@amilan17 , if I understood it correctly, we do the same thing for Spray underwater gliders. Each data file goes with the DOI of the collection, so it is feasible for the users to cite the data. How do you include the DOI in the file? We use a global attribute named 'doi', and that is it, simple like that.

What I would like to achieve here is a consensus so we all do it in the same way: doi, citation_id, or anything else, but let's follow the same procedure and take advantage of easy automation.

kenkehoe commented 5 years ago

First, @castelao my apologies for not following this thread. That is my error.

Second, I feel this proposal has gone too far into the weeds. I think the original intent was to provide a reserved attribute name to indicate a DOI, and I think we should stick to that scope. Anything dealing with what the DOI references, how the DOI is created, or whether the checker resolves the DOI to verify it is outside the scope of this proposal. All the DOI attribute should do is specify that the text is a DOI link or "list" of DOIs. I don't think CF should dictate how a DOI should be used across all programs. That should be a decision for the data provider or DOI site. The linking and searching of the DOI can be done with some other tool, with as much complexity as needed to resolve to the precision needed.

I think the question of using an attribute of "doi" or "resource_identifier" or something else comes down to how we want to use it. If the intent is to just put the information into the file we can push everything into "references" and then have software parse and just figure it out. But that is not nice to data consumers. So for example using existing CF standards we could do something like this:

references = "doi: 10.21238/S8SPRAY1618 hdl:21.14100/d9a7225a-49c3-4470-b7ab-a8180926f839 http://website.to.somewhere/papernumber/224"

and then expect the user to figure it out. But we can also make life easier on the user by turning things into key:value pairs

doi = "10.21238/S8SPRAY1618"
hdl = "21.14100/d9a7225a-49c3-4470-b7ab-a8180926f839"

so the user does not need to parse the text looking for a "doi:" keyword. I think that is the spirit of this proposal. If we also want to add "hdl" or some other attribute name to the reserved attribute list, we can do that.
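
To make the difference concrete, here is a minimal sketch of what a data consumer has to do in each case. The function `extract_dois` and its regular expression are assumptions for illustration; they handle the example string above and are not a general DOI parser. Attributes are modeled as plain dicts.

```python
import re

# Everything packed into the existing "references" attribute.
references = (
    "doi: 10.21238/S8SPRAY1618 "
    "hdl:21.14100/d9a7225a-49c3-4470-b7ab-a8180926f839 "
    "http://website.to.somewhere/papernumber/224"
)


def extract_dois(text: str) -> list[str]:
    """Find each 'doi:' keyword and capture the non-space token after it."""
    return re.findall(r"doi:\s*(\S+)", text)


# With a dedicated attribute, no parsing is needed at all.
attrs = {
    "doi": "10.21238/S8SPRAY1618",
    "hdl": "21.14100/d9a7225a-49c3-4470-b7ab-a8180926f839",
}

parsed = extract_dois(references)  # consumer must guess the text layout
direct = attrs["doi"]              # consumer just reads the key
```

The first approach breaks as soon as a provider writes `DOI:` or a full resolver URL instead, while the second depends only on the reserved attribute name.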

I suggest we make the proposal simply that there is a reserved attribute named "doi" whose value is a character array (for now, since we have a string argument happening elsewhere). The character array will be a space-separated list (same as all other character array attributes) that can list one or more DOIs.

doi = "10.21238/S8SPRAY1618 10.21238/AWESOME1618 10.21238/GLIDER1618"
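
Reading such a value back is then a one-liner. A sketch, with `split_dois` as a hypothetical helper name, assuming that no DOI in the list contains whitespace itself:

```python
doi = "10.21238/S8SPRAY1618 10.21238/AWESOME1618 10.21238/GLIDER1618"


def split_dois(value: str) -> list[str]:
    """Split a whitespace-separated attribute value into individual DOIs."""
    return value.split()


dois = split_dois(doi)
```

A file with a single DOI goes through the same code path and yields a one-element list.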

I also suggest we place no restriction requiring it to be a global attribute only. It can be attached to a variable or, now that we have groups, to as many groups as needed. The standard precedence between variable and global attributes applies: if a DOI is defined at both the variable and the global level, the variable DOI supersedes the global value for that variable.
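
The precedence rule could be sketched as follows. The function `effective_doi` and the variable names are hypothetical, attributes are modeled as dicts, and groups are left out to keep the example short.

```python
from typing import Optional


def effective_doi(global_attrs: dict, variable_attrs: dict) -> Optional[str]:
    """Return the variable-level doi if present, else the global one, else None."""
    if "doi" in variable_attrs:
        return variable_attrs["doi"]
    return global_attrs.get("doi")


global_attrs = {"doi": "10.21238/S8SPRAY1618"}
temperature_attrs = {"doi": "10.21238/GLIDER1618"}  # hypothetical variable with its own DOI
salinity_attrs = {}                                  # hypothetical variable without one

temperature_doi = effective_doi(global_attrs, temperature_attrs)
salinity_doi = effective_doi(global_attrs, salinity_attrs)
```

Here the first variable resolves to its own DOI and the second falls back to the global value.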

JimBiardCics commented 5 years ago

@kenkehoe If the proposal is for an attribute named doi, then I am against the proposal.

JonathanGregory commented 1 year ago

Dear Gui @castelao

This issue had a vigorous discussion, but did not come to a consensus or conclusion; the last comment was in July 2019. Should it be pursued, do you think?

Best wishes and thanks

Jonathan