cernopendata / opendata.cern.ch

Source code for the CERN Open Data portal
http://opendata.cern.ch/
GNU General Public License v2.0

DOI assignment - Citation recommendation #89

Closed pherterich closed 9 years ago

pherterich commented 10 years ago

For the DOIs, we might consider adjusting the primary dataset title we send for the DOI to make the citation look usable to a human reader. With the current title, a citation would look like:

CMS Collaboration (2014). /BTau/Run-2010B-Apr21ReReco-v1/AOD. CERN Open Data Portal. doi:10.1234/EXAMPLE.VOMA.I3AE

I think it might be more user-friendly to have the description:

CMS Collaboration (2014). BTau primary dataset in AOD format from RunB of 2010. CERN Open Data Portal. doi:10.1234/EXAMPLE.VOMA.I3AE

@katilp what do you think?

katilp commented 10 years ago

Good point! The latter one is better. But it is missing the reprocessing tag, which is important for one-to-one identification of the files within the CMS internal system. Do you think it is possible to have the following?

CMS Collaboration (2014). BTau primary dataset in AOD format from RunB of 2010 (/BTau/Run-2010B-Apr21ReReco-v1/AOD). CERN Open Data Portal. doi:10.1234/EXAMPLE.VOMA.I3AE

pherterich commented 10 years ago

No problem, we just have to agree on what information we provide with the DOI and adjust the format accordingly to make the BibTeX output etc. produce what we want.

tiborsimko commented 10 years ago

@pherterich Shall we store these alternative titles in the records themselves, or would you like to generate them on the fly? The former seems preferable from a long-term preservation point of view; otherwise we might not know in N years what we sent as title M years ago, if we change our how-to-generate-nice-titles in the meantime.

pherterich commented 10 years ago

We can introduce alternative titles in a separate field. I have never worked with BibTeX, so I don't know how it works. For the visible citation recommendation I thought we could pass on the alternative title only to DataCite as DOI information and then maybe use the CrossCite service as Zenodo does. @espacial might know more about the technical details.

espacial commented 9 years ago

Well, @espacial's opinion goes in the direction of storing things as they are. I'd keep the long descriptive sentence as the actual title (245...) and keep the one-to-one identification of the files as the "PID" it is, in a different field. That would help us push nice metadata when we create the DOI without losing the ID Kati needs. Of course, we should keep showing it as prominently as we do now in the records.

pherterich commented 9 years ago

Us ladies just agreed that I'll introduce "BTau primary dataset in AOD format from RunB of 2010 (/BTau/Run-2010B-Apr21ReReco-v1/AOD)" as an alternative title (246) and that will be what we push to DataCite and then we can just plug in CrossCite. In my next go through the file, I will also create the real DOIs. They'll look like the following: 10.7483/OPENDATA.CMS.2P3D.5N8E Is everyone fine with that?
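For illustration, the agreed alternative title could be derived from the dataset path roughly as follows. This is only a sketch: the function name `alternative_title` and the regular expression are assumptions made for this example, inferred from the single path quoted in this thread rather than from any CMS naming specification. In line with the preservation argument above, the result would be computed once and stored in the record (246), not regenerated on the fly.

```python
import re

# Hypothetical pattern for the middle path component, e.g. "Run-2010B-Apr21ReReco-v1".
# It is inferred from the one example in this thread, not from a CMS naming spec.
_RUN_RE = re.compile(r"^Run-?(?P<year>\d{4})(?P<era>[A-Z])-(?P<tag>.+)$")


def alternative_title(dataset_path: str) -> str:
    """Build a human-readable title that keeps the full dataset path in parentheses."""
    parts = dataset_path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected /<primary>/<run-and-tag>/<tier>, got {dataset_path!r}")
    primary, run_and_tag, tier = parts
    match = _RUN_RE.match(run_and_tag)
    if match is None:
        raise ValueError(f"unrecognised run component: {run_and_tag!r}")
    # The reprocessing tag is not spelled out in the sentence; it survives via the path.
    return (
        f"{primary} primary dataset in {tier} format "
        f"from Run{match['era']} of {match['year']} ({dataset_path})"
    )


print(alternative_title("/BTau/Run-2010B-Apr21ReReco-v1/AOD"))
```

Keeping the original path in parentheses means the reprocessing tag is never lost, even if the descriptive part of the title is worded differently later.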

lnielsen commented 9 years ago

FYI: If you need, I can move to Invenio the new DOI citationformatter module I just did for Zenodo; it supports something like 600 different citation formats (example https://zenodo.org/record/11891). It uses CrossCite, so metadata comes from CrossRef/DataCite and not the local database. It naturally also depends on having a DOI, thus only suitable if all records have a DOI.
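As a rough idea of what such a formatter does under the hood, a formatted citation can be requested from the DOI resolver through content negotiation. The sketch below is not the Zenodo module itself: the function names are made up for this example, and the DOI is the one proposed earlier in this thread, which only resolves once it has actually been minted.

```python
import urllib.request
from urllib.parse import quote


def build_citation_request(doi: str, style: str = "apa") -> urllib.request.Request:
    """Prepare (but do not send) a DOI content-negotiation request for a formatted citation."""
    url = "https://doi.org/" + quote(doi, safe="/")
    # "text/x-bibliography" asks for a formatted citation; the style is a CSL style name.
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi: str, style: str = "apa") -> str:
    """Send the request; needs network access and a DOI that is already registered."""
    with urllib.request.urlopen(build_citation_request(doi, style)) as response:
        return response.read().decode("utf-8")


request = build_citation_request("10.7483/OPENDATA.CMS.2P3D.5N8E")
print(request.full_url)
print(request.get_header("Accept"))
```

Because the metadata is served by the registration agency, whatever title is pushed to DataCite is what every citation style will show.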

katilp commented 9 years ago

Fine with me. This is also in line with the CMS publication titles, i.e. "CMS" does not appear in the title but in the author list. However, looking at the CMS publication titles raises the question of whether we should have the collision type and the collision energy in the title: "in pp collisions at sqrt(s) = 7 TeV". Maybe not, as we already define "RunB of 2010". This also underlines the "technical" aspect of the data, rather than the "result" aspect of a publication title. But in any case, the collision type and energy should at least be in metadata fields.

tiborsimko commented 9 years ago

"It naturally also depends on having a DOI, thus only suitable if all records have a DOI."

Based on past discussions, it was mostly only the primary datasets that were targeted for DOI attribution... so derived datasets would be out. @pherterich @espacial

pherterich commented 9 years ago

In the examples, we assigned fake DOIs to almost everything. I would assign DOIs to primary datasets and the derived datasets. But in the end I think it should be @katilp who decides. I also think it would be nice to have some of the code with a DOI, but for that we should probably ask the creators.

tiborsimko commented 9 years ago

Personally I'd advocate having a DOI for all records managed by the portal, since all should be preservation-worthy. However, in some cases the file resources are external: we only point to them, we don't manage them. So in these cases we should be careful.

katilp commented 9 years ago

The primary datasets all get a DOI. For the derived datasets, I think they should get one as well.

For the code, following the logic above (DOIs for the derived datasets which have the code to produce it on the portal), I agree with @pherterich that it would be good to have a DOI for them. The code with DOI should eventually be hosted on the portal or on a specific github area connected to it.

@tpmccauley and @ayrodrig, do you agree (once the fixes and updates planned for the short-to-medium term are done in the current areas) that your examples for producing ig (http://opendata.cern.ch/record/550), csv (di-muons for the histogramming) or pattuples (http://opendata.cern.ch/record/200) and the analysis example (http://opendata.cern.ch/record/101) get a DOI?

@tpmccauley what about the event display? For the histogramming examples (to come), I'm not sure. @tpmccauley: what do you think?

TimSmithCH commented 9 years ago

"The code with DOI should eventually be hosted on the portal or on a specific github area connected to it."

The code with a DOI should always be in the portal - it's a snapshot out of GitHub that is guaranteed to stay exactly the way it was when you snapshotted it. In addition, the actionable version in GitHub should be linked, but not relied on for the preservation actions of the portal.

katilp commented 9 years ago

@TimSmithCH Right, so we should have a snapshot at the time of release (or in general, after this first release, when the code enters the portal). I'm however wondering how to handle this in practice:

My only worry is that we do not make too rigid a structure for the contributors. However, you are absolutely right in imposing a certain rigour on us :-) And I guess there are no problems for updating (with fixes and corrections) the code with a new snapshot without changing the DOI.

tiborsimko commented 9 years ago

"And I guess there are no problems for updating (with fixes and corrections) the code with a new snap-shot without changing the DOI"

If the code changes, then a new DOI should be issued in my eyes; similarly to how a new DOI is issued for an article if its full text changes.

As for practical GitHub <-> Portal synchronisations, one technique is to follow GitHub releases. E.g. analysis package code is released as Foo v1.0.0 on GitHub and the platform takes and preserves that. Afterwards a bug report comes, the code is fixed in GitHub, another bug report comes, and is also fixed, after which Foo is released as v1.0.1 on GitHub, and the platform takes and preserves that. This means that in-between releases only GitHub has the very latest bleeding-edge version of Foo; but each new release would trigger a new archival of Foo on the portal (meriting a new DOI in the process). (This is similar to how GitHub<->ZENODO synchronisation works.)
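The release-following logic above can be sketched as a small comparison between what has been released and what has already been archived. This is a minimal sketch under stated assumptions: the names `releases_to_archive` and `version_key` are hypothetical, tags are assumed to look like `v1.0.1`, and the actual fetching of release lists, snapshotting and DOI minting are left out.

```python
from typing import Iterable, List, Set, Tuple


def version_key(tag: str) -> Tuple[int, ...]:
    """Turn a tag such as 'v1.0.1' into (1, 0, 1) so tags sort numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def releases_to_archive(released: Iterable[str], archived: Set[str]) -> List[str]:
    """Return released tags that have no snapshot on the portal yet, oldest first."""
    pending = [tag for tag in released if tag not in archived]
    return sorted(pending, key=version_key)


# Hypothetical state: Foo v1.0.0 is already preserved, v1.0.1 was just released.
released_on_github = ["v1.0.1", "v1.0.0"]
archived_on_portal = {"v1.0.0"}

for tag in releases_to_archive(released_on_github, archived_on_portal):
    print(f"archive Foo {tag} and mint a new DOI for it")
```

Commits made between two releases never appear in this comparison, which matches the idea that only tagged releases are preserved and receive a DOI.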

Finally, regarding information for general public, the portal itself could present any needed instructions (see the email thread with Jochen), which can then be updated independently of GitHub releases and DOIs. This is good especially for minor clarifications, as triggered by questions from the general public, for matters that do not need new code release.

katilp commented 9 years ago

The GitHub<->ZENODO synchronisation guide was really nice, thanks @tiborsimko for pointing to it. Do you have use-cases of bug fixes/other updates of some code in ZENODO? The instruction page does not really give that possibility. Do I understand correctly that it always requires a new DOI? Do papers get a new DOI for an Erratum?

The code with DOI is an interesting case, because unlike most traditional items to be archived (papers, videos, figures), the instructions part becomes an essential ingredient of the record itself. The solution of having the instructions on the portal is probably the best and it may be interesting to study further (cfr analysis preservation portal) how these instructions could be structured in general.

Worth noting, however, that the purpose of the code on the portal (as it is now) is rather to be examples of use of our data than a preserved record of some final established result. But you are probably right, the purpose (whether it is an example or code used for published analysis) does not necessarily change the archival procedure.

tiborsimko commented 9 years ago

@pherterich @espacial Can you please create any remaining DOIs so that we can close this issue?

pherterich commented 9 years ago

Just to confirm, @katilp the CMS masterclass files such as http://opendata.cern.ch/record/300 shall get a DOI? @tiborsimko I create DOIs for Ana's GitHub code and you create the snapshot? ALICE and LHCb (what about ATLAS?) masterclass datasets get a DOI, am I missing something?

katilp commented 9 years ago

Yes, this and similar derived datasets are all files for which CMS CB has given an approval.

tiborsimko commented 9 years ago

"@tiborsimko I create DOIs for Ana's GitHub code and you create the snapshot?"

Yes, that would be good. @ayrodrig Can you please tag some good version of your code on GitHub with some tag, for example v1.0.0, so that we could grab a well-defined version? I could then attach the corresponding tarballs next to your records:

and @pherterich would stamp them with DOIs.

ayrodrig commented 9 years ago

"@ayrodrig Can you please tag some good version of your code on GitHub with some tag, for example v1.0.0, so that we could grab a well-defined version?"

Done.

pherterich commented 9 years ago

@ayrodrig Are you fine with a "CMS-DOI" like this 10.7483/OPENDATA.CMS.GS6N.54B9 or would you rather prefer them not to be associated with CMS? 10.7483/OPENDATA.GS6N.54B9

ayrodrig commented 9 years ago

@pherterich A "CMS-DOI" is fine with me. What are you using for other CMS contributions? I guess all the DOIs for each of the experiments should follow the same form. Right?

pherterich commented 9 years ago

@ayrodrig So far everything has an experiment-specific DOI; I was just wondering whether your code could be used outside of the CMS context, in which case it might be worth signaling that in the DOI. But a CMS-DOI is perfect. Thx!
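To make the two variants concrete, here is a small sketch that tells an experiment-specific DOI apart from an experiment-neutral one. The pattern is an assumption inferred from the two DOIs quoted above (an optional experiment label followed by two four-character blocks); it is not an official specification of the portal's DOI scheme, and `experiment_of` is a name made up for this example.

```python
import re
from typing import Optional

# Assumed shape: 10.7483/OPENDATA[.<EXPERIMENT>].<4 chars>.<4 chars>
_DOI_RE = re.compile(
    r"^10\.7483/OPENDATA"
    r"(?:\.(?P<experiment>[A-Z]+))?"
    r"\.(?P<suffix>[A-Z0-9]{4}\.[A-Z0-9]{4})$"
)


def experiment_of(doi: str) -> Optional[str]:
    """Return the experiment label embedded in the DOI, or None if there is none."""
    match = _DOI_RE.match(doi)
    if match is None:
        raise ValueError(f"not a portal-style DOI: {doi!r}")
    return match.group("experiment")


print(experiment_of("10.7483/OPENDATA.CMS.GS6N.54B9"))
print(experiment_of("10.7483/OPENDATA.GS6N.54B9"))
```

The first call yields the CMS label and the second yields no label, which is the distinction being offered to the code's author.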

ayrodrig commented 9 years ago

@pherterich Ok, thanks for clarifying. Should this record http://opendata.cern.ch/record/200 also have a DOI?

pherterich commented 9 years ago

This and http://opendata.cern.ch/record/101 are the ones I just assigned DOIs to and @tiborsimko will take a snapshot of the GitHub version you just tagged. That snapshot will be stored on the portal and will be the version the DOI will point to (as it's the persistent one and not changing as your GitHub version might).

tiborsimko commented 9 years ago

@ayrodrig @pherterich Thanks, I'll grab the tarball.

tiborsimko commented 9 years ago

@pherterich Done in 1f1608793c46136a2d67823fe2d9c94317db8741. Any other records in need of DOIs?

pherterich commented 9 years ago

I think I put all of them in the xml files, @espacial is minting the last ones today so they're all active from tomorrow on. Unless something gets a DOI assigned that I don't know of, this issue can be closed and then handled as concrete issues for new records that will need a DOI.

tiborsimko commented 9 years ago

OK, thanks, closing then.