Files: Persistent Identifiers for Files (DOIs for files)

eaquigley commented 8 years ago

Since we are moving towards individual pages for files, we also need to think about what the persistent identifier will be for them.

lwo commented 8 years ago

I am entering via: https://groups.google.com/forum/#!msg/dataverse-community/gtz2npccWjU/i7_EVs2LBgAJ

... I think persistent identifiers should not be derived from the local file id. An organization may want to migrate from their repository solution to dataverse and be able to import their PIDs if they already have them. And then rebind them to the dataverse dataset and future file pages. That would be possible with a String type, but not with a number.

djbrooke commented 7 years ago

Sebastian brought this up on the community call 8/16 and asked if it would be included in 4.6. There are currently no plans to work on this for 4.6.

mheppler commented 7 years ago

Closing this issue as a duplicate of Files: Need Persistent Identifiers/URL's for Data Files #2700.

pdurbin commented 7 years ago

A "DOIs for files" thread was just started at https://groups.google.com/d/msg/dataverse-community/JX2GLqPy_yE/6dzgXGVcCAAJ which reminds me that @djbrooke and I discussed this issue as well as #2700 last week.

In short, #2700 was a bit more about a specific need for putting into print specific instructions about how to download files from an installation of Dataverse. A combination of a DOI and a file name was sufficient so I'll close that issue.

This issue was originally closed as a duplicate of #2700 but this issue is actually the "DOIs for files" issue so I'll reopen it. Note that the title is "persistent identifiers for files" to be more generic than DOIs, to include Handles or whatever other schemes make sense.

scolapasta commented 6 years ago

Two recent discussions come up recently where we would benefit from persistent IDs for files:

supporting the Data Citation recommendations for granularity
using the persistent identifiers for provenance (as provenance should never change)

djbrooke commented 6 years ago

Thanks @mheppler for offering to drop in a quick mockup here.

There are some open questions about the ordering of the items in the file citation, what's included, and how to include both the file DOI and dataset DOI. @mercecrosas and @scolapasta will connect about this.

scolapasta commented 6 years ago

Discussed with @mercecrosas some of the open questions - she is going to RDA next week and will be able to consult with Martin Fenner from Datacite, so answers may have to wait until then, but we should be able to get started on some of the logic for this in the meanwhile.

mheppler commented 6 years ago

Here is the mockup of the file pg with the updated file citation format, and new File Persistent ID displayed.

file-doi-filepg

mheppler commented 6 years ago

UI IMPACT

[x] Citation block styling: gray outline+background, added small file icon (also add small dataset icon to citation block on dataset pg)
[ ] Citation format: moved file name up, before the DOI, moved dataset name to the end, as a suffix with a new "datasetName" bracketed label.
[ ] File metadata: added new "File Persistent ID" label and value

scolapasta commented 6 years ago

After further consulting with @mercecrosas a few additional points:

There will be one DOI (or other persistent identifier) for each file in dataverse. As an example, if you replace a file, that new file would have a separate DOI.
We're still determining what we will use for the identifier, but it will likely be based on the DOI for the dataset
We will not change the citation as above. Rather we will keep the citation for files as is, but aldd the doi at the end similar to the name with [fileDOI].

For example:

LastName, FirstName, 2017, "Dataset", doi:10.5072/FK2/XXXXXX, Demo Dataverse, V1; filename.txt [fileName] doi:10.5072/FK2/XXXXXX/YYY

adam3smith commented 6 years ago

I'm excited you're working on this, thank you! As requested by Gustavo, a couple of comments.

About the citation: I do think having two DOIs is unwieldly -- e.g. most reference managers are not going to be able to produce such citations, and they're going to be quite long. (As an aside, the DOI display guidelines (CrossRef, Datacite) now universally discourage the use of doi: and favor https://doi.org/. I think Dataverse should switch its default. It's one of the things we customize in our install)

While the suggested citation is more a question of taste, this:

We're still determining what we will use for the identifier, but it will likely be based on the DOI for the dataset

is bad practice. DOIs should not contain human-readable ("semantic") elements; the connection between file & data is handled in the metadata (isPartOf in the Datacite kernel), not in the DOI itself, and the DOI should not suggest to users that it is. See e.g. Martin Fenner on the topic here.

jggautier commented 6 years ago

Working on a draft DataCite metadata template for files.

rbhatta99 commented 6 years ago

latest Update before me handing this over: The latest commit is broken for now, as a lot of changes are needed to made to properly generate and register DOIs for individual datafiles. There are placeholders in various DOI/handle registration beans, which will take in DvObjects instead of datasets, which will accomodate for the registration of the datafiles. The current state of the branch is:

A new command has been written for the creation and saving of DataFiles (namely CreateDataFileCommand), which works fine as long as the files are not sent for registration and DOI generation
@jggautier has provided a template for the metadat of a DataFile, which needs to be sent over to Datacite for registration.
I have written a couple of functions in abstractIdServiceBean which takes in DvObjects instead of datasets, but have commented them out, as call a lot of functions, which in themselves use Datasets as arguments, and need to be changed to accomodate DvObjects.

mercecrosas commented 6 years ago

Sebastian,

Apologies it took me so long to reply. You bring up very useful comments:

On the File citation format with 2 DOIs: I agree with you are pointing out. We'll revisit it based on your comments. Since the beginning of Dataverse, we have tried to follow the format proposed by Altman & King 2007 (https://gking.harvard.edu/files/abs/cite-abs.shtml), but at the same times we want to be compliant with the Data Citation implementation roadmap as well as, as you say, follow a standard that works well for reference managers. So, based on all this, do you have suggestions for the format for the file-level citation that Dataverse can use? (of course, the citation metadata will include the file and the dataset DOI with the proper relationship, but we are here focusing on the format we display for file-level citation).
DO should not contain semantic elements: I agree and disagree. I know that Martin is a big advocate of this (ID should be opaque, and not contain human-readable-elements), and I understand well why is a good idea. However, user-friendly IDs do help users to quickly have an idea of what that object means, how it relates to others, etc. The system, of course, should not rely on this and even users should understand that what is important is the metadata and not to use any semantics in the ID as a guarantee of what it is. But still, I think it's useful for users. Dryad uses something similar, and at a first glance it seems useful. That's why we propose it. If the community feel strongly that we should use arbitrary IDs, we can change this. I'm interested in hearing opinions from the community!

Merce

Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, IQSS, Harvard University T:@mercecrosas W:mercecrosas.com

On Thu, Sep 28, 2017 at 2:32 PM, Sebastian Karcher <notifications@github.com

wrote:

I'm excited you're working on this, thank you! As requested by Gustavo, a couple of comments.

About the citation: I do think having two DOIs is unwieldly -- e.g. most reference managers are not going to be able to produce such citations, and they're going to be quite long. (As an aside, the DOI display guidelines ( CrossRef https://urldefense.proofpoint.com/v2/url?u=https-3A__www.crossref.org_display-2Dguidelines_&d=DwMFaQ&c=WO-RGvefibhHBZq3fL85hQ&r=n9HCCtgqDPssu5vpqjbO3q4h2g6vMeTOp0Ez7NsdVFM&m=jni44ER7oPa2xLcWjQxzXX2Ya2fFP2w88aLv1VPuoC4&s=ZnMSVSvGbwnBC7oFyLutrq3_6y6Y0FI9InsAPVprszU&e=, Datacite https://urldefense.proofpoint.com/v2/url?u=https-3A__support.datacite.org_v1.0_docs_datacite-2Ddoi-2Ddisplay-2Dguidelines&d=DwMFaQ&c=WO-RGvefibhHBZq3fL85hQ&r=n9HCCtgqDPssu5vpqjbO3q4h2g6vMeTOp0Ez7NsdVFM&m=jni44ER7oPa2xLcWjQxzXX2Ya2fFP2w88aLv1VPuoC4&s=HjPxTTyMJbT404WY0GrDaNY7WvizbJjyHRT8WHxZ7OU&e= now universally discourage the use of doi: and favor https://doi.org/ https://urldefense.proofpoint.com/v2/url?u=https-3A__doi.org_&d=DwMFaQ&c=WO-RGvefibhHBZq3fL85hQ&r=n9HCCtgqDPssu5vpqjbO3q4h2g6vMeTOp0Ez7NsdVFM&m=jni44ER7oPa2xLcWjQxzXX2Ya2fFP2w88aLv1VPuoC4&s=l4vXFmPwT2UmjGhoZ-Qrv43jytf-_MageTK5Xm-0Sl4&e=. I think Dataverse should switch its default. It's one of the things we customize in our install)

While the suggested citation is more a question of taste, this:

We're still determining what we will use for the identifier, but it will likely be based on the DOI for the dataset

is bad practice. DOIs should not contain human-readable ("semantic") elements; the connection between file & data is handled in the metadata ( isPartOf in the Datacite kernel), not in the DOI itself, and the DOI should not suggest to users that it is. See e.g. Martin Fenner on the topic here https://urldefense.proofpoint.com/v2/url?u=https-3A__blog.datacite.org_cool-2Ddois_-23suffix&d=DwMFaQ&c=WO-RGvefibhHBZq3fL85hQ&r=n9HCCtgqDPssu5vpqjbO3q4h2g6vMeTOp0Ez7NsdVFM&m=jni44ER7oPa2xLcWjQxzXX2Ya2fFP2w88aLv1VPuoC4&s=qKFsAs2krWWLiVl_YfwFq7rk5SFIAT6PldaJpXo6gZ8&e= .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_IQSS_dataverse_issues_2438-23issuecomment-2D332924989&d=DwMFaQ&c=WO-RGvefibhHBZq3fL85hQ&r=n9HCCtgqDPssu5vpqjbO3q4h2g6vMeTOp0Ez7NsdVFM&m=jni44ER7oPa2xLcWjQxzXX2Ya2fFP2w88aLv1VPuoC4&s=PRZssEJHAojjSSdufePICA2nrB_qPQUf7gDBZevTgtk&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AApQyOmzCPdG3WSU3JNy7d9mCc-2DS9zl8ks5sm-2DZJgaJpZM4FnlnG&d=DwMFaQ&c=WO-RGvefibhHBZq3fL85hQ&r=n9HCCtgqDPssu5vpqjbO3q4h2g6vMeTOp0Ez7NsdVFM&m=jni44ER7oPa2xLcWjQxzXX2Ya2fFP2w88aLv1VPuoC4&s=xtuql6vJgDk6b6Ds6taAKGGmPrL0cCEUDeaypLWlpkA&e= .

pdurbin commented 6 years ago

I got a quick brain dump from @sekmiller and plan to code review his pull request: #4224

TaniaSchlatter commented 6 years ago

Here's a revised mockup - note that the citation button should say "Cite Dataset":

jggautier commented 6 years ago

Just some more clarifying notes after talking with @TaniaSchlatter:

The line below the file name should read "File PID" or ("Persistent ID"?), and of course followed by the file PID.
The file page URL there now is just a placeholder.
If the PID is a DOI (in some installations it won't be), it should be a URL, e.g. https://doi.org/10:7910/DVN/...

mheppler commented 6 years ago

Discussed the mockup @TaniaSchlatter added, compared to the UI work that I had already to this branch and found a common ground that looks something (read: exactly) like the screenshot attached. Sending to Code Review.

screen shot 2017-10-25 at 5 25 41 pm

landreev commented 6 years ago

What is the plan for the existing production DataFiles - they will need to be assigned individual identifiers, correct? Will there be a migration script? Is it going to be handled as a different issue?

landreev commented 6 years ago

I've moved the ticket into QA. I've done as much reviewing of it as can be done without doing a full QA on my own, I think. My main remaining question: the code that handles finalizing adding the newly created datafiles to the dataset, and saving them on the storage permanently, has been moved from the ingest service and into the create datafile command. So we save each file individually now, as opposed to the way it was being done, on a list of files, all at once. From just looking at the code, it's not immediately clear to me whether there is any performance hit related to this. So I recommend that we specifically test this during QA - try adding 1000 files via drag and drop, and see if there is any difference in performance, before vs after.

djbrooke commented 6 years ago

@landreev good catch(es). Yes we should provide these DOIs for existing files. There should be necessary migration scripts provided with this issue and we should provide it for any type of PIDs used by the community.

@landreev @kcondon @scolapasta do you think we should move this to QA and work on the migration scripts in parallel or should it go back to dev?

landreev commented 6 years ago

@kcondon @scolapasta @djbrooke There's enough working functionality in this issue as is, that can be QA-ed. So I'm not sure it would be justified, to delay it until the migration process is added.

That said, I did not work on this issue myself, and I will not be doing the QA - so mine is definitely not the authoritative opinion on this...

landreev commented 6 years ago

@kcondon @scolapasta @djbrooke (rather than moving the issue back into dev., opening a new issue for the migration process may be more in line with the "small chunks" approach - ?)

djbrooke commented 6 years ago

We need to add APIs for the migration script to use for this to be released and to have benefit to the community. Moving back to dev to add those APIs.

(updated because I read "migration process" as "migration script" :))

pdurbin commented 6 years ago

I hate to say it, but now that 4.8.3 is out upgrade_v4.8.2_to_v4.8.3.sql should be renamed yet again.

pdurbin commented 6 years ago

From standup, it sounds like this issue is still being worked on. We are trying to incorporate feedback from partners into the 2438-persistent-identifiers-for-files branch.

I did go ahead and fix the SQL script filename in 3166074

sekmiller commented 6 years ago

For testing the following sql script will create the new fields on the dvObject table, populate them from the values on the dataset table, and remove the fields from the dataset table: scripts/database/upgrades/upgrade_v4.8.3_to_v4.8.4.sql

To populate identifiers in existing files there is an api: http://$SERVER/api/admin/registerDataFileAll which requires a superuser token

When any kind of datafile (except for a file uploaded specifically as a thumbnail file) a DOI should be assigned. If you are setup for EZID you can verify that a DOI has been created for a file by using this link: https://ezid.cdlib.org/id/doi:10.5072/FK2/MFS2BW (with the file's DOI - which can be viewed on the file landing page.)

If a dataset containing a given file is destroyed, the DOIs for the corresponding files should be destroyed as well. If a file is removed from a subsequent version its DOI should remain. If a file is deleted before being published its DOI should be deleted.

pdurbin commented 6 years ago

@sekmiller discussed how it might be handy to test both #2438 and #4295 in a single branch so I created a 2438-4295-dois-for-files branch. I started with the 2438-persistent-identifiers-for-files branch and created that new branch. Then I merged 4295-File-download-api-via-persistent-identifiers into it. Then I merged develop into it. @sekmiller said he'd take a look. He can make a pull request and do the "connects to" stuff in Waffle if he thinks this is better.

dlmurphy commented 6 years ago

Just committed updated documentation for this issue in f97e87ca34caa527151fd08787427f2057314c79. Added a long overdue user guide section about the File Page, and made sure file DOI related sections were accurate across all guides. Documentation is ready for review!

kcondon commented 6 years ago

-Need to add doc for 2 new api endpoints x-File DOIs do not resolve to file landing page, gives 404, needs to write correct file pg url to record rather than dataset url. Confirmed issue for handle, need to check others. (Fixed for hdl, works for DataCite) -Uploading a large number of files (1000) is considerably slower (7-11 mins versus 1 min) when registering (This happens with DataCite on save too so it is this branch's processing, not necessarily registering since DataCite only registers on publish) (hdl 7mins, datacite 7:40mins). -Scrollbar on file table does not work with 1000 files, not present. Works in /develop.

djbrooke commented 6 years ago

Thanks @kcondon for the numbers on the performance hit. Do we see a similar (percentage) increase on 10 or 50 files?

kcondon commented 6 years ago

Good question: 10 files: 4s before, 6s after 50 files: 5s before, 26s after

Be aware that handles and doi from ezid register on create (upload/save) and so could potentially be mitigated by uploading in smaller batches. DataCite registers only on publish currently so smaller batches won't help there.

pdurbin commented 6 years ago

Uploading a large number of files (1000) is considerably slower (7-11 mins versus 1 min) when registering

@sekmiller asked me to take a look at this.

pdurbin commented 6 years ago

One observation I'll make is that as of d75e866 when I click a file in a search card I land on a URL with persistentId in the URL (i.e. http://localhost:8080/file.xhtml?persistentId=doi:10.5072/FK2/TSA7BQ ) but when I click the same file on the dataset page I see fileId in the URL instead ( i.e. http://localhost:8080/file.xhtml?fileId=237&version=DRAFT&version=. ). Do we want these to be consistent? I assume so. (Also note version=. which is very strange, perhaps related to the strange version=.0 reported at #4308.) I'm logged in as dataverseAdmin if that matters.

pdurbin commented 6 years ago

Uploading 1000 small text files as of d75e866 (via zip, attached) and then clicking "Save Changes" is definitely slow, taking about 8 minutes, as reported above. In addition, these three lines appear in server.log for each of the 1000 files:

[2017-12-12T10:45:18.123-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dvn.core.index.DOIEZIdServiceBean] [tid: _ThreadID=106 _ThreadName=http-listener-1(4)] [timeMillis: 1513093518123] [levelValue: 900] [[
  String edu.ucsb.nceas.ezid.EZIDException: bad request - unrecognized identifier scheme]]

[2017-12-12T10:45:18.123-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dvn.core.index.DOIEZIdServiceBean] [tid: _ThreadID=106 _ThreadName=http-listener-1(4)] [timeMillis: 1513093518123] [levelValue: 900] [[
  localized message bad request - unrecognized identifier scheme]]

[2017-12-12T10:45:18.123-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dvn.core.index.DOIEZIdServiceBean] [tid: _ThreadID=106 _ThreadName=http-listener-1(4)] [timeMillis: 1513093518123] [levelValue: 900] [[
  cause]]

Finally, the files are not indexed but this is a known issue, first reported at #3243 (I created the dataset first and then later uploaded the files).

Here's the zip file I used to test: 1000uniquefiles.zip (created with for i in {1..1000}; do echo $i > file$i.txt; done).

pdurbin commented 6 years ago

I re-tested the same zip file with 1000 small files above on the develop branch as of 3a53923 and it was much faster. "Only" 30 seconds or so instead of ~8 minutes.

Also, I showed this to @kcondon already but I'm seeing this stack trace over and over in server.log just after uploading the zip file in the GUI but before clicking "Save Changes":

[2017-12-12T11:33:33.775-0500] [glassfish 4.1] [INFO] [] [edu.harvard.iq.dataverse.ingest] [tid: _ThreadID=30 _ThreadName=http-listener-1(1)] [timeMillis: 1513096413775] [levelValue: 800] [[
  cause.getMessage() was null for java.lang.reflect.InvocationTargetException]]

[2017-12-12T11:33:33.775-0500] [glassfish 4.1] [SEVERE] [] [] [tid: _ThreadID=30 _ThreadName=Thread-9] [timeMillis: 1513096413775] [levelValue: 1000] [[
  java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor640.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at edu.harvard.iq.dataverse.ingest.IngestableDataChecker.detectTabularDataFormat(IngestableDataChecker.java:592)
    at edu.harvard.iq.dataverse.util.FileUtil.determineFileType(FileUtil.java:290)
    at edu.harvard.iq.dataverse.util.FileUtil.createDataFiles(FileUtil.java:810)
    at edu.harvard.iq.dataverse.EditDatafilesPage.handleFileUpload(EditDatafilesPage.java:1794)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.el.parser.AstValue.invoke(AstValue.java:289)
    at com.sun.el.MethodExpressionImpl.invoke(MethodExpressionImpl.java:304)
    at org.jboss.weld.util.el.ForwardingMethodExpression.invoke(ForwardingMethodExpression.java:40)
    at org.jboss.weld.el.WeldMethodExpression.invoke(WeldMethodExpression.java:50)
    at com.sun.faces.facelets.el.TagMethodExpression.invoke(TagMethodExpression.java:105)
    at org.primefaces.component.fileupload.FileUpload.broadcast(FileUpload.java:318)
    at javax.faces.component.UIViewRoot.broadcastEvents(UIViewRoot.java:755)
    at javax.faces.component.UIViewRoot.processDecodes(UIViewRoot.java:931)
    at com.sun.faces.lifecycle.ApplyRequestValuesPhase.execute(ApplyRequestValuesPhase.java:78)
    at com.sun.faces.lifecycle.Phase.doPhase(Phase.java:101)
    at com.sun.faces.lifecycle.LifecycleImpl.execute(LifecycleImpl.java:198)
    at javax.faces.webapp.FacesServlet.service(FacesServlet.java:646)
    at org.apache.catalina.core.StandardWrapper.service(StandardWrapper.java:1682)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:344)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
    at org.glassfish.tyrus.servlet.TyrusServletFilter.doFilter(TyrusServletFilter.java:295)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
    at org.ocpsoft.rewrite.servlet.RewriteFilter.doFilter(RewriteFilter.java:205)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:316)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:160)
    at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:734)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:673)
    at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:99)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:174)
    at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:415)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:282)
    at com.sun.enterprise.v3.services.impl.ContainerMapper$HttpHandlerCallable.call(ContainerMapper.java:459)
    at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:167)
    at org.glassfish.grizzly.http.server.HttpHandler.runService(HttpHandler.java:201)
    at org.glassfish.grizzly.http.server.HttpHandler.doHandle(HttpHandler.java:175)
    at org.glassfish.grizzly.http.server.HttpServerFilter.handleRead(HttpServerFilter.java:235)
    at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:284)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:201)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:133)
    at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:112)
    at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:561)
    at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:117)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:56)
    at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:137)
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:565)
    at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:545)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.BufferUnderflowException
    at java.nio.ByteBuffer.get(ByteBuffer.java:688)
    at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:285)
    at edu.harvard.iq.dataverse.ingest.IngestableDataChecker.testRDAformat(IngestableDataChecker.java:549)
    ... 59 more]]

pdurbin commented 6 years ago

Somewhat unsurprisingly, the performance problem seems to be in sending the requests over the wire to EZID to reserve a DOI one by one for each of the 1000 files. I'm not sure if there's a quick fix for this:

screen shot 2017-12-12 at 3 22 31 pm

kcondon commented 6 years ago

I wonder what it does for DataCite since that should not be sending requests on create/upload and it exhibits the same slowness?

pdurbin commented 6 years ago

@kcondon good thought. I can try to switch to DataCite and repeat the same test. Before I move of of EZID however, let me show another interesting finding. It appears that quite a bit of time is also being spent in generateDatasetIdentifier (23%), the alreadyExists method. @sekmiller suspects this may be a file-level operation and that we could simply rename the method if it's confusing:

screen shot 2017-12-12 at 3 45 22 pm

pdurbin commented 6 years ago

Ok, please ignore the actual time spent because I took a screenshot before "Save Changes" finished but for DataCite, the time (yes, it's slow) is being spent in the testDOIExists method which reaches out over the wire to DataCite using an HTTP client (org.apache.http.impl.client.CloseableHttpClient):

screen shot 2017-12-12 at 4 27 39 pm

pdurbin commented 6 years ago

@sekmiller and I did a little brainstorming on potential fixes to the performance problem of registering 1000 files with an external service like EZID or DataCite or Handle.

On "Save Change" after upload or on publish (when reserved persistent identifiers are made public or whatever) we don't want to see the spinner in the GUI for 8 minutes:

screen shot 2017-12-12 at 4 31 34 pm

On "Save Changes" it would be nice to lock the dataset and let the GUI return right away. In the background, the DOIs for each of the 1000 files could be reserved. There are a few places in the app where we do background processing like this:

ingest (JMS)
grant role (indexing is called with @Asynchronous if memory serves)
the "crawl filesystem" code before we introduce the "data package" concept (Java EE Batch API)
the SBGrid "lock dataset on publish while me move files from 'hold' to 'public'" concept (Workflows)

Maybe there are others? We need to make a decision on if we're going to try one of the approaches above or another approach in the current sprint or defer this work until later on.

By the way, I forget what the user experience is for SBGrid when you click "Publish" and have to wait for a lot of background processing but I assume we'd want the same user experience. That is, we wouldn't want to say that the dataset is published until each file has a public DOI or Handle.

scolapasta commented 6 years ago

I'd nominate spinning off the performance as a separate issue. That doesn't necessarily mean that it is not part of this pull request / merge, just that as a separate issue, we can separate out all comments related to just performance.

(Part of my reasoning is that I don't want my next comment to get lost :)

scolapasta commented 6 years ago

Discussed further @mercecrosas and for the identifier we've decided that we will make it configurable per installation. Initially we will provide two different styles, one will be an arbitrary identifiers, as datasets currently used, and the other will be one that will build on the identifier of the dataset (e.g. 1234/567 for a dataset with identifier 1234).

As @mercecrosas says above: "user-friendly IDs do help users to quickly have an idea of what that object means, how it relates to others, etc. The system, of course, should not rely on this and even users should understand that what is important is the metadata and not to use any semantics in the ID as a guarantee of what it is. But still, I think it's useful for users. Dryad uses something similar, and at a first glance it seems useful. "

kcondon commented 6 years ago

OK file urls: From file card: https://dataverse-internal.iq.harvard.edu/file.xhtml?persistentId=doi:10.5072/Q69UCB From dataset files tab: https://dataverse-internal.iq.harvard.edu/file.xhtml?persistentId=doi%3A10.5072%2FQ69UCB&version=2.0

So, they both now use the persistentId form as requested but I'm told the escaped format of the dataset files tab URL is due to how the backend processing works. Note that adding &version=2.0 to the end of the file card url (no escaping needed) also works.

pdurbin commented 6 years ago

@kcondon good catch. The same thing happens at the dataset level (and has for a long time) when you create a dataset. In the URL you see %3A for a colon (:) and %2F for a slash (/). I'm not sure if it's in scope to fix it at the dataset level as well or not.

pdurbin commented 6 years ago

The code in question that was generating a URL with persistentId=doi%3A10.5072%2FFK2%2FP3ENAC (ugly, percent-encoded) vs persistentId=doi:10.5072/FK2/P3ENAC (clean, not-percent encoded) was at https://github.com/IQSS/dataverse/blob/18eb900214815b1c666508c70f714af9dab402f9/src/main/webapp/filesFragment.xhtml#L221 and uses h:outputLink, like this:

<h:outputLink id="fileNameLink" value="#{widgetWrapper.wrapURL('/file.xhtml')}">
    <f:param name="persistentId" value="#{fileMetadata.dataFile.globalId}"/>
    <f:param name="version" value="#{fileMetadata.datasetVersion.friendlyVersionNumber}"/>
    <h:outputText value="#{fileMetadata.label}" />
</h:outputLink>

I can easily reproduce the same behavior with a more minimal h:outputLink example with hard-coded values:

<h:outputLink id="testOutputText" value="/dataset.xhtml">
    <f:param name="persistentId" value="doi:10.5072/FK2/LNDANK"/>
    <h:outputText value="h:outputLink does percent encoding" />
</h:outputLink>

https://stackoverflow.com/questions/24733959/houtputlink-value-escaped seems related.

The solution I implemented in 2ba7d9a was to replace the h:outputLink version with a simple a href version:

<a href="/file.xhtml?persistentId=#{fileMetadata.dataFile.globalId}&amp;version=#{fileMetadata.datasetVersion.friendlyVersionNumber}">
    #{fileMetadata.label}
</a>

I did look briefly at similar behavior on "create dataset" but the code looks complicated uses a different JSF/Primefaces tag (p:commandButton) and I don't think we should increase the scope of this issue any more.

pdurbin commented 6 years ago

@scolapasta noticed that I removed the call to widgetWrapper.wrapURL so I just added it back in 76982ce

pdurbin commented 6 years ago

Whoops, re-added with query params in the right place at b637c43

pdurbin commented 6 years ago

At standup I mentioned that the "Download URL" link on the file landing page still shows a number like this:

screen shot 2017-12-18 at 2 21 17 pm

We agreed that this should be a DOI instead but I poked around at the code a bit and the change is more extensive than I expected. @sekmiller and I agreed that I'd push a commit to a new branch so he could get a sense of what the change would look like. Here it is: 4f62855

kcondon commented 6 years ago

Functional changes/ updates to be tested:

APIs are updated to support file doi: Basic access URI: /api/access/datafile/$id All formats tab data bundle download: /api/access/datafile/bundle/$id Metadata/ddi access: /api/access/datafile/$id/metadata/ddi Restrict a file: PUT http://$SERVER/api/files/{id}/restrict Replace a file: POST http://$SERVER/api/files/{id}/replace?key=$apiKey Out of scope: download multiple files, anything on the /meta path, sword delete file, sword get list of files.
Configurable change in form of file doi: INSERT INTO setting( name, content) VALUES (':DataFilePIDFormat', 'DEPENDENT');

INSERT INTO setting( name, content) VALUES (':IdentifierGenerationStyle', 'sequentialNumber');

They work together: DEPENDENT, or INDEPENTDENT determines whether the file doi uses /identifier or just generates a standalone doi as it does now. sequentialNumber was added to support SBGrid but default is random, need to find the keyword for that. All of this should now be doc'd.

IQSS / dataverse

Files: Persistent Identifiers for Files (DOIs for files) #2438