Closed eaquigley closed 6 years ago
For the creation of PIDs for DataFiles we added a new setting, ":DataFilePIDFormat"; the default value is DEPENDENT. With that value, the PID of any DataFile prepends its Dataset's identifier onto the file identifier. That is, if the Dataset DOI is doi:10.5072/FK2/ABC123, the datafiles in that dataset will have DOIs that look like this: doi:10.5072/FK2/ABC123/XYZ789, doi:10.5072/FK2/ABC123/QRS456, etc.
In order not to prepend the dataset identifier, the setting should be set to INDEPENDENT. In that case the file DOIs will not include the dataset identifier first, so in the example above the file DOIs would be: doi:10.5072/FK2/XYZ789, doi:10.5072/FK2/QRS456.
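For illustration, the two formats can be sketched like this (buildFilePid and its parameters are hypothetical names for this sketch, not the actual Dataverse code, which lives in the id service beans):

```java
// Sketch of DEPENDENT vs INDEPENDENT file PID construction.
// All names here are illustrative, not the real Dataverse API.
public class FilePidExample {

    // DEPENDENT: the dataset identifier is prepended to the file identifier.
    // INDEPENDENT: the file identifier hangs directly off the authority.
    public static String buildFilePid(String protocol, String authority,
                                      String datasetIdentifier, String fileIdentifier,
                                      boolean dependent) {
        if (dependent) {
            return protocol + ":" + authority + "/" + datasetIdentifier + "/" + fileIdentifier;
        }
        return protocol + ":" + authority + "/" + fileIdentifier;
    }
}
```

With the dataset from the example above (authority 10.5072/FK2, dataset identifier ABC123), the dependent form yields doi:10.5072/FK2/ABC123/XYZ789 and the independent form doi:10.5072/FK2/XYZ789.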
@kcondon, the setting for random/sequential is ":IdentifierGenerationStyle"
I mean the non-sequential value for :IdentifierGenerationStyle: randomString
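Roughly, the two styles could be sketched like this (the alphabet, length, and method names here are illustrative assumptions, not the production generator):

```java
import java.security.SecureRandom;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the two :IdentifierGenerationStyle values.
public class IdentifierStyleExample {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final AtomicLong COUNTER = new AtomicLong(0);

    // randomString: a short opaque string, e.g. "XYZ789"
    public static String randomString(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // sequentialNumber: the next number after the maximum existing identifier;
    // in the real code that maximum comes from a query like
    // getMaximumExistingDatafileIdentifier, not an in-memory counter.
    public static String sequentialNumber() {
        return Long.toString(COUNTER.incrementAndGet());
    }
}
```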
Found an issue using Dependent/Sequential. Dataset works, adding files throws exception: A system exception occurred during an invocation on EJB DatasetServiceBean, method: public java.lang.Long edu.harvard.iq.dataverse.DatasetServiceBean.getMaximumExistingDatafileIdentifier(edu.harvard.iq.dataverse.Dataset)
Note that all other combinations work: Dependent/random, Independent/random, Independent/sequential
@kcondon good catch. Fixed in 15771f3
Issues with new id structure:
[2017-12-21T13:23:53.639-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.DatasetPage] [tid: _ThreadID=52 _ThreadName=jk-connector(3)] [timeMillis: 1513880633639] [levelValue: 1000] [[ CommandException, when attempting to update the dataset: Can't execute command edu.harvard.iq.dataverse.engine.command.impl.CreateDataFileCommand@5587b40d, because request [DataverseRequest user:[AuthenticatedUser identifier:@kc122017]@140.247.116.74] is missing permissions [EditDataset] on Object depseq]]
[2017-12-21T13:23:53.668-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.DatasetPage] [tid: _ThreadID=51 _ThreadName=jk-connector(2)] [timeMillis: 1513880633668] [levelValue: 900] [[ No such dataset: doi:10.5072/FK2/8]]
[2017-12-21T14:33:00.904-0500] [glassfish 4.1] [WARNING] [] [javax.enterprise.resource.webcontainer.jsf.lifecycle] [tid: _ThreadID=57 _ThreadName=jk-connector(4)] [timeMillis: 1513884780904] [levelValue: 900] [[
javax.faces.FacesException: #{FilePage.init}: java.lang.NullPointerException
As of 10dab22 (prior to the fix I mentioned above, 15771f3), after running curl http://localhost:8080/api/admin/settings/:IdentifierGenerationStyle -X PUT -d sequentialNumber
I'm getting an error when creating a dataset via GUI (works via API):
[2017-12-21T13:26:58.186-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.DatasetPage] [tid: _ThreadID=142 _ThreadName=http-listener-1(3)] [timeMillis: 1513880818186] [levelValue: 1000] [[ CommandException, when attempting to update the dataset: Dataset with identifier ‘4’, protocol ‘doi’ and authority ‘10.5072/FK2’ already exists]]
It seems to be coming from this part of CreateDatasetCommand
if ((importType != ImportType.MIGRATION && importType != ImportType.HARVEST) && !ctxt.datasets().isIdentifierUniqueInDatabase(theDataset.getIdentifier(), theDataset, idServiceBean)) {
throw new IllegalCommandException(String.format("Dataset with identifier '%s', protocol '%s' and authority '%s' already exists",
theDataset.getIdentifier(), theDataset.getProtocol(), theDataset.getAuthority()),
this);
}
I'm not sure what's going on. I may drop my database and see if the error goes away.
I can't reproduce the error after dropping my database but I took a dump of it in case it's helpful: dataverse_db.sql.txt
While I was looking around in DatasetServiceBean.java I noticed that there's a query operating against the dataset table, but I think it's supposed to be using the dvobject table instead, since columns like identifier, protocol, and authority were moved there:
public boolean isIdentifierUniqueInDatabase(String userIdentifier, Dataset dataset, IdServiceBean idServiceBean) {
String query = "SELECT d FROM Dataset d WHERE d.identifier = '" + userIdentifier + "'";
query += " and d.protocol ='" + dataset.getProtocol() + "'";
query += " and d.authority = '" + dataset.getAuthority() + "'";
I'm not sure if this has anything to do with the bugs above, but I thought I'd at least mention it.
Actually, somewhere else in the code this works fine, though I'm not sure why, since those columns were moved to the dvobject table: String queryStr = "SELECT s from Dataset s where s.identifier = :identifier and s.protocol= :protocol and s.authority= :authority". Dataset extends DvObject, so I guess JPA is able to find the columns through the inheritance mapping or something.
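Incidentally, if that concatenated query sticks around, a named-parameter version of it (a sketch; the real method would also need the em.createQuery call and the idServiceBean handling) would sidestep both the quoting and any injection concerns:

```java
// Sketch: the string-concatenated JPQL above rewritten with named parameters.
// In the bean this would be passed to em.createQuery(...) followed by
// setParameter() calls for identifier, protocol, and authority. Since
// Dataset extends DvObject, JPA resolves those fields even though the
// columns physically live in the dvobject table.
public class IdentifierQueryExample {
    public static final String UNIQUE_IDENTIFIER_QUERY =
            "SELECT d FROM Dataset d"
            + " WHERE d.identifier = :identifier"
            + " AND d.protocol = :protocol"
            + " AND d.authority = :authority";
}
```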
Fixed this issue which was caused by the CreateDataFileCommand having a required permission of EditDataset. Now it will require addDataset or EditDataset depending on whether files are added on create or edit.
Issues with new id structure:
Adding files on dataset create results in a 500 error; this happens in all combinations of configs (same CommandException as in the log above: missing [EditDataset] permission on Object depseq).
All in all looks good. Back to @ferrys for some clean-ups based on GitHub code comments.
Remaining issues:
-Need to combine the old v4.8.4-to-v4.9 script into the new v4.8.5-to-v4.9 script.
-Need to document the api endpoint to retroactively provide DOIs for existing files.
-Performance issue with a large number of files: 10 files: 6s, 50 files: 26s, 1000 files: 8 mins.
-The 2 config settings could maybe be referenced in the DOI section rather than simply listed: :IdentifierGenerationStyle, :DataFilePIDFormat.
-Recently discovered that harvesting (import), export, and indexing do not account for the new file DOI metadata. Harvest should continue to work as it does now (currently testing), but the net result would be that file DOIs would not be searchable. It was suggested these be opened as separate tickets.
I updated the migration script as requested. Also, I think that the documentation for adding DOIs to existing files belongs in the release notes. It's not something that anyone would run outside of a deployment.
Had a discussion after standup about the performance hit. We should resolve the performance as part of this. We'll do a technical investigation first. @scolapasta is going to check in with some PID-knowledgeable folks.
I did a quick run-through with package files (fresh VM). The create API works fine; the DCM success path works fine; the package file is visible on the dataset page under the files tab.
Hovering over the URL shows a url of files.xhtml?persistentId=null:null/null&version=DRAFT; unsurprisingly, clicking this link leads to an error. Checking the dvobject and datafile tables shows that the DB doesn't have an identifier for the package file.
Did not proceed with the run-through (submit for review, return from review, publish w/ RSAL, etc.).
Thanks for the discussion after standup about this.
can't wait for @landreev's investigation results!
On Thu, Feb 1, 2018 at 11:39 AM, Danny Brooke notifications@github.com wrote:
Thanks for the discussion after standup about this.
- Is the cause of the performance issues the communication with the DOI providers? Or is it due to changes in the code that we made? @landreev will investigate this.
- We do want PIDs for package files, even though it's somewhat redundant right now. In the future, package files will be able to live alongside HTTP-uploaded files. This is an easy change.
As of 915c9fa1ff2208e3b0f9ab585ec37bb110651566 , package files get PIDs - thanks @sekmiller
Had a quick discussion after standup today:
To reiterate, I'm only investigating one specific area - whether the performance degradation we are seeing is entirely/mostly due to having to make those registration calls to the naming authority. We have reasons to assume that this is the case; but I wanted to make sure. (We used to do certain things - moving files around, etc., in single batch passes; now these tasks have been moved into the CreateFileCommand, which is called separately for each file... so there is at least some potential for a performance hit). Will report shortly.
One idea, as a followup to the discussion we've just had, about the global ids vs. export:
Also, indexing global ids does have some value - I was wrong on that. Searching for an individual global id, once you already know it, doesn't make sense. BUT searching for all the files in a specific name space is often useful.
The Terms of Use issue: #2911 The issue is 2 years old, but there's a recent comment from a specific partner, indicating that this is causing a problem for them.
@landreev For JSON and DDI, I'd thought the recommendation for machine-readable formats was to use doi: instead of http:. But if I'm wrong, we should probably be using https instead of http.
The guidance on this is confusing, but I'd essentially never use doi: anymore. So either the format allows something like <identifier identifierType="DOI">10.3886/ICPSR03674.V2</identifier> (from DataCite XML here), where DOI is specified in an attribute, or you use the full resolved DOI in the form https://doi.org/<doi>, as DataCite does in their JSON-LD:
"@id": "https://doi.org/10.3886/icpsr03674.v2",
Hmm. We've been using http://dx.doi.org/10.7910/DVN/XXX (same for the handle equivalents) everywhere. What's the importance of https there? They are just links to the global resolver... The actual registered urls that these resolve to are https urls of our dataset pages.
Detailed reason, including a tl;dr at the top, here: https://www.crossref.org/blog/linking-dois-using-https-the-background-to-our-new-guidelines/
The dx. isn't really problematic (and will keep working indefinitely), but it's unnecessary and discouraged by the DOI Foundation, CrossRef, and DataCite, so you should take that out as well.
@adam3smith and everybody: For the global id of the dataset we've been/we'll continue using
<IDNo agency="DOI">doi:10.7910/DVN/XXXXX</IDNo>
in the DDI. So we are leaving it to the user/software client to construct a resolvable url out of this identifier. I've been talking about the URLs we supply in the individual file sections further down:
<fileDscr ID="fYYYYY" URI="https://dataverse.harvard.edu/api/access/datafile/YYYYY">
this is the direct access API url; and YYYYY is the internal database id of the file. (the "ID" attribute is the XML id that references the section; it's an opaque string - we don't need to worry about using a meaningful identifier there) So I've been proposing replacing it with
<fileDscr ID="fYYYYY" URI="http://dx.doi.org/10.7910/DVN/ZZZZZ">
where 10.7910/DVN/ZZZZZ is the registered DOI for the file... So I'm not even suggesting we mess with how we are encoding any identifiers; we'd be replacing one URL with another...
... And I'm not opposed to still adding the "doi:10.7910/DVN/ZZZZZ" version of the id somewhere else in that "fileDscr" section... Not sure where - the DDI does not seem to have a dedicated field for the identifier of an individual file... Could be a note:
<note type="global id, file">doi:10.7910/DVN/ZZZZZ</note>
@landreev - <fileDscr ID="fYYYYY" URI="http://dx.doi.org/10.7910/DVN/ZZZZZ"> un-confused me; thanks. My suggestion about using doi:$foo wouldn't make sense in this context.
OK, I spent more time than I was planning on that performance investigation today... Comparing the develop branch and this branch, I can confirm that all the performance degradation in uploading and saving files does come from the registration calls. It sounds like doing them asynchronously would be something to explore...
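As a rough sketch of that asynchronous idea (all class and method names here are made up; registerPid is a stand-in for the real EZID/DataCite registration request):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of "register asynchronously": the upload/save path submits one
// background task per file instead of blocking on each registration call
// to the naming authority.
public class AsyncRegistrationSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final AtomicInteger registered = new AtomicInteger();

    // Stand-in for one (slow) registration call to the naming authority.
    void registerPid(String fileIdentifier) {
        registered.incrementAndGet();
    }

    // Submit one background task per file; in the real page the save path
    // would return immediately rather than waiting here.
    public int registerAllInBackground(List<String> fileIdentifiers) {
        for (String id : fileIdentifiers) {
            pool.submit(() -> registerPid(id));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return registered.get();
    }
}
```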
However, it appears that the EditDatafiles/upload files page is very seriously broken, already in the dev. branch, independently of the individual global ids for files.
One standard test we used to run when optimizing that page was uploading and saving 1000 very small files. Not a very realistic test, but useful for detecting serious performance penalties per single file. It used to work very fast. It no longer works at all.
Something got broken in the rendering of the files list. The 1000 files still upload fairly quickly. Then the page starts rendering the file list, adding the files one by one and fully re-rendering the entire list every time: it renders one file, erases the list, adds the second file, renders 2 files, erases... By the 300th or so file the page is unusable.
So basically it's something you're not going to notice when uploading 1 or 3 files at a time. But it makes it impossible to upload large numbers of files at the same time.
I don't know when it got broken. Some work was done on the page last summer; and some just last month.
I don't think fixing this belongs in this ticket (totally unrelated to the file identifiers). But I feel like it needs to be fixed.
Thanks @landreev. I saw that you created #4452 to cover the files list performance issue and we'll handle it there. We should address the registration-call related performance issues before moving this one to done.
@sekmiller had tweaked BatchImportIT in 02bf719 and asked me to take a look. I made some improvements to the roundTripDdi method in b8090f0.
(adding @scolapasta) There is definitely a reproducible database update conflict, when some of the files in question are being ingested, in parallel with the global id assignment. Even more importantly, there's another issue that @sekmiller identified yesterday - when a user clicks "publish" immediately after uploading the files, there can be a scenario of 2 global id assignment jobs racing against one another.
The only clean way of doing this seems to be to handle both the global id assignment and ingest as one "post upload" task, running them in sequence and using the same ingest lock for the duration of the combined task.
Let's discuss, if necessary.
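A minimal sketch of that combined "post upload" task, assuming a single shared lock object standing in for the real dataset ingest lock (all names below are illustrative, not the actual command code):

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the "one post-upload task" idea: global id assignment and ingest
// run sequentially under a single lock held for the duration of the combined
// task, so the two jobs can never race against one another.
public class PostUploadTaskSketch {
    private final ReentrantLock datasetLock = new ReentrantLock();
    private final StringBuilder log = new StringBuilder();

    void assignGlobalIds(List<String> files) { log.append("ids;"); }
    void ingest(List<String> files) { log.append("ingest;"); }

    // Both steps share the same lock instead of each step
    // locking and unlocking independently.
    public String runPostUpload(List<String> files) {
        datasetLock.lock();
        try {
            assignGlobalIds(files);
            ingest(files);
        } finally {
            datasetLock.unlock();
        }
        return log.toString();
    }
}
```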
In light of this, I would maybe like to hear this explained one more time - why do we insist on having these ids assigned on upload? It still seems like a potentially sensible approach to leave it till publish.
I made a commit disabling the asynchronous job for persistent ids. @sekmiller do you need to check in something else into this branch, or is it ready?
Also, I reviewed the implications of the functionality in this branch for harvested files. We do NOT want to assign persistent identifiers to harvested DataFiles, ever. It appears that we are safe. Harvested files are never created with CreateDataFileCommand outside CreateDatasetCommand, and the latter always calls the former with skipAssigningIdentifier=true, like this:
ctxt.engine().submit(new CreateDataFileCommand(dataFile, dsv, getRequest(), dataFileIdentifier, true));//skip registration
@sekmiller so which test(s) is failing now?
On FilesIT, test_AddFileBadUploadFormat was failing - it's really testing that you shouldn't be able to upload a file via the API when the upload method is rsync and a script is absent. After consulting with Phil I commented out the test, because it really shouldn't be run without the DCM running in the background.
So now all tests are passing.
@kcondon @scolapasta asked me to help review the parts of the PR that specifically deal with locks. From looking at the code, it all looks good. But, as part of QA, please double-check that the Dataset page properly refreshes itself after the dataset becomes unlocked; (either after ingests finish, or when the persistent id registration is complete). That auto-refreshing has always been tricky. So every time we touch the locks, there is at least a possibility for it to stop working properly.
At standup I mentioned that pull request #4350 had merge conflicts. I merged the latest from develop into that branch which resulted in the SQL script getting renamed. In 639562f I fixed the SQL scripts so that upgrade_v4.8.5_to_v4.8.6.sql only has the 4.8.6 updates and a new script has the updates for 4.9.
ISSUES FOUND SINCE ASYNC REGISTRATION ON PUBLISH WAS IMPLEMENTED:
X -No file cards on file upload/save to dataset.
X -Exception and inability to publish dataset:
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=datafile_49] unknown field 'filePersistentId
X -For every dataset card displayed on the page, the following error appears in the log:
[2018-04-03T14:29:31.431-0400] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.util.BundleUtil] [tid: _ThreadID=51 _ThreadName=jk-connector(2)] [timeMillis: 1522780171431] [levelValue: 900] [[
Could not find key "advanced.search.datasets.persistentId" in bundle file.]]
X -Info banner about file DOI registration disappears when refreshing the page even though the lock is still present / registration not finished.
-General performance issues with a large number of files. They encompass the issues mentioned below, but with timed actions as file counts increase, for evaluation; see: https://docs.google.com/spreadsheets/d/1TzKFLy01UoUjWL0PBMJ_WDLOoqF9HnnFgUVBSVDX5io/edit?usp=sharing
X -1000 files takes 26 minutes to save, versus 1 min when added after dataset create on the same branch, or 1 minute on /develop.
-1000 files takes 5 minutes to upload (selected, not zip) on dataset create, versus 2 mins on the same branch to an already created dataset. Note that this issue exists in /develop.
-Publishing any new version of a dataset with a minor metadata change but a large number of existing files takes a long time, about 1/3 the time of the initial publish, to register files, likely due to needing to update file metadata at the DOI provider.
X -Publishing intermittently fails to display the info banner and clear the lock when done. I have seen this happen when there is also a server log error about a null ptr in the db conn pool. Publish does eventually complete but the lock remains. A Google search shows this may result from invalid db conns still in the connection pool. The log file shows:
[2018-04-03T18:41:08.253-0400] [glassfish 4.1] [INFO] [] [edu.harvard.iq.dataverse.workflow.WorkflowServiceBean] [tid: _ThreadID=49 _ThreadName=jk-connector(1)] [timeMillis: 1522795268253] [levelValue: 800] [[
Searching for workflow step providers...]]
[2018-04-03T18:41:08.379-0400] [glassfish 4.1] [SEVERE] [poolmgr.system_exception] [javax.enterprise.resource.resourceadapter.com.sun.enterprise.resource] [tid: _ThreadID=49 _ThreadName=jk-connector(1)] [timeMillis: 1522795268379] [levelValue: 1000] [[
RAR5031:System Exception
java.lang.NullPointerException
The solr error looks like a missing schema update. "no file cards on file upload" - this could also be a result of merging this branch with my fix for the editfiles fragment?
Retracting what I just said about the file card - I thought you were talking about the file entry on the dataset page. If it was a search card then yes, it must be the solr schema issue.
@scolapasta @sekmiller
Regarding the latest problem Kevin is seeing/reporting, above - that the performance is back to unacceptable, but only when large numbers of files are uploaded on Create (1000 small files take 26 min. to save in Create Dataset mode, but only a number of seconds when uploading and saving the same files in an existing dataset). It really looks like this is the result of calling CreateDataFileCommand from inside CreateDatasetCommand. When files are added to existing datasets, CreateDataFileCommand is called by the page (and not from inside UpdateDatasetCommand). It looks like calling a command from inside another command results in the EJB trying to do everything as one big transaction, and the transaction overhead gets worse (exponentially?) with the number of nested objects. So when uploading 100 files on Create, it is still acceptably fast (some fraction of a second per file); at 1000 files, it appears to be taking 2 seconds for each CreateDataFileCommand.
I'm guessing the solution is to move the CreateDataFile command out of CreateDatasetCommand. Save the dataset first, without files, then call CreateDataFile command for every file separately - like we do when files are added to existing datasets.
Alternatively, maybe we could just junk the CreateDataFile command (it's new in this branch); now that we don't want to create/register file ids on create.
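The proposed flattening might look roughly like this (a toy sketch; the real code goes through the command engine and its transactions, and all names below are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the refactor: instead of CreateDatasetCommand submitting
// CreateDataFileCommand for each file inside its own transaction (one huge
// nested transaction), the caller saves the dataset first and then issues
// one small command per file, as the edit-files path already does.
public class FlattenedCreateSketch {
    static class Dataset { final List<String> files = new ArrayList<>(); }

    static Dataset createDataset() { return new Dataset(); }              // short txn 1
    static void createDataFile(Dataset ds, String f) { ds.files.add(f); } // short txn per file

    public static Dataset createDatasetWithFiles(List<String> files) {
        Dataset ds = createDataset();  // commit the dataset on its own
        for (String f : files) {
            createDataFile(ds, f);     // each file in its own short transaction
        }
        return ds;
    }
}
```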
I refactored the CreateDatasetCommand so that it doesn't call the CreateDataFileCommand in a nested fashion. When I tried the "1000 files" test, creating the Dataset took less than a minute and publishing it took less than 10 minutes; the Dataset Page was locked throughout the publish process and came back automatically when the publish completed.
Adding relevant info about the current state of the branch; this is from @sekmiller via slack on Fri.:
we've seen a database connection error when the number of files on a dataset is well over 500. The logs show errors like: Severe: RAR5031:System Exception
java.lang.NullPointerException
at com.sun.enterprise.resource.ConnectorXAResource.getResourceHandle(ConnectorXAResource.java:246)
at com.sun.enterprise.resource.ConnectorXAResource.end(ConnectorXAResource.java:159)
at com.sun.enterprise.transaction.JavaEETransactionManagerSimplified.delistResource(JavaEETransactionManagerSimplified.java:527)
If you wait long enough, the dataset will eventually publish and all of the files will be registered; however, the page is never locked for the user, and at the end of the process the dataset gets its lock and it doesn't go away.
I put a hack into the DatasetPage init such that if a dataset is locked for PIDRegistration but there's no draft version, we try to remove the lock.
Is there a better way to approach this?
To me this looks very wrong. And I don't feel this is really about the right way of handling an exception - why are we getting that exception in the first place? Specifically, a database connection error? Are we somehow trying to register the file DOIs for the 500+ files all at the same time, instead of doing it sequentially in the background? Steve also said that
They at least get identifiers in the database. I didn't check that they are all actually registered at the EZID admin console, but a spot check shows that all the ones I have looked for there have been registered.
I also added two specific change requests in the review: one for removing an Async attribute that doesn't seem to serve any purpose, and another for the part in the FinalizeDatasetPublicationCommand where we appear to be trying to delete the same lock twice... No good reason to think either of these 2 things is causing trouble, but who knows.
A quick summary, of the latest developments:
Tested the above checklist from Leonid and all works as described. Also nearly completed the test checklist: https://docs.google.com/document/d/1VOlDJFRy7zoS4LVDerH5TgMYsr3DKkm9ZRNcqajitxY/edit?usp=sharing. All works for EZID, but Handles and DataCite do not register ids: Handle does not create them; DataCite now creates but does not register, and is slow.
Since we are moving towards individual pages for files, we also need to think about what the persistent identifier will be for them.