Closed eaquigley closed 6 years ago
For the creation of PIDs for DataFiles we added a new setting, ":DataFilePIDFormat"; the default value is DEPENDENT. With that value, the PID of any DataFile prepends its Dataset's identifier onto the file identifier. That is, if the Dataset DOI is doi:10.5072/FK2/ABC123, the datafiles in that dataset will have DOIs that look like this: doi:10.5072/FK2/ABC123/XYZ789, doi:10.5072/FK2/ABC123/QRS456, etc.
In order not to prepend the dataset identifier, the setting should be set to INDEPENDENT. In that case the file DOIs will not include the dataset identifier first, so in the example above the file DOIs would be: doi:10.5072/FK2/XYZ789, doi:10.5072/FK2/QRS456.
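For illustration, the two formats can be sketched like this (buildFilePid and its parameters are hypothetical names for this sketch, not the actual Dataverse code, which lives in the id service beans):

```java
// Sketch of DEPENDENT vs INDEPENDENT file PID construction.
// All names here are illustrative, not the real Dataverse API.
public class FilePidExample {

    // DEPENDENT: the dataset identifier is prepended to the file identifier.
    // INDEPENDENT: the file identifier hangs directly off the authority.
    public static String buildFilePid(String protocol, String authority,
                                      String datasetIdentifier, String fileIdentifier,
                                      boolean dependent) {
        if (dependent) {
            return protocol + ":" + authority + "/" + datasetIdentifier + "/" + fileIdentifier;
        }
        return protocol + ":" + authority + "/" + fileIdentifier;
    }
}
```

With the dataset from the example above (authority 10.5072/FK2, dataset identifier ABC123), the dependent form yields doi:10.5072/FK2/ABC123/XYZ789 and the independent form doi:10.5072/FK2/XYZ789.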
@kcondon, the setting for random/sequential is ":IdentifierGenerationStyle"
I mean the non-sequential value for :IdentifierGenerationStyle: randomString
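Roughly, the two styles could be sketched like this (the alphabet, length, and method names here are illustrative assumptions, not the production generator):

```java
import java.security.SecureRandom;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the two :IdentifierGenerationStyle values.
public class IdentifierStyleExample {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final AtomicLong COUNTER = new AtomicLong(0);

    // randomString: a short opaque string, e.g. "XYZ789"
    public static String randomString(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // sequentialNumber: the next number after the maximum existing identifier;
    // in the real code that maximum comes from a query like
    // getMaximumExistingDatafileIdentifier, not an in-memory counter.
    public static String sequentialNumber() {
        return Long.toString(COUNTER.incrementAndGet());
    }
}
```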
Found an issue using Dependent/Sequential. Dataset works, adding files throws exception: A system exception occurred during an invocation on EJB DatasetServiceBean, method: public java.lang.Long edu.harvard.iq.dataverse.DatasetServiceBean.getMaximumExistingDatafileIdentifier(edu.harvard.iq.dataverse.Dataset)
Note that all other combinations work: Dependent/random, Independent/random, Independent/sequential
@kcondon good catch. Fixed in 15771f3
Issues with new id structure:
[2017-12-21T13:23:53.639-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.DatasetPage] [tid: _ThreadID=52 _ThreadName=jk-connector(3)] [timeMillis: 1513880633639] [levelValue: 1000] [[ CommandException, when attempting to update the dataset: Can't execute command edu.harvard.iq.dataverse.engine.command.impl.CreateDataFileCommand@5587b40d, because request [DataverseRequest user:[AuthenticatedUser identifier:@kc122017]@140.247.116.74] is missing permissions [EditDataset] on Object depseq]]
[2017-12-21T13:23:53.668-0500] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.DatasetPage] [tid: _ThreadID=51 _ThreadName=jk-connector(2)] [timeMillis: 1513880633668] [levelValue: 900] [[ No such dataset: doi:10.5072/FK2/8]]
[2017-12-21T14:33:00.904-0500] [glassfish 4.1] [WARNING] [] [javax.enterprise.resource.webcontainer.jsf.lifecycle] [tid: _ThreadID=57 _ThreadName=jk-connector(4)] [timeMillis: 1513884780904] [levelValue: 900] [[
javax.faces.FacesException: #{FilePage.init}: java.lang.NullPointerException
As of 10dab22 (prior to the fix I mentioned above, 15771f3), after running curl http://localhost:8080/api/admin/settings/:IdentifierGenerationStyle -X PUT -d sequentialNumber
I'm getting an error when creating a dataset via GUI (works via API):
[2017-12-21T13:26:58.186-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.DatasetPage] [tid: _ThreadID=142 _ThreadName=http-listener-1(3)] [timeMillis: 1513880818186] [levelValue: 1000] [[ CommandException, when attempting to update the dataset: Dataset with identifier ‘4’, protocol ‘doi’ and authority ‘10.5072/FK2’ already exists]]
It seems to be coming from this part of CreateDatasetCommand
if ((importType != ImportType.MIGRATION && importType != ImportType.HARVEST) && !ctxt.datasets().isIdentifierUniqueInDatabase(theDataset.getIdentifier(), theDataset, idServiceBean)) {
throw new IllegalCommandException(String.format("Dataset with identifier '%s', protocol '%s' and authority '%s' already exists",
theDataset.getIdentifier(), theDataset.getProtocol(), theDataset.getAuthority()),
this);
}
I'm not sure what's going on. I may drop my database and see if the error goes away.
I can't reproduce the error after dropping my database but I took a dump of it in case it's helpful: dataverse_db.sql.txt
While I was looking around in DatasetServiceBean.java I noticed that there's a query operating against the dataset table, but I think it's supposed to be using the dvobject table instead, since columns like identifier, protocol, and authority were moved there:
public boolean isIdentifierUniqueInDatabase(String userIdentifier, Dataset dataset, IdServiceBean idServiceBean) {
String query = "SELECT d FROM Dataset d WHERE d.identifier = '" + userIdentifier + "'";
query += " and d.protocol ='" + dataset.getProtocol() + "'";
query += " and d.authority = '" + dataset.getAuthority() + "'";
I'm not sure if this has anything to do with the bugs above, but I thought I'd at least mention it.
Actually, somewhere else in the code this works fine, though I'm not sure why, since those columns were moved to the dvobject table: String queryStr = "SELECT s from Dataset s where s.identifier = :identifier and s.protocol= :protocol and s.authority= :authority". Dataset extends DvObject, so I guess JPA is able to find the columns through the inheritance mapping or something.
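Incidentally, if that concatenated query sticks around, a named-parameter version of it (a sketch; the real method would also need the em.createQuery call and the idServiceBean handling) would sidestep both the quoting and any injection concerns:

```java
// Sketch: the string-concatenated JPQL above rewritten with named parameters.
// In the bean this would be passed to em.createQuery(...) followed by
// setParameter() calls for identifier, protocol, and authority. Since
// Dataset extends DvObject, JPA resolves those fields even though the
// columns physically live in the dvobject table.
public class IdentifierQueryExample {
    public static final String UNIQUE_IDENTIFIER_QUERY =
            "SELECT d FROM Dataset d"
            + " WHERE d.identifier = :identifier"
            + " AND d.protocol = :protocol"
            + " AND d.authority = :authority";
}
```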
Fixed this issue which was caused by the CreateDataFileCommand having a required permission of EditDataset. Now it will require addDataset or EditDataset depending on whether files are added on create or edit.
Issues with new id structure:
Adding files on dataset create results in a 500 error; this happens in all combinations of configs (same CommandException as in the log above: missing [EditDataset] permission on Object depseq).
All in all looks good. Back to @ferrys for some clean-ups based on GitHub code comments.
Remaining issues:
-Need to combine the old v4.8.4-to-v4.9 script into the new v4.8.5-to-v4.9 script.
-Need to document the api endpoint to retroactively provide DOIs for existing files.
-Performance issue with a large number of files: 10 files: 6s, 50 files: 26s, 1000 files: 8 mins.
-The 2 config settings could maybe be referenced in the DOI section rather than simply listed: :IdentifierGenerationStyle, :DataFilePIDFormat.
-Recently discovered that harvesting (import), export, and indexing do not account for the new file DOI metadata. Harvest should continue to work as it does now (currently testing), but the net result would be that file DOIs would not be searchable. It was suggested these be opened as separate tickets.
I updated the migration script as requested. Also, I think that the documentation for adding DOIs to existing files belongs in the release notes. It's not something that anyone would run outside of a deployment.
Had a discussion after standup about the performance hit. We should resolve the performance as part of this. We'll do a technical investigation first. @scolapasta is going to check in with some PID-knowledgeable folks.
I did a quick run-through with package files (fresh VM). The create API works fine; the DCM success path works fine; the package file is visible on the dataset page under the files tab.
Hovering over the URL shows a url of files.xhtml?persistentId=null:null/null&version=DRAFT; unsurprisingly, clicking this link leads to an error. Checking the dvobject and datafile tables shows that the DB doesn't have an identifier for the package file.
Did not proceed with the run-through (submit for review, return from review, publish w/ RSAL, etc.).
Thanks for the discussion after standup about this.
can't wait for @landreev's investigation results!
On Thu, Feb 1, 2018 at 11:39 AM, Danny Brooke notifications@github.com wrote:
Thanks for the discussion after standup about this.
- Is the cause of the performance issues the communication with the DOI providers? Or is it due to changes in the code that we made? @landreev will investigate this.
- We do want PIDs for package files, even though it's somewhat redundant right now. In the future, package files will be able to live alongside HTTP-uploaded files. This is an easy change.
As of 915c9fa1ff2208e3b0f9ab585ec37bb110651566 , package files get PIDs - thanks @sekmiller
Had a quick discussion after standup today:
To reiterate, I'm only investigating one specific area - whether the performance degradation we are seeing is entirely/mostly due to having to make those registration calls to the naming authority. We have reasons to assume that this is the case; but I wanted to make sure. (We used to do certain things - moving files around, etc., in single batch passes; now these tasks have been moved into the CreateFileCommand, which is called separately for each file... so there is at least some potential for a performance hit). Will report shortly.
One idea, as a followup to the discussion we've just had, about the global ids vs. export:
Also, indexing global ids does have some value - I was wrong on that. Searching for an individual global id, once you already know it, doesn't make sense. BUT searching for all the files in a specific name space is often useful.
The Terms of Use issue: #2911 The issue is 2 years old, but there's a recent comment from a specific partner, indicating that this is causing a problem for them.
@landreev For JSON and DDI, I'd thought the recommendation for machine-readable formats was to use doi: instead of http:. But if I'm wrong, we should probably be using https instead of http.
The guidance on this is confusing, but I'd essentially never use doi: anymore. So either the format allows something like <identifier identifierType="DOI">10.3886/ICPSR03674.V2</identifier> (from DataCite XML here), where DOI is specified in an attribute, or you use the full resolved DOI in the form https://doi.org/<doi>, as DataCite does in their JSON-LD:
"@id": "https://doi.org/10.3886/icpsr03674.v2",
Hmm. We've been using http://dx.doi.org/10.7910/DVN/XXX (same for the handle equivalents) everywhere. What's the importance of https there? They are just links to the global resolver... The actual registered urls that these resolve to are https urls of our dataset pages.
Detailed reason, including a tl;dr at the top, here: https://www.crossref.org/blog/linking-dois-using-https-the-background-to-our-new-guidelines/
The dx. isn't really problematic (and will keep working indefinitely), but it's unnecessary and discouraged by the DOI Foundation, CrossRef, and DataCite, so you should take that out as well.
@adam3smith and everybody: For the global id of the dataset we've been/we'll continue using
<IDNo agency="DOI">doi:10.7910/DVN/XXXXX</IDNo>
in the DDI. So we are leaving it to the user/software client to construct a resolvable url out of this identifier. I've been talking about the URLs we supply in the individual file sections further down:
<fileDscr ID="fYYYYY" URI="https://dataverse.harvard.edu/api/access/datafile/YYYYY">
this is the direct access API url; and YYYYY is the internal database id of the file. (the "ID" attribute is the XML id that references the section; it's an opaque string - we don't need to worry about using a meaningful identifier there) So I've been proposing replacing it with
<fileDscr ID="fYYYYY" URI="http://dx.doi.org/10.7910/DVN/ZZZZZ">
where 10.7910/DVN/ZZZZZ is the registered DOI for the file... So I'm not even suggesting we mess with how we are encoding any identifiers; we'd be replacing one URL with another...
... And I'm not opposed to still adding the "doi:10.7910/DVN/ZZZZZ" version of the id somewhere else in that "fileDscr" section... Not sure where - the DDI does not seem to have a dedicated field for the identifier of an individual file... Could be a note:
<note type="global id, file">doi:10.7910/DVN/ZZZZZ</note>
@landreev - <fileDscr ID="fYYYYY" URI="http://dx.doi.org/10.7910/DVN/ZZZZZ"> un-confused me; thanks. My suggestion about using doi:$foo wouldn't make sense in this context.
OK, I spent more time than I was planning on that performance investigation today... Comparing the develop branch and this branch, I can confirm that all the performance degradation in uploading and saving files does come from the registration calls. It sounds like doing them asynchronously would be something to explore...
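As a rough sketch of that asynchronous idea (all class and method names here are made up; registerPid is a stand-in for the real EZID/DataCite registration request):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of "register asynchronously": the upload/save path submits one
// background task per file instead of blocking on each registration call
// to the naming authority.
public class AsyncRegistrationSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final AtomicInteger registered = new AtomicInteger();

    // Stand-in for one (slow) registration call to the naming authority.
    void registerPid(String fileIdentifier) {
        registered.incrementAndGet();
    }

    // Submit one background task per file; in the real page the save path
    // would return immediately rather than waiting here.
    public int registerAllInBackground(List<String> fileIdentifiers) {
        for (String id : fileIdentifiers) {
            pool.submit(() -> registerPid(id));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return registered.get();
    }
}
```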
However, it appears that the EditDatafiles/upload files page is very seriously broken, already in the dev. branch, independently of the individual global ids for files.
One standard test we used to run when optimizing that page was uploading and saving 1000 very small files. Not a very realistic test, but useful for detecting serious performance penalties per single file. It used to work very fast. It no longer works at all.
Something got broken in the rendering of the files list. The 1000 files still upload fairly quickly. Then the page starts rendering the file list, adding the files one by one and fully re-rendering the entire list every time: it renders one file, erases the list, adds the second file, renders 2 files, erases... By the 300th or so file the page is unusable.
So basically it's something you're not going to notice when uploading 1 or 3 files at a time. But it makes it impossible to upload large numbers of files at the same time.
I don't know when it got broken. Some work was done on the page last summer; and some just last month.
I don't think fixing this belongs in this ticket (totally unrelated to the file identifiers). But I feel like it needs to be fixed.
Thanks @landreev. I saw that you created #4452 to cover the files list performance issue and we'll handle it there. We should address the registration-call related performance issues before moving this one to done.
@sekmiller had tweaked BatchImportIT in 02bf719 and asked me to take a look. I made some improvements to the roundTripDdi method in b8090f0.
(adding @scolapasta) There is definitely a reproducible database update conflict, when some of the files in question are being ingested, in parallel with the global id assignment. Even more importantly, there's another issue that @sekmiller identified yesterday - when a user clicks "publish" immediately after uploading the files, there can be a scenario of 2 global id assignment jobs racing against one another.
The only clean way of doing this seems to be to handle both the global id assignment and ingest as one "post upload" task, running them in sequence and using the same ingest lock for the duration of the combined task.
Let's discuss, if necessary.
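A minimal sketch of that combined "post upload" task, assuming a single shared lock object standing in for the real dataset ingest lock (all names below are illustrative, not the actual command code):

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the "one post-upload task" idea: global id assignment and ingest
// run sequentially under a single lock held for the duration of the combined
// task, so the two jobs can never race against one another.
public class PostUploadTaskSketch {
    private final ReentrantLock datasetLock = new ReentrantLock();
    private final StringBuilder log = new StringBuilder();

    void assignGlobalIds(List<String> files) { log.append("ids;"); }
    void ingest(List<String> files) { log.append("ingest;"); }

    // Both steps share the same lock instead of each step
    // locking and unlocking independently.
    public String runPostUpload(List<String> files) {
        datasetLock.lock();
        try {
            assignGlobalIds(files);
            ingest(files);
        } finally {
            datasetLock.unlock();
        }
        return log.toString();
    }
}
```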
In light of this, I would maybe like to hear this explained one more time - why do we insist on having these ids assigned on upload? It still seems like a potentially sensible approach to leave it till publish.
I made a commit disabling the asynchronous job for persistent ids. @sekmiller do you need to check in something else into this branch, or is it ready?
Also, I reviewed the implications of the functionality in this branch for harvested files. We do NOT want to assign persistent identifiers to harvested DataFiles, ever. It appears that we are safe. Harvested files are never created with CreateDataFileCommand outside CreateDatasetCommand, and the latter always calls the former with skipAssigningIdentifier=true, like this:
ctxt.engine().submit(new CreateDataFileCommand(dataFile, dsv, getRequest(), dataFileIdentifier, true));//skip registration
@sekmiller so which test(s) is failing now?
On FilesIT, test_AddFileBadUploadFormat was failing - it's really testing that you shouldn't be able to upload a file via the API when the upload method is rsync and a script is absent. After consulting with Phil I commented out the test, because it really shouldn't be run without the DCM running in the background.
So now all tests are passing.
@kcondon @scolapasta asked me to help review the parts of the PR that specifically deal with locks. From looking at the code, it all looks good. But, as part of QA, please double-check that the Dataset page properly refreshes itself after the dataset becomes unlocked; (either after ingests finish, or when the persistent id registration is complete). That auto-refreshing has always been tricky. So every time we touch the locks, there is at least a possibility for it to stop working properly.
At standup I mentioned that pull request #4350 had merge conflicts. I merged the latest from develop into that branch which resulted in the SQL script getting renamed. In 639562f I fixed the SQL scripts so that upgrade_v4.8.5_to_v4.8.6.sql only has the 4.8.6 updates and a new script has the updates for 4.9.
ISSUES FOUND SINCE ASYNC REGISTRATION ON PUBLISH WAS IMPLEMENTED:
X -No file cards on file upload/save to dataset.
X -Exception and inability to publish dataset:
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=datafile_49] unknown field 'filePersistentId
X -For every dataset card displayed on the page, the following error appears in the log:
[2018-04-03T14:29:31.431-0400] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.util.BundleUtil] [tid: _ThreadID=51 _ThreadName=jk-connector(2)] [timeMillis: 1522780171431] [levelValue: 900] [[
Could not find key "advanced.search.datasets.persistentId" in bundle file.]]
X -Info banner about file DOI registration disappears when refreshing the page even though the lock is still present / registration not finished.
-General performance issues with a large number of files. They encompass the issues mentioned below, but with timed actions as file counts increase, for evaluation; see: https://docs.google.com/spreadsheets/d/1TzKFLy01UoUjWL0PBMJ_WDLOoqF9HnnFgUVBSVDX5io/edit?usp=sharing
X -1000 files takes 26 minutes to save, versus 1 min when added after dataset create on the same branch, or 1 minute on /develop.
-1000 files takes 5 minutes to upload (selected, not zip) on dataset create, versus 2 mins on the same branch to an already created dataset. Note that this issue exists in /develop.
-Publishing any new version of a dataset with a minor metadata change but a large number of existing files takes a long time, about 1/3 the time of the initial publish, to register files, likely due to needing to update file metadata at the DOI provider.
X -Publishing intermittently fails to display the info banner and clear the lock when done. I have seen this happen when there is also a server log error about a null ptr in the db conn pool. Publish does eventually complete but the lock remains. A Google search shows this may result from invalid db conns still in the connection pool. The log file shows:
[2018-04-03T18:41:08.253-0400] [glassfish 4.1] [INFO] [] [edu.harvard.iq.dataverse.workflow.WorkflowServiceBean] [tid: _ThreadID=49 _ThreadName=jk-connector(1)] [timeMillis: 1522795268253] [levelValue: 800] [[
Searching for workflow step providers...]]
[2018-04-03T18:41:08.379-0400] [glassfish 4.1] [SEVERE] [poolmgr.system_exception] [javax.enterprise.resource.resourceadapter.com.sun.enterprise.resource] [tid: _ThreadID=49 _ThreadName=jk-connector(1)] [timeMillis: 1522795268379] [levelValue: 1000] [[
RAR5031:System Exception
java.lang.NullPointerException
The solr error looks like a missing schema update. "no file cards on file upload" - this could also be a result of merging this branch with my fix for the editfiles fragment?
Retracting what I just said about the file card - I thought you were talking about the file entry on the dataset page. If it was a search card then yes, it must be the solr schema issue.
@scolapasta @sekmiller
Regarding the latest problem Kevin is seeing/reporting, above - that the performance is back to unacceptable, but only when large numbers of files are uploaded on Create (1000 small files take 26 min. to save in Create Dataset mode, but only a number of seconds when uploading and saving the same files in an existing dataset). It really looks like this is the result of calling CreateDataFileCommand from inside CreateDatasetCommand. When files are added to existing datasets, CreateDataFileCommand is called by the page (and not from inside UpdateDatasetCommand). It looks like calling a command from inside another command results in the EJB trying to do everything as one big transaction, and the transaction overhead gets worse (exponentially?) with the number of nested objects. So when uploading 100 files on Create, it is still acceptably fast (some fraction of a second per file); at 1000 files, it appears to be taking 2 seconds for each CreateDataFileCommand.
I'm guessing the solution is to move the CreateDataFile command out of CreateDatasetCommand. Save the dataset first, without files, then call CreateDataFile command for every file separately - like we do when files are added to existing datasets.
Alternatively, maybe we could just junk the CreateDataFile command (it's new in this branch); now that we don't want to create/register file ids on create.
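The proposed flattening might look roughly like this (a toy sketch; the real code goes through the command engine and its transactions, and all names below are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the refactor: instead of CreateDatasetCommand submitting
// CreateDataFileCommand for each file inside its own transaction (one huge
// nested transaction), the caller saves the dataset first and then issues
// one small command per file, as the edit-files path already does.
public class FlattenedCreateSketch {
    static class Dataset { final List<String> files = new ArrayList<>(); }

    static Dataset createDataset() { return new Dataset(); }              // short txn 1
    static void createDataFile(Dataset ds, String f) { ds.files.add(f); } // short txn per file

    public static Dataset createDatasetWithFiles(List<String> files) {
        Dataset ds = createDataset();  // commit the dataset on its own
        for (String f : files) {
            createDataFile(ds, f);     // each file in its own short transaction
        }
        return ds;
    }
}
```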
I refactored the CreateDatasetCommand so that it doesn't call the CreateDataFileCommand in a nested fashion. When I tried the "1000 files" test, creating the Dataset took less than a minute and publishing it took less than 10 minutes; the Dataset Page was locked throughout the publish process and came back automatically when the publish completed.
Adding relevant info about the current state of the branch; this is from @sekmiller via slack on Fri.:
we've seen a database connection error when the number of files on a dataset is well over 500. The logs show errors like: Severe: RAR5031:System Exception
java.lang.NullPointerException
at com.sun.enterprise.resource.ConnectorXAResource.getResourceHandle(ConnectorXAResource.java:246)
at com.sun.enterprise.resource.ConnectorXAResource.end(ConnectorXAResource.java:159)
at com.sun.enterprise.transaction.JavaEETransactionManagerSimplified.delistResource(JavaEETransactionManagerSimplified.java:527)
If you wait long enough, the dataset will eventually publish and all of the files will be registered; however, the page is never locked for the user, and at the end of the process the dataset gets its lock and it doesn't go away.
I put a hack into the DatasetPage init such that if a dataset is locked for PIDRegistration but there's no draft version, we try to remove the lock.
Is there a better way to approach this?
To me this looks very wrong. And I don't feel this is really about the right way of handling an exception - why are we getting that exception in the first place? Specifically, a database connection error? Are we somehow trying to register the file DOIs for the 500+ files all at the same time, instead of doing it sequentially in the background? Steve also said that
They at least get identifiers in the database. I didn't check that they are all actually registered at the EZID admin console, but a spot check shows that all the ones I have looked for there have been registered.
I also added two specific change requests in the review: one for removing an Async attribute that doesn't seem to serve any purpose, and another for the part in the FinalizeDatasetPublicationCommand where we appear to be trying to delete the same lock twice... No good reason to think either of these 2 things is causing trouble, but who knows.
A quick summary, of the latest developments:
Tested the above checklist from Leonid and all works as described. Also nearly completed the test checklist: https://docs.google.com/document/d/1VOlDJFRy7zoS4LVDerH5TgMYsr3DKkm9ZRNcqajitxY/edit?usp=sharing. All works for EZID, but Handles and DataCite do not register ids: Handle does not create them; DataCite now creates but does not register, and is slow.
Since we are moving towards individual pages for files, we also need to think about what the persistent identifier will be for them.