What this PR does / why we need it: This PR includes multiple changes to the UpdateDatasetVersionCommand to improve performance and scalability when editing datasets with large numbers of files. Key changes include:
Adding a feature flag to allow disabling the edit-draft logging (separate log files that report changes being made by the current user)
Changing functionality to not update the lastmodifieddate on existing files (since they do not change)
The DatasetVersionDifference optimizations from #10818 (only improves time when edit-draft reporting is still enabled)
Doing an initial merge of the dataset and avoiding subsequent merge/flush operations
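The feature-flag item in the list above can be pictured as a simple guard around the optional reporting step. This is only a sketch under stated assumptions: the property name `example.disable-edit-draft-logging`, the class `EditDraftLogGuard`, and the `Runnable` standing in for the reporting work are all hypothetical and are not the names used in this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class EditDraftLogGuard {
    // Hypothetical property name, used for illustration only.
    static final String FLAG = "example.disable-edit-draft-logging";

    /** Runs the reporting step unless the flag is set; returns true if it ran. */
    static boolean reportIfEnabled(Runnable writeReport) {
        if (Boolean.getBoolean(FLAG)) {
            // Flag set: skip the differencing/report step entirely.
            return false;
        }
        writeReport.run();
        return true;
    }

    public static void main(String[] args) {
        AtomicInteger reports = new AtomicInteger();

        // Flag not set: the report is written.
        System.clearProperty(FLAG);
        reportIfEnabled(reports::incrementAndGet);

        // Flag set: the report is skipped.
        System.setProperty(FLAG, "true");
        reportIfEnabled(reports::incrementAndGet);

        System.out.println("reports written: " + reports.get());
    }
}
```

Checking the flag before doing any of the work, rather than computing the differences and then discarding them, is what would make a flag like this save time.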
Which issue(s) this PR closes:
Closes #10138
Special notes for your reviewer: In my testing on a dataset with 10K files, the UpdateDatasetVersionCommand in the DatasetPage.save() method (timed by logging in the save method) averaged ~30 seconds to complete after a one-character change to the description. With all the changes in the PR, it now takes ~12-13 seconds. In general, verifying the impact of individual changes is hard:
I see variations of ~2 seconds between repeat runs
The first run after deployment can be ~3-4 seconds longer
Simply logging the time a statement takes can be misleading: in one iteration, I saw that calculating the md5 hash of the :CVocConf setting was taking 2 seconds! While moving the retrieval of that setting as in the PR reduced that time to ~1 ms and produced an overall improvement, the overall change was much smaller than 2 seconds. It looks like parallel operations were just slowing that step.
Similarly, while #10818 reduced the differencing time from ~12 seconds to < 1 second when run after operations, trying to do it early led to a ~4-5 second run time. My guess is that some of the time is spent lazy loading elements used in the differencing, but I'm not sure.
That said, I would estimate that the first two changes each contribute a ~4 second reduction (the feature flag would save 12 seconds, but the differencing PR saves ~8 seconds there). The
Suggestions on how to test this: All the automated tests should pass, and any/all variants of making changes to a dataset should work as before. There should be no changes w.r.t. the db-level updates except for the change to not update datafile lastmodified dates. Performance should be improved overall and scaling should be improved. The simplest way to test that might be to turn on fine logging for the DatasetPage, where I've added logging of the time to run the update command. (Note that the overall time seen in the UI includes both the time to save the changes and the time to reload the page. The latter, with 10K files, is still many seconds and hasn't been improved in this PR.)
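For anyone reproducing the measurements, here is a minimal sketch of the kind of timing described above. It times a step with `System.nanoTime()`, logs the result at `FINE`, and averages repeat runs while treating the first run separately, since the first run after deployment tends to be slower. The class `StepTimer` and its methods are hypothetical and are not the logging added in this PR; the sleep is a stand-in for the update command.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class StepTimer {
    private static final Logger logger = Logger.getLogger(StepTimer.class.getName());

    /** Times one run of the step, logs it at FINE, and returns elapsed milliseconds. */
    static long timeMillis(String label, Runnable step) {
        long start = System.nanoTime();
        step.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        logger.log(Level.FINE, "{0} took {1} ms", new Object[]{label, elapsedMs});
        return elapsedMs;
    }

    /** Mean of all runs after the first, which is treated as warm-up. */
    static double meanExcludingFirst(long[] runsMs) {
        if (runsMs.length < 2) {
            throw new IllegalArgumentException("need at least two runs");
        }
        long sum = 0;
        for (int i = 1; i < runsMs.length; i++) {
            sum += runsMs[i];
        }
        return (double) sum / (runsMs.length - 1);
    }

    public static void main(String[] args) {
        long[] runs = new long[5];
        for (int i = 0; i < runs.length; i++) {
            // Stand-in workload; replace with the operation being measured.
            runs[i] = timeMillis("update command (stand-in)", () -> {
                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        System.out.println("first run: " + runs[0] + " ms, mean of the rest: "
                + meanExcludingFirst(runs) + " ms");
    }
}
```

With the default `java.util.logging` configuration the `FINE` messages are not printed, which is why fine logging has to be turned on for the relevant logger before the timings show up in the log.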
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: Probably one for any/all performance updates going into 6.5 along with announcing the feature flag and change to file last modified behavior.
coverage: 21.012% (+0.1%) from 20.872%
when pulling e0cfcfc30fb4bfde6327e74c8a2ddf9d47baee3e on GlobalDataverseCommunityConsortium:DANS_Performance2
into 068607793b70d6fdd0b0ee1b1a3d2a5bfc2c2574 on IQSS:develop.
Additional documentation: to be added