DSpace / DSpace

(Official) The DSpace digital asset management system that powers your Institutional Repository
https://wiki.lyrasis.org/display/DSDOC8x/
BSD 3-Clause "New" or "Revised" License

Admin UI: CSV export from search results can export unlimited amounts of items #9399

Open bram-atmire opened 7 months ago

bram-atmire commented 7 months ago

Describe the bug The CSV export of item metadata that is available to admins tries to export all the items in the search results, regardless of how many items the search returned.

This should ideally be capped in configuration to a relatively safe amount like 500, to avoid killing the entire server by exporting (hundreds of) thousands of items at once.

To Reproduce Steps to reproduce the behavior:

  1. Login as admin
  2. Do a search that gives a lot of results, for example a search without parameters
  3. Hit the button to export the search results to CSV

Expected behavior The export should be capped to a safe amount, for example 500 items.
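
A minimal sketch of how such a cap might be checked before an export is started. The configuration property name "bulkedit.export.max.items" is invented here for illustration and is not, as far as I know, an existing DSpace configuration key:

```java
import org.dspace.services.ConfigurationService;

/**
 * Illustration only: one way a configurable cap could be enforced before an
 * export is kicked off. The property name below is hypothetical.
 */
public class ExportLimitGuard {

    public static void checkLimit(ConfigurationService config, long searchResultCount) {
        // Default to 500 if the site has not configured a limit.
        int maxItems = config.getIntProperty("bulkedit.export.max.items", 500);
        if (searchResultCount > maxItems) {
            throw new IllegalStateException(
                "CSV export is limited to " + maxItems + " items; please narrow the search");
        }
    }
}
```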

Related work https://github.com/DSpace/dspace-angular/pull/1443

tdonohue commented 7 months ago

Definitely sounds like a bug. Needs a volunteer. Pulling over to 7.6.x board just to keep track of it. Would likely need fixing in both 8.x and 7.x

alanorth commented 2 months ago

I wouldn't consider this a bug. If you have 1,056 search results—or 18,644—and you want to save them to CSV, then you should be able to. DSpace should either ask for confirmation if the export is over some reasonable amount, or be configured to break exports up into chunks.

The potential for an admin to DoS their own server is high, so we should help them do it safely. Especially as repositories grow in size, there will always be more records and more use cases for CSV exports.

mwoodiupui commented 1 month ago

More information needed. Why do large CSV exports "kill the server"?

alanorth commented 1 month ago

@mwoodiupui resource usage.

As an anecdote, I frequently use dspace metadata-export on my entire repository to generate a CSV I can interrogate, update, etc. As our repository has grown, the amount of memory required to complete that export has risen—where setting JAVA_OPTS=-Xmx3072m used to work, now I need 5120m to complete without hitting java.lang.OutOfMemoryError. Also, more than once recently, the export has completed successfully, but somehow Tomcat has crashed in the process.

mwoodiupui commented 1 month ago

@alanorth Aha, that's because we're doing it wrong. If I'm correct, the export is being created in dspace-api/src/main/java/org/dspace/app/bulkedit/DSpaceCSV.java. At the top is an Important Note: "Is it wise to...[be] holding an entire CSV upload in memory?" I submit that it is not, and that this is the problem. That is why I asked the question.

DSpaceCSV accumulates a list of "lines", then copies the list into an array, then stuffs them into a StringBuilder, then extracts a String from it, then wraps an InputStream around the String and returns that. So there are at least three copies of the entire result of the request in memory all at once. What it should do IMHO is return a custom InputStream that uses the Iterator returned by the database layer and composes "lines" from the result set as it is read. Thus the memory demand for the operation becomes negligible regardless of the size of the result set.
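
A rough sketch of that streaming approach, with the row type and the row-to-CSV rendering left generic (none of this is the existing DSpaceCSV API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.function.Function;

/**
 * Sketch only: an InputStream that renders one CSV row at a time from an
 * Iterator (e.g. the Iterator of Items coming back from the database layer),
 * so the full export is never held in memory.
 */
public class RowStreamingInputStream<T> extends InputStream {

    private final Iterator<T> rows;              // lazily iterates the result set
    private final Function<T, String> toCsvRow;  // renders one object as one CSV line
    private byte[] current;                      // bytes of the row currently being served
    private int pos;

    public RowStreamingInputStream(String headerRow, Iterator<T> rows, Function<T, String> toCsvRow) {
        this.rows = rows;
        this.toCsvRow = toCsvRow;
        this.current = (headerRow + "\n").getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public int read() throws IOException {
        while (pos >= current.length) {          // current row exhausted; fetch the next one
            if (!rows.hasNext()) {
                return -1;                        // end of result set, end of stream
            }
            current = (toCsvRow.apply(rows.next()) + "\n").getBytes(StandardCharsets.UTF_8);
            pos = 0;
        }
        return current[pos++] & 0xFF;
    }
}
```

The export endpoint could then hand such a stream straight to the HTTP response, so memory use stays proportional to one row rather than to the whole result set.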

It seems to me that DSpaceCSV has too many jobs: it is used as an accumulator both for CSV input from a stream and for CSV output. If input requires accumulation here, then import and export should be factored out of the CSV-handling code and designed separately. Export should not require accumulation.

mwoodiupui commented 1 month ago

I sketched up a new class to do this in a more streamlined fashion. It has one issue: it needs to be given the list of metadatafield names in advance.

The naive approach would be to fetch all of the desired Items twice: once to collect unique field names and again to actually produce the CSV. That would be quite costly in time and cycles.

The simple-minded approach would be to just enumerate all of the defined metadatafield names. Quick and simple, but can yield empty columns.

For a single item, the best approach is probably to enumerate its metadata and put their names into the list.
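
Something along these lines for the single-item case, using the ItemService wildcard metadata lookup (method names should be checked against the DSpace version in use):

```java
import java.util.LinkedHashSet;
import java.util.Set;

import org.dspace.content.Item;
import org.dspace.content.MetadataField;
import org.dspace.content.MetadataValue;
import org.dspace.content.service.ItemService;

/**
 * Sketch: collect the distinct metadata field names actually present on one Item.
 */
public class SingleItemFieldNames {

    public static Set<String> fieldNamesOf(ItemService itemService, Item item) {
        Set<String> names = new LinkedHashSet<>();
        for (MetadataValue value : itemService.getMetadata(item, Item.ANY, Item.ANY, Item.ANY, Item.ANY)) {
            MetadataField field = value.getMetadataField();
            // Build "schema.element" or "schema.element.qualifier".
            StringBuilder name = new StringBuilder(field.getMetadataSchema().getName())
                    .append('.').append(field.getElement());
            if (field.getQualifier() != null) {
                name.append('.').append(field.getQualifier());
            }
            names.add(name.toString());
        }
        return names;
    }
}
```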

For the entire repository, a fairly simple database query will give a list of the metadatafield IDs that are actually used by at least one Item. I ran it on the production database for our modest repo: for 42,458 Items having 635,515 metadatavalues it produced a list of 84 distinct field IDs in 593ms (excluding negligible planner time). So, a (barely) noticeable delay, but not awful.
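
A sketch of what such a lookup might look like (not necessarily the exact query that was run; table and column names follow the DSpace 6+ schema as I understand it, metadatavalue.dspace_object_id joining to item.uuid, and should be verified):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of a "which metadata fields are used by at least one archived Item"
 * lookup, done entirely in the DBMS. JDBC is used here only for illustration;
 * the real code would go through DSpace's DAO layer.
 */
public class UsedMetadataFields {

    private static final String USED_FIELDS_SQL =
        "SELECT DISTINCT mv.metadata_field_id "
      + "  FROM metadatavalue mv "
      + "  JOIN item i ON i.uuid = mv.dspace_object_id "
      + " WHERE i.in_archive = true";

    /** Returns the IDs of metadata fields that occur on at least one archived Item. */
    public static List<Integer> forWholeRepository(Connection db) throws SQLException {
        List<Integer> fieldIds = new ArrayList<>();
        try (PreparedStatement stmt = db.prepareStatement(USED_FIELDS_SQL);
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                fieldIds.add(rs.getInt(1));
            }
        }
        return fieldIds;
    }
}
```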

For a single Collection, the same query can be elaborated to select only Items in that Collection.
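
One possible elaboration for the Collection case, added to the same hypothetical sketch class as above (the collection2item column names are an assumption from the long-standing schema):

```java
// Same query, narrowed to one Collection by joining through collection2item.
private static final String USED_FIELDS_IN_COLLECTION_SQL =
    "SELECT DISTINCT mv.metadata_field_id "
  + "  FROM metadatavalue mv "
  + "  JOIN item i              ON i.uuid = mv.dspace_object_id "
  + "  JOIN collection2item c2i ON c2i.item_id = i.uuid "
  + " WHERE i.in_archive = true "
  + "   AND c2i.collection_id = ?";   // bind the Collection's UUID here
```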

The tricky bit is a Community. I haven't worked out an approach yet. As a relational expression it may require a recursive query. A completely relational approach might not be possible, but a stored procedure could do it efficiently. The main thing is to do this work in the DBMS and avoid dragging millions of objects into DSpace only to turn them into a huge mound of garbage almost instantly. The actual CSV production will unavoidably do that, but we shouldn't do it twice.