IQSS / dataverse

Open source research data repository software
http://dataverse.org

Slow ingest for relatively big SPSS files #8954

Open lubitchv opened 2 years ago

lubitchv commented 2 years ago

For relatively big SPSS files (150-400 MB), ingest is very slow: it usually takes 1-3 hours, and some files can be stuck in the ingest process for 12 hours. Ingest typically uses 100% of a CPU core, so the maximum number of simultaneous ingests can only be less than or equal to the number of CPUs on the server. We are in the process of migrating from Nesstar and have thousands of datasets with relatively large SPSS files, and with ingest this slow the transition is difficult. It would be useful to optimize the ingest code so that ingest is faster.
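To illustrate the CPU-bound behaviour described above, a small sketch like the following could be used to check that a single ingest-style task keeps one core near 100% (CPU time roughly equal to wall time). The `parseSpssFile` method here is a placeholder workload, not Dataverse's actual ingest code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class IngestCpuSketch {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        long cpuStart = bean.getCurrentThreadCpuTime();

        parseSpssFile(); // placeholder for the CPU-bound SPSS/tabular ingest

        long cpuMs = (bean.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;
        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        // A ratio close to 1.0 means the thread is CPU-bound, which matches
        // the ~100% CPU usage reported above.
        System.out.printf("CPU time %d ms / wall time %d ms = %.2f%n",
                cpuMs, wallMs, (double) cpuMs / Math.max(1, wallMs));
    }

    private static void parseSpssFile() {
        // Stand-in workload; the real ingest parses the .sav file and
        // produces the tab-delimited version plus variable metadata.
        double x = 0;
        for (long i = 0; i < 500_000_000L; i++) x += Math.sqrt(i);
        System.out.println(x);
    }
}
```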

donsizemore commented 2 years ago

@lubitchv one proposal I remember was to have Dataverse only allow n-1 concurrent ingests, where n equals the number of cores available on the node. I don't find that in an open issue, though.
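To make that proposal concrete, here is a minimal sketch of the kind of throttling described, assuming a plain fixed-size thread pool capped at n-1 workers. Dataverse's real ingest queue is managed by the application server, so the class and job names below are purely illustrative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IngestPoolSketch {
    public static void main(String[] args) {
        // Leave one core free for the application server and database:
        // at most n-1 concurrent ingests, where n = available cores.
        int cores = Runtime.getRuntime().availableProcessors();
        int maxConcurrentIngests = Math.max(1, cores - 1);

        ExecutorService ingestPool = Executors.newFixedThreadPool(maxConcurrentIngests);

        // Hypothetical ingest jobs; in Dataverse the real work is done by
        // the ingest service, not by this placeholder.
        for (int i = 0; i < 10; i++) {
            final int jobId = i;
            ingestPool.submit(() -> {
                System.out.println("Ingesting file " + jobId
                        + " on " + Thread.currentThread().getName());
                // ... CPU-bound SPSS parsing would happen here ...
            });
        }
        ingestPool.shutdown();
    }
}
```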

lubitchv commented 2 years ago

Limiting the number of concurrent ingests would resolve the security issue, but it would not solve our problem of uploading a relatively large number of datasets to Dataverse, with ingest, in a reasonable timeframe.