
Spike: Investigate challenges with how Dataverse software handles shapefiles #8816

Open

jggautier opened 2 years ago

jggautier commented 2 years ago

Challenges with how the Dataverse software handles shapefiles were mentioned in a GitHub issue comment at https://github.com/IQSS/dataverse/issues/6873#issuecomment-624804056. My questions then were more about computational reproducibility. But this has come up again because a depositor I'm trying to help has concerns about how this functionality is complicating their upload of a large amount of data to the Harvard Dataverse Repository.

The depositor, who's using Dataverse APIs to upload files that are not on their computer (I think the files are on an AWS server), may or may not be able to detect and double-zip all shapefiles in order to prevent the Dataverse software from zipping the shapefiles when they're uploaded to the repository. I'll ask the depositor if they can do this.
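As I understand it, double-zipping just means wrapping the upload in an outer zip so that Dataverse unpacks only the outer layer and stores the inner zip as-is. A minimal sketch in Python, with made-up file names:

```python
import zipfile

# Made-up file name; the inner zip is whatever the depositor would normally
# upload. Dataverse unpacks the outer zip and keeps the inner zip intact as a
# single file, so its contents aren't reorganized.
inner = "boston.zip"

# ZIP_STORED skips recompressing the already-compressed inner zip.
with zipfile.ZipFile("boston_double.zip", "w", zipfile.ZIP_STORED) as outer:
    outer.write(inner)
```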

But:

For more context, the email conversation is in IQSS's support email system at https://help.hmdc.harvard.edu/Ticket/Display.html?id=322323, and the data is from the Redistricting Data Hub.

More broadly, I think more research should be done on the value of the Dataverse software's handling of shapefiles, including the questions and discussion in the GitHub issue comment at https://github.com/IQSS/dataverse/issues/6873#issuecomment-624804056.

The issue at https://github.com/IQSS/dataverse/issues/7352 might also be related.

Having ways for depositors to learn about this behavior before they start uploading would be helpful. The behavior is fully documented only in the Developer Guides (https://guides.dataverse.org/en/6.2/developers/geospatial.html); it isn't explained in the User Guides or in the UI, although the User Guides reference it.

mreekie commented 2 years ago

Next steps:

qqmyers commented 2 years ago

Looking at the code, I think the unzip and the unzip/rezip for shapefiles aren't considered ingest, so setting the ingest size limit won't help (I could be wrong). That said, because this involves unzipping, it could be avoided if the depositor's dataset used a store that has direct upload enabled. (Direct upload doesn't unzip, because pulling the files back to Dataverse to unzip them after they've been uploaded directly to the S3 bucket would essentially defeat the purpose of putting them directly in S3 to start with.) Direct upload, when enabled, can be done via the UI or via the direct-upload API, which can be used directly or through DVUploader. pyDataverse doesn't yet use the direct-upload API, so it would not handle this case at present.
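For reference, uploads that go through the native API's add-file endpoint are the ones that hit this server-side handling. A minimal sketch, with a placeholder server, DOI, and API token (direct upload uses a separate API and isn't shown here):

```python
import requests

# Placeholders; substitute a real installation, dataset DOI, and API token.
SERVER = "https://demo.dataverse.org"
PERSISTENT_ID = "doi:10.5072/FK2/EXAMPLE"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# Zip files sent through this endpoint are unpacked server-side, and any
# recognized shapefile sets inside them are re-zipped.
url = f"{SERVER}/api/datasets/:persistentId/add"

with open("boston.zip", "rb") as f:
    r = requests.post(
        url,
        params={"persistentId": PERSISTENT_ID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": ("boston.zip", f, "application/zip")},
    )
r.raise_for_status()
print(r.json()["status"])
```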

jggautier commented 2 years ago

Thanks as always! The shapefiles that the depositor is uploading are not zipped (and the depositor is trying to prevent the Dataverse software from zipping the files), so I think this particular case involves only zipping. Does what you wrote also mean that direct upload will prevent the Dataverse software from zipping the files, too?

mreekie commented 2 years ago

Touched base with Julian on the issue in person. Here is where we are at and next steps.

What we know:

Next Steps:

qqmyers commented 2 years ago

Perhaps I'm missing something, but all I see in the Dataverse code is a check for application/zipped-shapefile and code to unzip/rezip. Are we sure it's Dataverse zipping the files and not the upload util? For example, https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/U3YXNX has unzipped shapefile elements. (FWIW, I can't see the RT ticket, so I don't know any details in it.)

jggautier commented 2 years ago

Ugh, when I download the shapefiles in https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/U3YXNX and re-upload them in a new dataset on Demo Dataverse and on Harvard's repo through the UI (dragging and dropping the files into the "Upload with HTTP via your browser" panel), they aren't being zipped as I've experienced before. I also tested with another set of shapefiles I've used before and Demo and Harvard's repo aren't zipping them. So now I'm confused. This zipping of shapefiles is also what I described in the GitHub comment I mentioned.

@qqmyers, by "the upload util" do you mean DVUploader? As far as I know the depositor is using only the Dataverse Native API for uploading files, including shapefiles that aren't zipped. And the depositor has shared screenshots of the shapefiles being zipped after upload.

qqmyers commented 2 years ago

DVUploader won't do any file manipulation. I'm just guessing that there may be something else involved that is creating the zip (which Dataverse would then unzip/rezip).

jggautier commented 2 years ago

I got some clarification from the depositor about what's happening with their shapefiles. It's a bit different from what I've been describing in this issue so far. They're uploading multiple zip files, and some of those zip files contain sets of shapefiles. For a made-up example:

boston.zip contains:

  1. shapefile_set_1.dbf
  2. shapefile_set_1.prj
  3. shapefile_set_1.shp
  4. shapefile_set_1.shx
  5. shapefile_set_1.cst
  6. readme.txt

Upon upload, the Dataverse software unzips boston.zip and then re-zips only the first four files (the four file types mentioned in the Developer Guide). shapefile_set_1.cst and readme.txt are not included in the zip file that the Dataverse software creates.

So after this zipping and partial re-zipping, in the file table you see:

  1. boston.zip (which would contain the first four files)
  2. shapefile_set_1.cst
  3. readme.txt

The depositor expects all six files (or however many files are in the actual zip files that the depositor needs to upload) to be in the same zip file. In my made-up example, that would include the shapefile_set_1.cst and readme.txt files.
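To make the behavior concrete, here's a minimal sketch (not Dataverse's actual code) of the reorganization described above, assuming the four extensions listed in the Developer Guide are .shp, .shx, .dbf, and .prj:

```python
import zipfile
from pathlib import PurePosixPath

# Assumption based on the Developer Guide: these four extensions make up
# the shapefile set that gets re-zipped together.
SHAPEFILE_PARTS = {".shp", ".shx", ".dbf", ".prj"}

def simulate_reorganization(zip_path):
    """Show which files would be re-zipped together and which left loose."""
    with zipfile.ZipFile(zip_path) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
    rezipped = [n for n in names
                if PurePosixPath(n).suffix.lower() in SHAPEFILE_PARTS]
    loose = [n for n in names if n not in set(rezipped)]
    return rezipped, loose

rezipped, loose = simulate_reorganization("boston.zip")
print("re-zipped together:", rezipped)   # the .shp/.shx/.dbf/.prj parts
print("left as separate files:", loose)  # e.g. shapefile_set_1.cst, readme.txt
```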

I don't know much about direct upload, but from the Developer Guides it sounds like something a Dataverse installation admin would have to enable, right? Maybe this is a workaround that someone else on the team at IQSS could help with?

The depositor let me know that they have no hard deadline for the upload, and they'll continue working on the data that doesn't involve shapefiles, but they would like to get all of the data uploaded as soon as possible. I let them know that we're short-handed this week and maybe next week, and that I'll continue updating them as we learn more.

pdurbin commented 2 years ago

Related: