Looking at the code, I think unzipping and the unzip/rezip for shapefiles aren't considered ingest, so setting the ingest size limit won't help (could be wrong). That said, because this involves unzipping, it could be disabled if the depositor's dataset used a store that has direct upload enabled. (Direct upload doesn't unzip, because pulling the files to Dataverse to unzip them after direct uploading to the S3 bucket essentially defeats the purpose of putting the file directly in S3 to start with.) Direct upload, when enabled, can be done via the UI or via the direct-upload API, which can be used directly or through DVUploader. pyDataverse doesn't yet use the direct upload API, so it would not handle this case at present.
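For reference, enabling direct upload is a per-store setting. A rough sketch of what an installation admin would run for an existing S3 store (the store id "s3" is a placeholder for whatever id the installation actually uses):

```shell
# Sketch, assuming an S3 store already configured with id "s3"; substitute the real store id.
# Redirect uploads for that store straight to the S3 bucket (direct upload):
./asadmin create-jvm-options "-Ddataverse.files.s3.upload-redirect=true"

# Restart the app server so the new JVM option takes effect.
./asadmin restart-domain
```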
Thanks as always! The shapefiles that the depositor is uploading are not zipped (and the depositor is trying to prevent the Dataverse software from zipping the files), so I think this particular case involves only zipping. Does what you wrote also mean that direct upload will prevent the Dataverse software from zipping the files, too?
Touched base with Julian on the issue in person. Here is where we are at and next steps:

What we know:

We don't know:

Next steps:
Perhaps I'm missing something, but all I see in the Dataverse code is a check for application/zipped-shapefile and code to unzip/rezip. Are we sure it is Dataverse zipping the files and not the upload util? For example, I see https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/U3YXNX which has unzipped shapefile elements. (FWIW I can't see the RT ticket, so I don't know any details in it.)
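One way to check what actually happened to a given upload is to list the dataset's files over the native API and look at each file's contentType; a rezipped shapefile set should show up as application/zipped-shapefile. A rough sketch, reusing the example dataset above (jq is just for readability):

```shell
SERVER=https://dataverse.harvard.edu
PID=doi:10.7910/DVN/U3YXNX   # example dataset mentioned above

# List files in the latest published version and print filename + MIME type.
curl -s "$SERVER/api/datasets/:persistentId/versions/:latest-published/files?persistentId=$PID" \
  | jq -r '.data[] | "\(.dataFile.filename)\t\(.dataFile.contentType)"'
```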
Ugh, when I download the shapefiles in https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/U3YXNX and re-upload them to a new dataset on Demo Dataverse and on Harvard's repo through the UI (dragging and dropping the files into the "Upload with HTTP via your browser" panel), they aren't being zipped as I've experienced before. I also tested with another set of shapefiles I've used before, and neither Demo nor Harvard's repo zipped them. So now I'm confused. This zipping of shapefiles is also what I described in the GitHub comment I mentioned.
@qqmyers, by "the upload util" do you mean DVUploader? As far as I know, the depositor is using only the Dataverse Native API for uploading files, including shapefiles that aren't zipped. And the depositor has shared screenshots showing the shapefiles being zipped after upload.
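For context, the kind of add-file call I believe the depositor is using looks roughly like this (the server, dataset DOI, token, and filename are placeholders):

```shell
SERVER=https://dataverse.harvard.edu
PID=doi:10.7910/DVN/XXXXXX          # placeholder dataset DOI
API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# Add one unzipped shapefile component to the dataset via the native API.
curl -s -H "X-Dataverse-key:$API_TOKEN" -X POST \
  -F "file=@shapefile_set_1.shp" \
  -F 'jsonData={"description":"Shapefile component","restrict":"false"}' \
  "$SERVER/api/datasets/:persistentId/add?persistentId=$PID"
```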
DVUploader won't do any file manipulation. I'm just guessing that there may be something else involved that is creating the zip (which Dataverse would then unzip/rezip).
I got some clarification from the depositor about what's happening with their shapefiles. It's a bit different from what I've been describing in this issue so far. They are uploading multiple zip files, and some of those zip files contain sets of shapefiles. For a made-up example:

boston.zip contains:

- shapefile_set_1.shp
- shapefile_set_1.shx
- shapefile_set_1.dbf
- shapefile_set_1.prj
- shapefile_set_1.cst
- readme.txt

Upon upload, the Dataverse software unzips boston.zip and then rezips only the first four files (the four file types mentioned in the Developer Guide: .shp, .shx, .dbf, and .prj). shapefile_set_1.cst and readme.txt are not included in the zip file that the Dataverse software creates.
So after this unzipping and partial re-zipping, in the file table you see:

- the new zip file that the Dataverse software created, containing only the .shp, .shx, .dbf, and .prj files
- shapefile_set_1.cst
- readme.txt

The depositor expects all six files (or however many files are in the actual zip files that the depositor needs to upload) to be in the same zip file. In my made-up example, that would include the shapefile_set_1.cst and readme.txt files.
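A quick way to see the difference is to list the zip contents before upload and after download. The name of the rezipped file below is just a guess at what the software produces; everything else uses the made-up filenames from the example:

```shell
# Before upload: all six files travel together.
unzip -l boston.zip
#   shapefile_set_1.shp
#   shapefile_set_1.shx
#   shapefile_set_1.dbf
#   shapefile_set_1.prj
#   shapefile_set_1.cst
#   readme.txt

# After downloading from the dataset, the rezipped file holds only the four
# shapefile components; the .cst and readme.txt show up as separate files.
unzip -l shapefile_set_1.zip   # assumed name of the rezipped file
```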
I don't know much about direct upload, but from the Developer guides it sounds like something a Dataverse installation admin would have to enable, right? Maybe this is a workaround that someone else on the team at IQSS could help with?
The depositor let me know that they have no hard deadline for the upload, and they'll continue working on the data that doesn't involve shapefiles, but they would like to get all of the data uploaded as soon as possible. I let them know that we're short-handed this week and maybe next week, and that we'll continue updating them as we learn more.
Related:
Challenges with how the Dataverse software handles shapefiles were mentioned in a GitHub issue comment at https://github.com/IQSS/dataverse/issues/6873#issuecomment-624804056. My questions then were more about computational reproducibility. But this has come up again because a depositor I'm trying to help has concerns about how this functionality is complicating their upload of a lot of data onto the Harvard Dataverse Repository.
The depositor, who's using Dataverse APIs to upload files that are not on their computer (I think the files are on an AWS server), may or may not be able to detect and double-zip all shapefiles in order to prevent the Dataverse software from zipping the shapefiles when they're uploaded to the repository. I'll ask the depositor if they can do this.
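For reference, the double-zipping would look something like this, using the boston.zip example above. As I understand it, the software unzips only the outer layer, so the inner zip would be kept intact as a single file:

```shell
# Inner zip: the files the depositor wants kept together (made-up names).
zip boston.zip shapefile_set_1.shp shapefile_set_1.shx shapefile_set_1.dbf \
    shapefile_set_1.prj shapefile_set_1.cst readme.txt

# Outer zip: wrap it once more so only this layer gets unzipped on upload,
# leaving boston.zip as a single file in the dataset.
zip boston_wrapped.zip boston.zip
```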
But:
For more context, the email conversation is in IQSS's support email system at https://help.hmdc.harvard.edu/Ticket/Display.html?id=322323, and the data is from the Redistricting Data Hub.
More broadly, I think more research should be done into the value of the Dataverse software's handling of shapefiles, including the questions and discussion in the GitHub issue comment at https://github.com/IQSS/dataverse/issues/6873#issuecomment-624804056.
The issue at https://github.com/IQSS/dataverse/issues/7352 might also be related.
Having ways for depositors to learn about this behavior before they start uploading would be helpful. This behavior is fully documented only in the Developer Guides (https://guides.dataverse.org/en/6.2/developers/geospatial.html); the User Guides reference it but don't explain it, and it isn't mentioned in the UI.