
Spike: Investigate challenges with how Dataverse software handles shapefiles #8816

Open

jggautier opened 2 years ago

jggautier commented 2 years ago

Challenges with how the Dataverse software handles shapefiles were mentioned in a GitHub issue comment at https://github.com/IQSS/dataverse/issues/6873#issuecomment-624804056. My questions then were more about computational reproducibility. But this has come up again because a depositor I'm trying to help has concerns about how this functionality is complicating their upload of a large amount of data to the Harvard Dataverse Repository.

The depositor, who's using Dataverse APIs to upload files that are not on their computer (I think the files are on an AWS server), may or may not be able to detect and double-zip all shapefiles in order to prevent the Dataverse software from zipping the shapefiles when they're uploaded to the repository. I'll ask the depositor if they can do this.
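As I understand it, double-zipping just means wrapping the upload in an outer zip so that Dataverse unpacks only the outer layer and stores the inner zip as-is. A minimal sketch in Python, with made-up file names:

```python
import zipfile

# Made-up file name; the inner zip is whatever the depositor would normally
# upload. Dataverse unpacks the outer zip and keeps the inner zip intact as a
# single file, so its contents aren't reorganized.
inner = "boston.zip"

# ZIP_STORED skips recompressing the already-compressed inner zip.
with zipfile.ZipFile("boston_double.zip", "w", zipfile.ZIP_STORED) as outer:
    outer.write(inner)
```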

But:

For more context, the email conversation is in IQSS's support email system at https://help.hmdc.harvard.edu/Ticket/Display.html?id=322323, and the data is from the Redistricting Data Hub.

More broadly, I think more research should be done on the value of the Dataverse software's handling of shapefiles, including the questions and discussion in the GitHub issue comment at https://github.com/IQSS/dataverse/issues/6873#issuecomment-624804056.

The issue at https://github.com/IQSS/dataverse/issues/7352 might also be related.

Having ways for depositors to learn about this behavior before they start uploading would be helpful. The behavior is fully documented only in the Developer Guides (https://guides.dataverse.org/en/6.2/developers/geospatial.html); it isn't explained in the User Guides or in the UI, although the User Guides reference it.

mreekie commented 2 years ago

Next steps:

qqmyers commented 2 years ago

Looking at the code, I think the unzip and the unzip/rezip for shapefiles aren't considered ingest, so setting the ingest size limit won't help (I could be wrong). That said, because this involves unzipping, it could be avoided if the depositor's dataset used a store that has direct upload enabled. (Direct upload doesn't unzip, because pulling the files back to Dataverse to unzip them after they've been uploaded directly to the S3 bucket would essentially defeat the purpose of putting them directly in S3 to start with.) Direct upload, when enabled, can be done via the UI or via the direct-upload API, which can be used directly or through DVUploader. pyDataverse doesn't yet use the direct-upload API, so it would not handle this case at present.
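For reference, uploads that go through the native API's add-file endpoint are the ones that hit this server-side handling. A minimal sketch, with a placeholder server, DOI, and API token (direct upload uses a separate API and isn't shown here):

```python
import requests

# Placeholders; substitute a real installation, dataset DOI, and API token.
SERVER = "https://demo.dataverse.org"
PERSISTENT_ID = "doi:10.5072/FK2/EXAMPLE"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# Zip files sent through this endpoint are unpacked server-side, and any
# recognized shapefile sets inside them are re-zipped.
url = f"{SERVER}/api/datasets/:persistentId/add"

with open("boston.zip", "rb") as f:
    r = requests.post(
        url,
        params={"persistentId": PERSISTENT_ID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": ("boston.zip", f, "application/zip")},
    )
r.raise_for_status()
print(r.json()["status"])
```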

jggautier commented 2 years ago

Thanks as always! The shapefiles that the depositor is uploading are not zipped (and the depositor is trying to prevent the Dataverse software from zipping the files), so I think this particular case involves only zipping. Does what you wrote also mean that direct upload will prevent the Dataverse software from zipping the files, too?

mreekie commented 2 years ago

Touched base with Julian on the issue in person. Here is where we are at and next steps.

What we know:

Next Steps:

qqmyers commented 2 years ago

Perhaps I'm missing something, but all I see in the Dataverse code is a check for application/zipped-shapefile and code to unzip/rezip. Are we sure it's Dataverse zipping the files and not the upload util? For example, https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/U3YXNX has unzipped shapefile elements. (FWIW, I can't see the RT ticket, so I don't know any details in it.)

jggautier commented 2 years ago

Ugh, when I download the shapefiles in https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/U3YXNX and re-upload them in a new dataset on Demo Dataverse and on Harvard's repo through the UI (dragging and dropping the files into the "Upload with HTTP via your browser" panel), they aren't being zipped as I've experienced before. I also tested with another set of shapefiles I've used before and Demo and Harvard's repo aren't zipping them. So now I'm confused. This zipping of shapefiles is also what I described in the GitHub comment I mentioned.

@qqmyers, by "the upload util" do you mean DVUploader? As far as I know the depositor is using only the Dataverse Native API for uploading files, including shapefiles that aren't zipped. And the depositor has shared screenshots of the shapefiles being zipped after upload.

qqmyers commented 2 years ago

DVUploader won't do any file manipulation. I'm just guessing that there may be something else involved that is creating the zip (which Dataverse would then unzip/rezip).

jggautier commented 2 years ago

I got some clarification from the depositor about what's happening with their shapefiles. It's a bit different from what I've been describing in this issue so far. They're uploading multiple zip files, and some of those zip files contain sets of shapefiles. For a made-up example:

boston.zip contains:

  1. shapefile_set_1.dbf
  2. shapefile_set_1.prj
  3. shapefile_set_1.shp
  4. shapefile_set_1.shx
  5. shapefile_set_1.cst
  6. readme.txt

Upon upload, the Dataverse software unzips boston.zip and then re-zips only the first four files (the four file types mentioned in the Developer Guide). shapefile_set_1.cst and readme.txt are not included in the zip file that the Dataverse software creates.

So after this zipping and partial re-zipping, in the file table you see:

  1. boston.zip (which would contain the first four files)
  2. shapefile_set_1.cst
  3. readme.txt

The depositor expects all six files (or however many files are in the actual zip files that the depositor needs to upload) to be in the same zip file. In my made-up example, that would include the shapefile_set_1.cst and readme.txt files.
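To make the behavior concrete, here's a minimal sketch (not Dataverse's actual code) of the reorganization described above, assuming the four extensions listed in the Developer Guide are .shp, .shx, .dbf, and .prj:

```python
import zipfile
from pathlib import PurePosixPath

# Assumption based on the Developer Guide: these four extensions make up
# the shapefile set that gets re-zipped together.
SHAPEFILE_PARTS = {".shp", ".shx", ".dbf", ".prj"}

def simulate_reorganization(zip_path):
    """Show which files would be re-zipped together and which left loose."""
    with zipfile.ZipFile(zip_path) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
    rezipped = [n for n in names
                if PurePosixPath(n).suffix.lower() in SHAPEFILE_PARTS]
    loose = [n for n in names if n not in set(rezipped)]
    return rezipped, loose

rezipped, loose = simulate_reorganization("boston.zip")
print("re-zipped together:", rezipped)   # the .shp/.shx/.dbf/.prj parts
print("left as separate files:", loose)  # e.g. shapefile_set_1.cst, readme.txt
```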

I don't know much about direct upload, but from the Developer Guides it sounds like something a Dataverse installation admin would have to enable, right? Maybe this is a workaround that someone else on the team at IQSS could help with?

The depositor let me know that they have no hard deadline for the upload, and they'll continue working on the data that doesn't involve shapefiles, but they would like to get all of the data uploaded as soon as possible. I let them know that we're short-handed this week and maybe next week, and that I'll continue updating them as we learn more.

pdurbin commented 2 years ago

Related: