IQSS / dataverse

Open source research data repository software
http://dataverse.org

Improve Excel ingest documentation for troubleshooting #7452

Open amberleahey opened 3 years ago

amberleahey commented 3 years ago

Hi! We have had questions about Excel data upload failures, and the existing documentation here isn't very helpful for troubleshooting, unfortunately: https://guides.dataverse.org/en/latest/user/tabulardataingest/excel.html

It would be helpful if there was a template or list of conditions provided here for end users to troubleshoot why their data failed to load. An example of an acceptable spreadsheet structure would also be great, even visuals.

Is this something that can be developed to support users? I'm not familiar with all the conditions and exceptions for Excel data upload but perhaps this is something you have mapped out?

Thanks in advance,

pdurbin commented 3 years ago

@amberleahey I definitely agree the documentation could be improved.

As far as examples, sure, they could probably go in the guides. Another place to find example files is the "sample data" repo. @philippconzett contributed an Excel file in https://github.com/IQSS/dataverse-sample-data/commit/f3ef7ee that ingests fine and looks like this:

[Screenshot, 2020-12-03]

So maybe the guides could show a few examples and suggest the "sample data" repo for more. Something like that.

amberleahey commented 3 years ago

Yes, that would be wonderful, thanks. It's a common problem we see, and it would be good to address it more concretely with researchers if we want them to upload their data in this way.

shlake commented 3 years ago

Excel files with multiple sheets get ingested successfully (?) (I haven't had one fail), BUT only the first sheet is used to create the new ".tab" file. The other sheets are lost in the tabular ingest.

BUT at least the original .xlsx file can be downloaded (via the GUI) and still has all of its sheets.

There is a problem with downloading the "file" via the Download URL: only the ".tab" file is downloaded (via the link shown on the file page), which in this case is not the complete file.

I try to teach my researchers NOT to use multiple sheets, but hey, Excel lets them do it, so they do.
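
For the Download URL problem described above, one possible workaround is to request the saved original through the Data Access API, which accepts a `format=original` query parameter for ingested tabular files. The sketch below only builds that URL; the function name, the server address, and the file id are hypothetical, and whether this suits a given installation is an assumption to verify.

```python
from urllib.parse import urlencode


def original_file_url(server, file_id):
    """Build a Data Access API URL that asks for the saved original
    (e.g. the .xlsx) instead of the ingested .tab file.

    `server` and `file_id` are placeholders supplied by the caller.
    """
    query = urlencode({"format": "original"})
    return f"{server.rstrip('/')}/api/access/datafile/{file_id}?{query}"


# Hypothetical server and database id, for illustration only.
print(original_file_url("https://demo.dataverse.org", 42))
```

Restricted or unpublished files would additionally need an API token, which this sketch leaves out.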

adam3smith commented 3 years ago

I don't know if this is the best place to add additional failure conditions, but since I didn't find a better one: Line breaks within cells (which are legal in Excel) break ingest. This may be our most common reason for failure.
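
A depositor could check for this condition before uploading. The following is a minimal sketch, not part of Dataverse: it uses only the Python standard library, assumes the workbook stores its text in `xl/sharedStrings.xml` under the usual (transitional) SpreadsheetML namespace, and does not look at inline strings. The function name is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by ordinary (transitional) .xlsx files; an assumption here.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def shared_strings_with_line_breaks(xlsx_path):
    """Return the shared-string values in the workbook that contain a line break."""
    with zipfile.ZipFile(xlsx_path) as zf:
        if "xl/sharedStrings.xml" not in zf.namelist():
            return []  # workbook has no shared strings (e.g. numbers only)
        root = ET.fromstring(zf.read("xl/sharedStrings.xml"))

    hits = []
    for si in root.iter(NS + "si"):
        # A string item may be split into several <t> runs; join them.
        text = "".join(t.text or "" for t in si.iter(NS + "t"))
        if "\n" in text or "\r" in text:
            hits.append(text)
    return hits
```

An empty result means no line breaks were found in shared strings; it does not guarantee that ingest will succeed.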

pdurbin commented 3 years ago

@adam3smith thanks, yes, it's fine to mention more cases here. That one also happens to have a dedicated issue:

@shlake thanks for mentioning the multiple sheets issue. This limitation should definitely be mentioned in the guides.

Finally, I just noticed that https://guides.dataverse.org/en/5.5/user/tabulardataingest/excel.html has very little content (this is the point @amberleahey was making at the top of this issue). If anyone out there would like to volunteer to get a pull request started, I think any rewrite, however minor, would be a huge improvement. It sounds like there's a lot of real-world experience out there!

BPeuch commented 3 years ago

Hello @shlake, do you know if the multiple-tab problem you mentioned has already been documented in a dedicated issue here? I thought it had been but I can't find it.

shlake commented 3 years ago

Hi @BPeuch I do not think that the multiple-tab (multiple-sheet) problem has its own issue.

I added this particular problem after @pdurbin confirmed in the Google Groups: https://groups.google.com/g/dataverse-community/c/XSuWTOK9JW4/m/TgtCQqAMAAAJ?hl=en

dvictori commented 2 years ago

I've just stumbled on this multiple sheet issue. I believe using multiple Excel sheets is a very common practice, and asking researchers to split their files is not an easy task. How about this: during the ingest procedure, if an Excel file has multiple sheets, it is broken into multiple tab files? So myData.xlsx with two sheets would become myData.sheet1.tab and myData.sheet2.tab.
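
To make the proposed naming concrete, here is a small sketch of how the output names could be derived. This only illustrates the suggestion above; it is not how Dataverse ingest works, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def split_tab_names(filename, sheet_count):
    """Derive one .tab name per sheet, following the pattern proposed above."""
    stem = PurePosixPath(filename).stem  # "myData.xlsx" -> "myData"
    return [f"{stem}.sheet{i}.tab" for i in range(1, sheet_count + 1)]


print(split_tab_names("myData.xlsx", 2))
# ['myData.sheet1.tab', 'myData.sheet2.tab']
```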

pdurbin commented 2 years ago

@dvictori I like that idea but of course it needs to go through our design process. And for that it would be nice to have a dedicated issue (a feature request). Do you mind creating one? This issue is more about improving the documentation. (Improvements on https://guides.dataverse.org/en/5.9/user/tabulardataingest/excel.html are still welcome! The source file is at https://github.com/IQSS/dataverse/blob/develop/doc/sphinx-guides/source/user/tabulardataingest/excel.rst .)

adam3smith commented 2 years ago

One thing to think about with the proposal to automatically split XLSX files is that going from one to many files is a significant conceptual departure from how ingest currently works, with some potentially thorny issues, e.g., how to handle API calls and PIDs.

dvictori commented 2 years ago

I just created a feature request. Since I'm just a user, I have no idea how thorny an issue this is. But I do realize it's a big change. So thanks for listening.

pdurbin commented 2 years ago

Just a quick thank you to @kaitlinnewson for the following pull request that doesn't close this issue but greatly improves the docs on Excel:

If anyone wants to work on the docs some more, please pull the latest from the develop branch to pick up these changes.

pdurbin commented 1 year ago

Over at https://groups.google.com/g/dataverse-community/c/aubsLs6RSjQ/m/oG50TR0_BwAJ @amberleahey just linked to some nice (new, I think) documentation about ingest (limitations, common errors) at https://learn.scholarsportal.info/all-guides/borealis/files/#Tabular-Ingest

Here's a screenshot:

[Screenshot, 2022-10-05]

(Note that the screenshot above shows the older style with the warning triangle rather than the kinder, gentler calm blue version you can see in screenshots in pull request #8271, which landed in 5.10.)

Perhaps some of this documentation could be incorporated into the guides themselves, which is what this issue is about. We're open to pull requests! 😄

shlake commented 1 year ago

Adding info to the guides (which I have at UVA) can be done, but users probably don't read all the guides.

This is complex - errors from the ingest are helpful - and in those cases the files don't get ingested.

But then there is the multiple sheet problem where there is no error, but the ingest is not correct - that's what I worry about.

On a "simple" level, ingest of Excel files should check whether there are multiple sheets. If so, the file should not be ingested AND an informative notice, specific to this "multiple sheets" error, should be sent saying that the ingest failed due to multiple sheets.
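
A rough sketch of such a check, written as a standalone Python script rather than as Dataverse code: it reads the sheet list from `xl/workbook.xml` inside the .xlsx archive and returns a notice when more than one sheet is present. The function names and the wording of the notice are made up for illustration, and the namespace assumes an ordinary (transitional) .xlsx file.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by ordinary (transitional) .xlsx files; an assumption here.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def sheet_names(xlsx_path):
    """List the sheet names declared in the workbook."""
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.iter(NS + "sheet")]


def multiple_sheet_notice(xlsx_path):
    """Return a notice if the workbook has more than one sheet, else None."""
    names = sheet_names(xlsx_path)
    if len(names) <= 1:
        return None
    return (
        f"Ingest skipped: this workbook has {len(names)} sheets "
        f"({', '.join(names)}). Only single-sheet workbooks can be ingested."
    )
```

Returning the sheet names in the notice lets the depositor see right away which sheets would otherwise be dropped.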