gdcc / dataverse-previewers

A collection of Datafile Previewers that can be configured to work with Dataverse
MIT License

Spreadsheet viewer has trouble displaying large tabular files #20

Open jggautier opened 1 year ago

jggautier commented 1 year ago

A depositor reported last week that the spreadsheet viewer is having trouble viewing the CSV file they uploaded to the Harvard Dataverse Repository.

Because the file is not published, I can't share it publicly, but the depositor said I could share it privately with any colleagues who want to do more digging. In the meantime, the depositor wrote that they'll add a note in the dataset or file metadata to explain the situation with the file previewer.

The file is 17.4 MB, with 10 columns and 134 rows. The cells in one of the columns have a lot of text. Once the spreadsheet viewer is able to load the preview, it doesn't display all of the columns right away, and there's no indication that the viewer is still trying to load parts of the file. This made the depositor think that the viewer would never display all of the columns.

Questions

How quickly the viewer can show the entire tabular file depends at least partly on the user's internet speed and/or computer. Are those the two factors?

Recommendations

claudiodsf commented 1 year ago

Hi, I was going to post on this same problem today, when I saw this new issue 🙃

We have the same problem on a not-yet-published dataset, which I cannot share, but I found an example on Harvard Dataverse (89.5 MB - 145 Variables, 56200 Observations).

https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/D1N0GO/3NK9D8

I agree with the proposal that there should be a limit on file size (in bytes or number of observations) that the admin could configure at install time.

pdurbin commented 1 year ago

It's a somewhat longstanding problem so thank you to @jggautier and @claudiodsf for getting the discussion going here. 😄

My first thought is that the next version of Dataverse (5.13 probably) will include a new feature for the external tools framework whereby tools can express "requirements" that they need to operate. Here's an example...

  "requirements": {
    "auxFilesExist": [
      {
        "formatTag": "NcML",
        "formatVersion": "0.1"
      }
    ]
  }

... from this pull request:

What's going on here is that the NcML preview tool has a requirement that a certain auxiliary file be present for the eyeball to show up (to offer a preview, that is).

Perhaps, as @jggautier suggested with "let installations set a byte size limit specifically for the spreadsheet viewer", each tool could express a size limit, something like this:

  "requirements": {
    "sizeLimitInBytes": 8388608
  }

The idea would be to simply not show the eyeball for large files.
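To make that concrete, here is a rough sketch of the check in Python. It is only an illustration, not how Dataverse evaluates requirements: `sizeLimitInBytes` is the proposed key from above (it does not exist yet), `should_offer_preview` is a made-up helper name, and showing the preview when no limit is set is an assumed default.

```python
import json

# Hypothetical manifest fragment using the proposed (not yet existing) key.
# 8388608 bytes = 8 * 1024 * 1024, i.e. 8 MiB.
MANIFEST = json.loads('{"requirements": {"sizeLimitInBytes": 8388608}}')


def should_offer_preview(file_size_bytes: int, manifest: dict) -> bool:
    """Return True if the preview button (the "eyeball") should be shown."""
    limit = manifest.get("requirements", {}).get("sizeLimitInBytes")
    # Assumed default: with no limit configured, always offer the preview.
    if limit is None:
        return True
    return file_size_bytes <= limit
```

With an 8 MiB limit, both files mentioned in this issue (17.4 MB and 89.5 MB) would get no eyeball, while small files would be unaffected.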

We could get fancier, of course, as suggested above (preview only some rows) and maybe the logic should be in the spreadsheet viewer itself, but I thought I'd at least mention this new "requirements" feature.
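For the "preview only some rows" idea, a minimal sketch in Python might look like the following. The spreadsheet viewer itself is not written this way; `preview_rows` and `max_rows` are hypothetical names, and the point is only to show reading a capped number of rows and flagging that more exist.

```python
import csv
import io
from itertools import islice


def preview_rows(text_stream, max_rows=100):
    """Return the header, at most max_rows data rows, and a truncation flag."""
    reader = csv.reader(text_stream)
    header = next(reader, [])
    rows = list(islice(reader, max_rows))
    # If another row remains, the preview is partial and the UI could say so.
    truncated = next(reader, None) is not None
    return header, rows, truncated


sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
header, rows, truncated = preview_rows(sample, max_rows=2)
```

A row cap alone would not help with the file from the original report, which has only 134 rows but is 17.4 MB because of one text-heavy column, so a byte limit would probably still be needed alongside it.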

For now, docs are here (look for "requirements"): http://preview.guides.gdcc.io/en/develop/api/external-tools.html

It was added in this PR: