
OpenRefine is a free, open source power tool for working with messy data and improving it
https://openrefine.org/
BSD 3-Clause "New" or "Revised" License

Enable Chunked Uploads for Wikimedia Commons uploads, enabling >100MB file uploads #4303

Closed: trnstlntk closed this issue 1 month ago

trnstlntk commented 2 years ago

Wikimedia Commons deals with large files in a variety of ways; see https://commons.wikimedia.org/wiki/Commons:Maximum_file_size for some context and pointers.

Some end users of Wikimedia Commons batch upload tools have pointed out that they want to upload batches of larger files (e.g. book scans, video files), where individual files can be 1 to 3 GB or more in size. Such large files are currently not handled well, or straightforwardly, by existing upload tools like UploadWizard or Pattypan. It would be desirable for OpenRefine to handle them better, or at least to communicate clearly which issues exist around such files during upload.

trnstlntk commented 2 years ago

Status update while going through older issues and looking at our current timeline for the Wikimedia-funded project on SDC support.

We are working towards a Minimum Viable Product before the end of June and the end of October 2022. At this point it is still quite unclear to us how OpenRefine will deal with uploading large files to Wikimedia Commons. As I see it now: before the Wikimedia grant concludes (October 2022), I think it is feasible to do some stress testing with uploading large files. However, if we discover that they pose major problems, it's unclear whether we'll be able to address these before October 2022. It's quite possible that additional work will be needed after that.

wetneb commented 2 years ago

This is a follow-up task to #4682.

Abbe98 commented 2 years ago

This likely requires upstream work: the reason UploadWizard, Pattypan, etc. do not handle this very well is that Wikimedia Commons does not report the progress of chunked uploads (T309094), and that it can take several hours for Commons to publish large files.

The option I have considered for Pattypan is to display a warning suggesting that users request server-side uploads or snail-mail a disk of files to Wikimedia.
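
For reference, here is a minimal sketch of the chunked-upload flow as the MediaWiki action API documents it (API:Upload): chunks are POSTed to the stash one at a time, and a final request publishes the stashed file. It assumes `session` is an authenticated `requests.Session` and `csrf_token` a valid CSRF token; the long, progress-less wait described above happens in the final publish step.

```python
# Minimal sketch of MediaWiki's chunked-upload flow (see
# https://www.mediawiki.org/wiki/API:Upload). Assumes `session` is an
# authenticated requests.Session and `csrf_token` a valid CSRF token.
import os

API = "https://commons.wikimedia.org/w/api.php"
CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB per chunk

def chunked_upload(session, csrf_token, path, filename):
    filesize = os.path.getsize(path)
    filekey, offset = None, 0
    with open(path, "rb") as f:
        while offset < filesize:
            chunk = f.read(CHUNK_SIZE)
            data = {
                "action": "upload", "format": "json", "stash": 1,
                "filename": filename, "filesize": filesize,
                "offset": offset, "token": csrf_token,
            }
            if filekey:  # every chunk after the first references the stash entry
                data["filekey"] = filekey
            resp = session.post(API, data=data,
                                files={"chunk": (filename, chunk)}).json()
            filekey = resp["upload"]["filekey"]
            offset += len(chunk)
    # Publish the stashed file. For multi-GB files this step can take hours
    # on Commons, and per T309094 the API reports no progress for it.
    return session.post(API, data={
        "action": "upload", "format": "json", "filename": filename,
        "filekey": filekey, "comment": "Chunked upload",
        "token": csrf_token,
    }).json()
```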

trnstlntk commented 2 years ago

@Vesihiisi did some uploads of TIFF files with OpenRefine some time ago and was unsuccessful, although they could upload the same files with UploadWizard. Soon (August-September 2022?) they will have more files to test with, if we want to work on this. It may involve doing something with Chunked Uploads, if I understand it correctly, but I encourage others to chime in with more feedback and pointers.

trnstlntk commented 2 years ago

For discoverability purposes, this is the Wikimedia Phabricator issue about the poor reporting of chunked uploads that @Abbe98 refers to above (not sure if this is a blocker): T309094, "Add status / progress information in publishing stage of chunked uploads (api, json)".

trnstlntk commented 1 year ago

I have updated this issue's description for clarity, after a Wikimedia community member indicated they were unable to upload a set of files larger than 100 MB and we identified the missing piece: chunked uploads.

I'm generally wondering: does this issue (limit of 100MB uploads) also affect file upload to Wikibases?

wetneb commented 1 year ago

> I'm generally wondering: does this issue (limit of 100MB uploads) also affect file upload to Wikibases?

This is likely, but probably depends on a configuration parameter in the MediaWiki instance. People who run other Wikibases might be able to raise that parameter to let OpenRefine upload larger files.
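
For anyone who wants to check what a given wiki allows: MediaWiki exposes the effective value of that parameter ($wgMaxUploadSize) through the siteinfo API, so a quick query shows the limit without guessing. A sketch:

```python
# Query a MediaWiki instance's upload limits (in bytes). The effective value
# of $wgMaxUploadSize is exposed via meta=siteinfo.
import requests

def upload_limits(api_url):
    resp = requests.get(api_url, params={
        "action": "query", "meta": "siteinfo",
        "siprop": "general", "format": "json",
    }).json()
    general = resp["query"]["general"]
    return general.get("maxuploadsize"), general.get("minuploadchunksize")

print(upload_limits("https://commons.wikimedia.org/w/api.php"))
# e.g. (4294967296, 1024) on Commons, i.e. a 4 GiB cap
```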

Abbe98 commented 1 year ago

> For discoverability purposes, this is the Wikimedia Phabricator issue about the poor reporting of chunked uploads that @Abbe98 refers to above (not sure if this is a blocker): T309094, "Add status / progress information in publishing stage of chunked uploads (api, json)".

It shouldn't be a blocker, but users will end up reporting problems with files not being uploaded (as publishing can occur much later than a successful upload).

trnstlntk commented 10 months ago

@Vesihiisi made a very interesting discovery: a user has succeeded in uploading files much larger than 100 MB. Example: https://commons.wikimedia.org/wiki/File:Cappella_di_Maia_CHA2080-Modifica.tif 🤯

Could this be because the user did an upload from URL?

From https://phabricator.wikimedia.org/T255361#6221949:

> Chunked upload is for doing uploads in "chunks" (parts) from the local machine, using javascript to the MW API.
>
> This doesn't work for the upload by url, which is a direct request from Wikimedia servers to download the file, done server side, not from the client side.

wetneb commented 10 months ago

Yes, that's a good point: the chunked upload mechanism is for uploading local files. I don't know if there is any size limit on uploads via URL.
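
For comparison, upload-by-URL goes through the same action=upload endpoint but passes a url parameter instead of file data, so the fetch happens server side; on Wikimedia wikis it additionally requires the upload_by_url right and an allowed source domain. A rough sketch, reusing the session/token assumptions from the chunked-upload example above:

```python
# Sketch of upload-by-URL: Commons fetches the file itself, so no client-side
# chunking is involved. Requires the upload_by_url right on Wikimedia wikis.
def upload_by_url(session, csrf_token, source_url, filename):
    return session.post("https://commons.wikimedia.org/w/api.php", data={
        "action": "upload", "format": "json",
        "filename": filename,
        "url": source_url,  # downloaded server side by the wiki
        "token": csrf_token,
    }).json()
```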

sebastian-berlin-wmse commented 6 months ago

I've started working on this as part of the Wikimedia-OpenRefine training and sustainability project, 2023-24.

sebastian-berlin-wmse commented 5 months ago

After a recent chat with @wetneb, we decided that the best way to split files is to create a temporary file for each chunk. It should be possible to keep the chunk data in memory, but that would require changes to a library and is probably not worth it at this point.

We also talked a bit about config variables. A few will need to be added, e.g. chunk size. Initially these should live in the manifest, since they're not something the end user should typically change.
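
A sketch of the splitting approach described above, with the chunk size as the kind of parameter that would come from the manifest (the name `chunk_size` here is illustrative, not the actual setting):

```python
# Stream the source file and write each chunk to its own temporary file, so
# chunk data is never held in memory. `chunk_size` stands in for the
# manifest-provided setting.
import tempfile

def split_into_chunk_files(path, chunk_size=10 * 1024 * 1024):
    chunk_paths = []
    with open(path, "rb") as src:
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".chunk")
            tmp.write(data)
            tmp.close()
            chunk_paths.append(tmp.name)
    return chunk_paths  # caller uploads each file, then deletes it
```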

sebastian-berlin-wmse commented 4 months ago

We now have the first successful chunked upload to Commons using OpenRefine. @Vesihiisi uploaded File:Norrland_naturbeskrifning.pdf using an in-development version that I made. There are still several things left to do before it's ready for a release, but at least it proves that the uploading itself works.

sebastian-berlin-wmse commented 3 months ago

I added some logging to see what takes the most time during a chunked upload. These times may not reflect what it's like on Commons, since I'm using a local MediaWiki instance. This is with a 1.1 GB file uploading in 10 MB chunks.
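
The kind of instrumentation meant here is simple wall-clock logging around each phase; something like the following, where `upload_chunk` and `publish` are hypothetical stand-ins for the actual upload calls:

```python
# Log how long each phase of a chunked upload takes. `upload_chunk` and
# `publish` are hypothetical placeholders for the real upload calls.
import logging
import time

logging.basicConfig(level=logging.INFO)

def timed(label, fn, *args, **kwargs):
    start = time.monotonic()
    result = fn(*args, **kwargs)
    logging.info("%s took %.2f s", label, time.monotonic() - start)
    return result

# Usage:
# for i, chunk in enumerate(chunks):
#     filekey = timed(f"chunk {i}", upload_chunk, chunk, filekey)
# timed("publish", publish, filekey)
```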