datopian / ckanext-blob-storage

CKAN extension to offload blob storage to cloud storage providers (S3, GCS, Azure etc).
http://tech.datopian.com/blob-storage/
MIT License
14 stars 4 forks source link

[upload] browser crashes while trying to upload large file (~600Mb) #36

Closed mbeilin closed 3 years ago

mbeilin commented 3 years ago

Browser crashes while trying to upload large CSV file (~600Mb) for the resource in NHS Staging (https://ckan.nhs.staging.datopian.com/)

Please see the screencast

Acceptance

Analysis

@shevron as per our discussion previously today, and from I checked after, it looks like the problem here is that we are sending the actual file through Giftless and not directly into the cloud storage (GoogleCloud specifically here). So this approach is working fine with the small files (up to ~100Mb), but not with the big ones.

rufuspollock commented 3 years ago

@Daniellappv @mbeilin i really doubt this is related to anything go through giftless - there is nothing in that system to pass through data. It is much more likely to be a browser issue e.g. the browser struggling to hash the file ...

shevron commented 3 years ago

The site you pointed to (nhs) is configured to upload directly to Google Cloud. Giftless is only used to get the signed URL, and CKAN is not a part of the process at all. This is a client side issue, where something the JS code does (probably either hashing the file or inspecting it for schema?) is causing a browser crash.

shevron commented 3 years ago

Def. a client side issue - notice the "Uncaught (in promise) out of memory" error:

Screenshot from 2020-10-08 20-27-06

(the 1st error seems to be logged but ignored, and most likely unrelated / separate issue).

This when trying to upload a 1gb file on Firefox / Linux locally. It happens before Giftless is even hit.

Suggest moving this issue to datapub / ckan3-js-sdk or at least opening an issue there that references this one. /cc @mariorodeghiero

shevron commented 3 years ago

Just from taking a quick look at the ckan3-js-sdk code I see https://github.com/datopian/ckan-client-js/blob/master/lib/file.js#L56 is loading the entire file contents to memory. I don't know if this is where we crash but it must be something like this. A big no-no if we need to support large files.

mariorodeghiero commented 3 years ago

Yes, probably the root cause is because of the hash, but the SDK is using data.js now https://github.com/datopian/data.js/blob/master/src/index.js#L191

mariorodeghiero commented 3 years ago

The hash issue should be fixed in this PR https://github.com/datopian/data.js/pull/90 in data.js

Note: I'm not sure if the rest of the data.js methods likefile.rows() and file.addSchema() will work for large files. So if you got a different error please use the data.js repository.

I am closing the hash issue here @mbeilin if you got an error about generate hash please use this https://github.com/datopian/data.js/issues/89

cc: @shevron