🚧 Chunked uploads handling (#80)

pvanliefland commented 4 years ago

This PR addresses #80

This implementation uses contextvars to store the chunks in-memory until the upload of the last chunk (at this point we send the file to S3 or GCS).

Note: using contextvars requires Python 3.7.

It should work for both S3 and GCS.

Todo:

[x] Add chunks utility module
[x] Adapt save() in generic manager
[x] Periodic cleanup of the chunks stored in-memory
[x] Tests
[x] Import contextvars in try/except and disable chunks for Python < 3.7

pvanliefland commented 4 years ago

Hey @ericdill, obviously a work in progress, but wanted to create the PR so you can have a look.

Couldn't find a way to run the tests on my laptop so the PR will help with that too.

ericdill commented 4 years ago

Couple of things that are failing.

Tests should be updated to be only 3.7 or newer.
Notebook dependency 4.* should be dropped. This is where your ImportError is coming from

you can update the test matrix in .github/workflows/test.yml

pvanliefland commented 4 years ago

Ok, so a bit ashamed but as I couldn't find a way to run the tests locally, I use the Github actions to make sure that the tests pass.

I will of course rebase those ugly commits.

I think it is ready for another review @ericdill - I have added some tests and a basic check for staled chunked uploads.

The lint fails, I don't understand the issue with isort, on my local machine it does not generate any error...

In the meantime I will test it using a real Jupyter environment.

ericdill commented 4 years ago

Ok I'll have a look later today.

To run tests locally, I think you need to do the following:

make env
conda activate s3contents
In one terminal, run make minio (after conda activate s3contents)
In another, run make test-all (after conda activate s3contents

Yeah I'm not sure what's up with the lint. Master branch has an error that I can't reproduce locally. Might be a different python version locally vs GH actions. I'll poke at that later today.

pvanliefland commented 4 years ago

Ok, tested the fork in a "real" project and it seems to work

ericdill commented 4 years ago

Hi @pvanliefland @danielfrg,

Regarding backwards compat for python 3.6 specifically, I'm torn. On the one hand, supporting more python versions would be good. On the other, silent failures on file uploads is bad.

Is silent failures on uploads bad enough that we want to drop python 3.6 compatibility? Or should we maintain python 3.6 compat with the understanding that file uploads is broken in py3.6?

edit: Does this block show an error to the user if they try to upload a file?

ericdill commented 4 years ago

Once this PR is in, you might consider submitting a PR to the chunked-saving part of the notebook docs that references this as an example implementation of chunked saving.

danielfrg commented 4 years ago

Is silent failures on uploads bad enough that we want to drop python 3.6 compatibility?

I am also torn but I think that is bad enough, at least on my book. I am also in general happy to just drop things without thinking much about it lol, but I can be convinced otherwise

ericdill commented 4 years ago

@danielfrg is might not be a silent failure? Looks like this code path will 500. I wonder if that's seen by the user.

        if not CHUNKS_HANDLING:
            self.log.error(
                "S3contents.GenericManager._save_large_file: not available with Python < 3.7"
            )
            self.do_error("Chunked file upload in S3/GCS requires Python >= 3.7", 500)

danielfrg commented 4 years ago

Ah, i see. If we log it and show it to the notebook UI then I think we should keep Python 3.6.

pvanliefland commented 4 years ago

@ericdill @danielfrg unfortunately, I've tested how errors are handled and it seems that although they appear in the console, they are not shown displayed to the user.

It means that in the context of chunked uploads, the file won't be uploaded at all (compared to now, where a truncated version of the file is uploaded), no message will be displayed, but the browser console prints a meaningful message.

Let me know how you want to proceed.

danielfrg commented 4 years ago

Thanks for all the work on this!

In that case my vote would be to drop python 3.6. If you agree @ericdill then we can just do that.

ericdill commented 4 years ago

yea, agreed. let's say farewell to python 3.6. continued silent failures on uploads is not a great user experience

ericdill commented 4 years ago

I just tried to use your branch and things worked as expected 🎉 when the folder in the contents manager was a path to an s3 bucket. When I tried to upload to a directory backed by a FileContentsManager, the same behavior happens (truncated uploads). I suspect we should add some guidance to the README here in this repo that you should consider using the LargeFileManager instead of the FileContentsManager if you're having trouble with uploads.

pvanliefland commented 4 years ago

@ericdill @danielfrg Ok, this should be ready to merged.

Latest changes:

Dropped Python 3.6 compat
Updated README so that it uses LargeFileManager instead of FileContentsManager (LargeFileManager has been the default manager in Jupyter notebook for a while now)
Rebased all commits into one

ericdill commented 4 years ago

Thanks for the pr @pvanliefland ! Going to assume @danielfrg is ok with this :) We can always fix any issues @danielfrg has in a follow-on PR.

pvanliefland commented 4 years ago

Cool!

danielfrg / s3contents

🚧 Chunked uploads handling (#80) #99