danielfrg / s3contents

Jupyter Notebooks in S3 - Jupyter Contents Manager implementation
Apache License 2.0

jupyter upload file truncation #80

Closed: Seanspt closed this issue 4 years ago

Seanspt commented 5 years ago

When uploading a file via the Jupyter web UI, the file seems to be truncated. The original file is 29 MB; however, when the upload finished with no error, the file we got was only 5.3 MB.

Seanspt commented 5 years ago

In the notebook upload handler, the file is split into chunks and each chunk is uploaded separately.

    @web.authenticated
    @gen.coroutine
    def put(self, path=''):
        """Saves the file in the location specified by name and path.

        PUT is very similar to POST, but the requester specifies the name,
        whereas with POST, the server picks the name.

        PUT /api/contents/path/Name.ipynb
          Save notebook at ``path/Name.ipynb``. Notebook structure is specified
          in `content` key of JSON request body. If content is not specified,
          create a new empty notebook.
        """
        model = self.get_json_body()
        if model:
            if model.get('copy_from'):
                raise web.HTTPError(400, "Cannot copy with PUT, only POST")
            exists = yield maybe_future(self.contents_manager.file_exists(path))
            if exists:
                yield maybe_future(self._save(model, path))
            else:
                yield maybe_future(self._upload(model, path))
        else:
            yield maybe_future(self._new_untitled(path))

However, in s3contents the file is opened with 'wb' for every chunk, so each chunk overwrites the previous one.

    def write(self, path, content, format):
        path_ = self.path(self.unprefix(path))
        self.log.debug("S3contents.S3FS: Writing file: `%s`", path_)
        if format not in {'text', 'base64'}:
            raise HTTPError(
                400,
                "Must specify format of file contents as 'text' or 'base64'",
            )
        try:
            if format == 'text':
                content_ = content.encode('utf8')
            else:
                b64_bytes = content.encode('ascii')
                content_ = base64.b64decode(b64_bytes)
        except Exception as e:
            raise HTTPError(
                400, u'Encoding error saving %s: %s' % (path_, e)
            )
        with self.fs.open(path_, mode='wb') as f:
            f.write(content_)

Trying to fix this.
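To make the failure mode concrete: every chunk ends up in write() above with mode='wb', so each chunk overwrites the one before it and only the data from the final chunk survives. A minimal, hypothetical illustration using plain local files (not s3contents code):

# Hypothetical illustration: reopening the destination with mode="wb" for
# every chunk keeps only the last chunk, mirroring the truncation seen here.
chunks = [b"a" * 1024, b"b" * 1024, b"c" * 1024]

for chunk in chunks:
    with open("/tmp/upload-demo.bin", "wb") as f:  # "wb" truncates on each open
        f.write(chunk)

with open("/tmp/upload-demo.bin", "rb") as f:
    assert f.read() == chunks[-1]  # only the final chunk remains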

zac-yang commented 4 years ago

We've encountered the same issue. This could be a good reference for a fix: https://github.com/jupyter/notebook/blob/master/notebook/services/contents/largefilemanager.py

rhlarora84 commented 4 years ago

Is there a fix in progress for this issue, or a potential workaround? Thanks!

rhlarora84 commented 4 years ago

Looking at LargeFileManager, the file is provided in chunks if model['chunk'] is present; the last segment is sent with a chunk value of -1.

One potential solution could be to start a transaction in s3fs when chunk == 1 and end the transaction when chunk is -1.

Another potential way could be to save the file locally using LargeFileManager and then upload it to S3. Here is a crude way to test it out:

# Assumes the notebook package; the GenericContentsManager import path may
# differ depending on the s3contents version in use.
from notebook.services.contents.largefilemanager import LargeFileManager
from s3contents.genericmanager import GenericContentsManager

def _save_file(self, model, path):
    file_contents = model["content"]
    file_format = model.get("format")
    chunk = model.get("chunk", None)
    large_file_manager = LargeFileManager()
    if chunk is not None:
        # Let LargeFileManager accumulate the chunks on local disk
        large_file_manager.save(model, path)
        if chunk == -1:
            # Last chunk: read the assembled file back and write it to S3
            updated_model = large_file_manager.get(path)
            self.fs.write(path, updated_model["content"], updated_model.get("format"))
            large_file_manager.delete_file(path)
    else:
        self.fs.write(path, file_contents, file_format)

# Patch the s3contents manager with the helper above
GenericContentsManager._save_file = _save_file

There is also a put method in s3fs that would upload the local file to S3. I am sure a clean fix would require changes to GenericContentsManager and S3FS in the package.

def _save_file(self, model, path):
    file_contents = model["content"]
    file_format = model.get("format")
    chunk = model.get("chunk", None)
    large_file_manager = LargeFileManager()
    if chunk is not None:
        # Accumulate chunks on local disk via LargeFileManager
        large_file_manager.save(model, path)
        if chunk == -1:
            # Last chunk: upload the locally assembled file directly to S3
            os_path = large_file_manager._get_os_path(path)
            destination_path = self.fs.path(self.fs.unprefix(path))
            self.fs.fs.put(os_path, destination_path)
            large_file_manager.delete_file(path)
    else:
        self.fs.write(path, file_contents, file_format)

pvanliefland commented 4 years ago

I have a working "in-memory" solution using ContextVar. @danielfrg @ericdill would you be interested in a PR?

The drawbacks:

My first attempt was to use merge in s3fs, but it does not work: S3 does not accept multipart uploads with parts smaller than 5 MB (and JupyterLab is configured to send 1 MB chunks).
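
For context on that constraint: S3's multipart API rejects any part smaller than 5 MB unless it is the final part, so 1 MB chunks cannot be stitched together server-side. A rough boto3 sketch (hypothetical bucket and key names) of the rejection:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "uploads/big-file.bin"  # hypothetical names

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
try:
    # Two 1 MB parts: the non-final part violates S3's 5 MB minimum, so
    # completing the multipart upload fails with an EntityTooSmall error.
    for number in (1, 2):
        part = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
            PartNumber=number, Body=b"x" * (1024 * 1024),
        )
        parts.append({"ETag": part["ETag"], "PartNumber": number})
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": parts},
    )
except Exception:
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"])
    raise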

ericdill commented 4 years ago

Hi @pvanliefland, I'd happily review a PR from you that fixes this problem. Regarding your two points, (1) Making this compatible with py37 or newer does not pose a problem for me and (2) can you elaborate a little? Is the concern that the upload would fail only after the user waited for a potentially long time to upload a file?

pvanliefland commented 4 years ago

Hey @ericdill, for 2), to give you a better idea, here is a simplified version of my code. If it makes sense to integrate it here, I'll work on a proper PR.

Let me know!

import base64
import contextvars

# Base classes used by GenericContentsManager below (assumed to come from the
# notebook and traitlets packages, as in s3contents)
from notebook.services.contents.manager import ContentsManager
from traitlets import HasTraits

# Used as an in-memory "registry" for uploads.
# TODO: periodic cleanup - but when ? As is, could cause memory issues
content_chunks = contextvars.ContextVar("jupyterlab_content_chunks", default={})

def store_content_chunk(path: str, content: str):
    """Store a base64 chunk in the registry as bytes"""

    current_value = content_chunks.get()

    if path not in current_value:
        current_value[path] = []

    current_value[path].append(base64.b64decode(content.encode("ascii"), validate=True))

def assemble_chunks(path: str) -> str:
    """Assemble the chunk bytes into a single base64 string"""

    current_value = content_chunks.get()

    if path not in current_value:
        raise ValueError(f"No chunk for path {path}")

    return base64.b64encode(b"".join(current_value[path])).decode("ascii")

def delete_chunks(path):
    """Should be called once the upload is complete to free the memory"""

    current_value = content_chunks.get()
    del current_value[path]

class GenericContentsManager(ContentsManager, HasTraits):
    def save(self, model, path=""):
        """Code inspired from notebook.services.contents.largefilemanager"""

        chunk = model.get("chunk", None)  # this is how the client sends it
        if chunk is not None:
            # some checks / try&except / logs skipped for readability

            if chunk == 1:
                self.run_pre_save_hook(model=model, path=path)
            # Store the chunk in our "in-memory" registry
            store_content_chunk(path, model["content"])

            if chunk == -1:
                # Last chunk: we want to combine the chunks in the registry to compose the full file content
                model["content"] = assemble_chunks(path)
                delete_chunks(path)
                self._save_file(model, path)

            return self.get(path, content=False)
        else:
            return super().save(model, path)
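
As a quick, standalone sanity check (not part of the actual change), the registry helpers above can be exercised on their own; this hypothetical snippet simulates the chunk sequence the frontend sends (1, 2, ..., then -1 for the final piece) and verifies that reassembly round-trips:

# Hypothetical round-trip check for the helpers above; no S3 access needed.
original = b"A" * 10 + b"B" * 10 + b"C" * 10
pieces = [
    base64.b64encode(original[i:i + 10]).decode("ascii") for i in range(0, 30, 10)
]

for piece in pieces:
    store_content_chunk("data/big.bin", piece)

assert base64.b64decode(assemble_chunks("data/big.bin")) == original
delete_chunks("data/big.bin")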

ericdill commented 4 years ago

Hey @pvanliefland, this looks good to me. Certainly better than the current situation where large file uploads just totally fail. I think it makes sense to discuss the remaining concerns in the PR.

  1. "periodic cleanup - but when ?": You raise a good point about memory leakage. Do any of the other content managers solve this problem?
pvanliefland commented 4 years ago

@ericdill ok, I'll start working on a PR.

For cleanup of the ContextVar, I couldn't really find a good example elsewhere: I think that none of the custom ContentsManager implementations handle chunked uploads properly.

As for the standard LargeFileManager, it doesn't run into the same potential issue: it creates a new file for chunk 1, and simply opens this file in append mode whenever a new chunk is posted.
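
For comparison, a simplified sketch of that append-based pattern (not the upstream LargeFileManager code verbatim): chunk 1 creates or truncates the file, and every subsequent chunk, including the final one sent as -1, is appended, so nothing needs to be held in memory.

def save_chunk_to_disk(os_path, chunk, data: bytes):
    # Hypothetical helper illustrating the local-disk approach: chunk 1 starts
    # a fresh file, later chunks (including the final -1 chunk) append to it.
    mode = "wb" if chunk == 1 else "ab"
    with open(os_path, mode) as f:
        f.write(data)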

At this point I'm thinking of a couple of options.

I'll probably try both approaches.

ericdill commented 4 years ago

sounds great, thanks for tackling this!

ericdill commented 4 years ago

Hi @pvanliefland, how are things going? Want to open a PR with these changes so far? I'd be happy to help test against my JupyterHub instance.

pvanliefland commented 4 years ago

Hey @ericdill, thanks for the ping. I'll schedule some time for this tomorrow.

ericdill commented 4 years ago

Closed by #99