In the notebook upload, the file is split into chunks and each chunk is uploaded separately.
@web.authenticated
@gen.coroutine
def put(self, path=''):
    """Saves the file in the location specified by name and path.
    PUT is very similar to POST, but the requester specifies the name,
    whereas with POST, the server picks the name.
    PUT /api/contents/path/Name.ipynb
    Save notebook at ``path/Name.ipynb``. Notebook structure is specified
    in `content` key of JSON request body. If content is not specified,
    create a new empty notebook.
    """
    model = self.get_json_body()
    if model:
        if model.get('copy_from'):
            raise web.HTTPError(400, "Cannot copy with PUT, only POST")
        exists = yield maybe_future(self.contents_manager.file_exists(path))
        if exists:
            yield maybe_future(self._save(model, path))
        else:
            yield maybe_future(self._upload(model, path))
    else:
        yield maybe_future(self._new_untitled(path))
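For context, the client sends one PUT per piece of the file, and each JSON body carries a chunk counter, with -1 marking the final piece. The payloads below are only illustrative (the values are made up, and the exact set of keys may vary):

# Illustrative request bodies for a three-part chunked upload (values are made up).
chunked_put_bodies = [
    {"type": "file", "format": "base64", "chunk": 1, "content": "<base64 piece 1>"},
    {"type": "file", "format": "base64", "chunk": 2, "content": "<base64 piece 2>"},
    {"type": "file", "format": "base64", "chunk": -1, "content": "<base64 final piece>"},
]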
However, in s3contents the file is opened with 'wb' for every chunk, so each chunk overwrites the previous one:
def write(self, path, content, format):
    path_ = self.path(self.unprefix(path))
    self.log.debug("S3contents.S3FS: Writing file: `%s`", path_)
    if format not in {'text', 'base64'}:
        raise HTTPError(
            400,
            "Must specify format of file contents as 'text' or 'base64'",
        )
    try:
        if format == 'text':
            content_ = content.encode('utf8')
        else:
            b64_bytes = content.encode('ascii')
            content_ = base64.b64decode(b64_bytes)
    except Exception as e:
        raise HTTPError(
            400, u'Encoding error saving %s: %s' % (path_, e)
        )
    with self.fs.open(path_, mode='wb') as f:
        f.write(content_)
Trying to fix this.
We've encountered the same issue. This could be a good reference for a fix: https://github.com/jupyter/notebook/blob/master/notebook/services/contents/largefilemanager.py
Is there a fix in progress for this issue, or a potential workaround? Thanks!
Looking at the LargeFileManager, the file is provided in chunks when model['chunk'] is present; the last segment arrives with a chunk value of -1.
One potential solution could be to start a transaction in s3fs when chunk == 1 and end the transaction when the chunk is -1 (see the sketch below).
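As a rough illustration of the "transaction" idea, S3 itself exposes this pattern as a multipart upload. The sketch below uses boto3 directly; the bucket, key and helper names are placeholder assumptions, and note the caveat raised later in this thread (S3 rejects non-final parts smaller than 5MB):

import boto3

# Hypothetical sketch: one multipart upload per chunked Jupyter upload.
# Bucket and key names are placeholders, not taken from s3contents.
s3 = boto3.client("s3")
bucket, key = "my-bucket", "notebooks/big-file.ipynb"

# chunk == 1: "start the transaction"
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []

def upload_chunk(part_number, data: bytes):
    # every chunk: upload it as one part of the multipart upload
    resp = s3.upload_part(
        Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
        PartNumber=part_number, Body=data,
    )
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

def finish_upload():
    # chunk == -1: "end the transaction"
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": parts},
    )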
Another potential approach is to save the file locally using the LargeFileManager and then upload it to S3. Here is a crude way to test it out:
from notebook.services.contents.largefilemanager import LargeFileManager
from s3contents.genericmanager import GenericContentsManager

def _save_file(self, model, path):
    file_contents = model["content"]
    file_format = model.get("format")
    chunk = model.get('chunk', None)
    large_file_manager = LargeFileManager()
    if chunk is not None:
        # Let LargeFileManager append the chunk to a local file
        large_file_manager.save(model, path)
        if chunk == -1:
            # Last chunk: read the assembled local file and push it to S3
            updated_model = large_file_manager.get(path)
            self.fs.write(path, updated_model['content'], updated_model.get("format"))
            large_file_manager.delete_file(path)
    else:
        self.fs.write(path, file_contents, file_format)

# Monkey-patch the manager so the chunk-aware version is used
GenericContentsManager._save_file = _save_file
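For completeness, a patch like this only takes effect if s3contents is the active contents manager. A typical jupyter_notebook_config.py looks roughly like the following (the bucket name is a placeholder and option names may differ slightly between s3contents versions):

# jupyter_notebook_config.py (sketch)
from s3contents import S3ContentsManager

c = get_config()
c.NotebookApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.bucket = "my-notebook-bucket"  # placeholder bucket name
# Credentials are usually picked up from the environment or an IAM role.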
There is also a put method in s3fs that would upload the local file to S3. I am sure that a clean fix would require changes to GenericContentsManager and S3FS in the package.
def _save_file(self, model, path):
    file_contents = model["content"]
    file_format = model.get("format")
    chunk = model.get('chunk', None)
    large_file_manager = LargeFileManager()
    if chunk is not None:
        # Append the chunk to a local file first
        large_file_manager.save(model, path)
        if chunk == -1:
            # Last chunk: upload the assembled local file with s3fs's put()
            os_path = large_file_manager._get_os_path(path)
            destination_path = self.fs.path(self.fs.unprefix(path))
            self.fs.fs.put(os_path, destination_path)
            large_file_manager.delete_file(path)
    else:
        self.fs.write(path, file_contents, file_format)
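For reference, the put call used above is s3fs's plain "upload a local file" helper; standalone it looks like this (the bucket and paths are made-up placeholders):

import s3fs

# Hypothetical standalone example of the same call
fs = s3fs.S3FileSystem()
fs.put("/tmp/assembled-upload.ipynb", "my-bucket/notebooks/assembled-upload.ipynb")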
I have a working "in-memory" solution using ContextVar. @danielfrg @ericdill would you be interested in a PR?
The drawbacks:
1. the contextvars module is part of Python 3.7+
2. my first attempt was to use merge in s3fs, but it does not work, as S3 does not accept multipart uploads with parts smaller than 5MB (and JupyterLab is configured to send 1MB chunks)
Hi @pvanliefland, I'd happily review a PR from you that fixes this problem. Regarding your two points, (1) Making this compatible with py37 or newer does not pose a problem for me and (2) can you elaborate a little? Is the concern that the upload would fail only after the user waited for a potentially long time to upload a file?
Hey @ericdill, for (2), to give you a better idea, here is a simplified version of my code. If it makes sense to integrate it here, I'll work on a proper PR.
Let me know!
import base64
import contextvars

from notebook.services.contents.manager import ContentsManager
from traitlets import HasTraits

# Used as an in-memory "registry" for uploads.
# TODO: periodic cleanup - but when? As is, could cause memory issues
content_chunks = contextvars.ContextVar("jupyterlab_content_chunks", default={})


def store_content_chunk(path: str, content: str):
    """Store a base64 chunk in the registry as bytes"""
    current_value = content_chunks.get()
    if path not in current_value:
        current_value[path] = []
    current_value[path].append(base64.b64decode(content.encode("ascii"), validate=True))


def assemble_chunks(path: str) -> str:
    """Assemble the chunk bytes into a single base64 string"""
    current_value = content_chunks.get()
    if path not in current_value:
        raise ValueError(f"No chunk for path {path}")
    return base64.b64encode(b"".join(current_value[path])).decode("ascii")


def delete_chunks(path):
    """Should be called once the upload is complete to free the memory"""
    current_value = content_chunks.get()
    del current_value[path]


class GenericContentsManager(ContentsManager, HasTraits):

    def save(self, model, path=""):
        """Code inspired from notebook.services.contents.largefilemanager"""
        chunk = model.get("chunk", None)  # this is how the client sends it
        if chunk is not None:
            # some checks / try&except / logs skipped for readability
            if chunk == 1:
                self.run_pre_save_hook(model=model, path=path)
            # Store the chunk in our "in-memory" registry
            store_content_chunk(path, model["content"])
            if chunk == -1:
                # Last chunk: we want to combine the chunks in the registry to compose the full file content
                model["content"] = assemble_chunks(path)
                delete_chunks(path)
                self._save_file(model, path)
            return self.get(path, content=False)
        else:
            return super().save(model, path)
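A quick way to sanity-check the helper functions above outside of Jupyter (the sample data, chunk split and path are made up for the example):

import base64

# Fake a three-chunk upload of some bytes and check the round trip.
original = b"x" * 2500
chunks = [original[0:1000], original[1000:2000], original[2000:]]

path = "test/upload.bin"  # hypothetical notebook path
for chunk in chunks:
    store_content_chunk(path, base64.b64encode(chunk).decode("ascii"))

assert base64.b64decode(assemble_chunks(path)) == original
delete_chunks(path)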
Hey @pvanliefland, this looks good to me. Certainly better than the current situation where large file uploads just totally fail. I think it makes sense to discuss the remaining concerns in the PR.
@ericdill ok, I'll start working on a PR.
For cleanup of the ContextVar, I couldn't really find a good example elsewhere: I think that none of the custom ContentsManager implementations handle chunked uploads properly.
As for the standard LargeFileManager, it doesn't run into the same potential issue: it creates a new file for chunk 1 and simply opens that file in append mode whenever a new chunk is posted (see the sketch below).
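Roughly, that append-based behaviour boils down to something like the following (an illustrative sketch, not the actual LargeFileManager source; the function name is made up):

import base64
import io

def save_chunk_to_disk(os_path: str, b64_content: str, chunk: int):
    """Write the first chunk with 'wb', append every following chunk with 'ab'."""
    data = base64.b64decode(b64_content.encode("ascii"))
    mode = "wb" if chunk == 1 else "ab"
    with io.open(os_path, mode) as f:
        f.write(data)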
At this point I'm thinking of either:
- cleaning up stale entries whenever the get() method is called (not pretty but should do the job)
- scheduling a cleanup with add_timeout from Tornado (https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop.add_timeout), as sketched below
I'll probably try both approaches.
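A possible shape for the second option, assuming the content_chunks registry from the code above; the interval, the expiry threshold and the last_chunk_at bookkeeping are made-up details for the sketch:

import datetime
import time
from tornado.ioloop import IOLoop

MAX_CHUNK_AGE_SECONDS = 15 * 60                 # made-up expiry threshold
CLEANUP_EVERY = datetime.timedelta(minutes=1)   # made-up interval

last_chunk_at = {}  # hypothetical: path -> timestamp of the last received chunk

def cleanup_stale_chunks():
    """Drop chunk lists for uploads that have been idle too long, then reschedule."""
    now = time.time()
    current_value = content_chunks.get()
    for path in list(current_value):
        if now - last_chunk_at.get(path, now) > MAX_CHUNK_AGE_SECONDS:
            del current_value[path]
            last_chunk_at.pop(path, None)
    IOLoop.current().add_timeout(CLEANUP_EVERY, cleanup_stale_chunks)

# Kick off the first scheduled cleanup on the running notebook IO loop.
IOLoop.current().add_timeout(CLEANUP_EVERY, cleanup_stale_chunks)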
sounds great, thanks for tackling this!
Hi @pvanliefland how are things going? Want to open a PR with these changes so far? I'd be happy to help test against my jupyterhub instance
Hey @ericdill , thanks for the ping - I'll schedule some time for this tomorrow.
Closed by #99
When uploading a file via the Jupyter web UI, the file seems to be truncated. The original file size is 29MB, but when the upload finishes with no error, the file we get is only 5.3MB.