TkTech / ckanext-cloudstorage

Implements support for resource storage against multiple popular providers via apache-libcloud (S3, Azure Storage, etc...)
MIT License
35 stars 55 forks source link

Uploads with S3 Hang Indefinitely #43

Closed thomastu closed 3 years ago

thomastu commented 4 years ago

Motivation

I'm seeing a crash when using S3 after PR #35. Files now hang infinitely when trying to upload. This is on CKAN 2.8 using python 2.7 with the following cloudstorage related settings:

# Plugins
ckan.plugins = stats text_view image_view recline_view oauth2 datastore datapusher resource_proxy cloudstorage

# Cloudstorage Driver
ckanext.cloudstorage.driver = S3_US_WEST
# Bucket Name
ckanext.cloudstorage.container_name = {{my bucket name}}
# Credentials
ckanext.cloudstorage.driver_options = {"key": "{{AWS_ACCESS_KEY_ID}}", "secret": "{{AWS_SECRET_ACCESS_KEY}}"}
# Use Secure URLs for S3
ckanext.cloudstorage.use_secure_urls = 1
# 5-day maximum for stranded uploads
ckanext.cloudstorage.max_multipart_lifetime  = 5

Diagnosis

I think the crash is ultimately coming from how the SpooledTemporaryFile implements next in python 2.7. It looks like SpooledTemporaryFile.next() returns a callable method, when it should just be wrapping the function (i.e. SpooledTemporaryFile.next() returns self._file.next when it should return self._file.next().) In cloudstorage.storage._get_underlying_file, the underlying file.stream is a SpooledTemporaryFile, and so when it gets passed off to libcloud and calls upload_object_via_stream, libcloud will call next() on this file stream expecting it to eventually stop iterating. Since next() just returns a callable, it goes on forever. I verified this by hacking around and changing L25 above to be the following

wrapper.stream.next = wrapper.stream.next()
return wrapper.stream

and uploads started working once more.

TkTech commented 4 years ago

@Zharktas or @amercader, any chance you can look at this?

amercader commented 4 years ago

Yes, I'll check it out @TkTech

thomastu commented 4 years ago

Just to clarify, this isn't a bug with this codebase, but I don't think the real underlying problem will get solved in upstream python, so there would be value in handling it within this specific application. Here's a small example reproducing the root cause of the bug: (assumes an s3 mock running with localstack on port 4572.)

# Start localstack (on fish shell)
env SERVICES=s3 localstack start --host
# bash: `SERVICES=s3 localstack start --host`

# Make the bucket
aws --endpoint-url=http://localhost:4572 s3 mb s3://my-bucket
"""This snippet reproduces what is effectively happening in 
https://github.com/TkTech/ckanext-cloudstorage/blob/master/ckanext/cloudstorage/storage.py#L246
"""
from libcloud.storage.providers import get_driver
from libcloud.storage.types import Provider

from tempfile import SpooledTemporaryFile

if __name__ == "__main__":

    S3Driver = get_driver(Provider.S3_US_WEST)
    driver = S3Driver("key", "secret", host="localhost", port=4572, secure=False)
    container = driver.get_container("my-bucket")

    with SpooledTemporaryFile() as f:
        f.write(b"hello world!")
        f.seek(0)
        # This will hang forever - debugging shows that
        # "<method-wrapper 'next' of cStringIO.StringO object at 0x103435a78>"
        # is just getting written over and over again
        container.upload_object_via_stream(f, "test.txt")