django / channels

Developer-friendly asynchrony for Django
https://channels.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Work out what to do with file uploads #11

Closed andrewgodwin closed 8 years ago

andrewgodwin commented 9 years ago

Especially large ones, that won't fit in the 1MB message limit. Not sure they can be chunked like the responses can, given that would need perfect ordering.

DasIch commented 8 years ago

I wanted to learn more about django-channels and looking at the documentation and code I had this very same question. I was also struck by how - at least to me - conceptually similar django-channels seems to be to Mongrel2. So I thought it might be interesting to see how Mongrel2 solves the problem.

Unfortunately there doesn't appear to be all that much in terms of documentation but there is this demo. Basically Mongrel2 splits up the request into two messages and writes the body of the request to a file. This way you can respond before the upload is completed to abort or open the file to get the body once the upload is completed.

This approach seems reasonable to me; it would avoid the problems with chunking and also allow checking the progress of the file upload.

andrewgodwin commented 8 years ago

Yeah, Mongrel2 had a similar-ish architecture, but it was somewhat ill thought out and didn't enforce proper network transparency. That's the reason you can't write request bodies to files: the interface server and the worker may not share a file system.

It's possible the interface server could write to its own filesystem and then serve the file itself over HTTP or something, but that seems pointless. Chunked requests are the way to go, I think, in much the same way that response bodies can be chunked.

Only problem is that it's not super easy for consumer code to handle them; in particular, if you send the request chunks as multiple messages they're likely to be picked up by different workers. Given Django already has a media storage configured, however, I think it's quite reasonable to write each chunk as it's received to media storage, and then when the end of the request is received, re-assemble those chunks into the final file and provide that as a file-like object on the request to Django.
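
The write-then-reassemble idea above could be sketched roughly as follows. This is a minimal illustration only: `ChunkStore` is a hypothetical in-memory stand-in for whatever shared storage the workers can all reach, and none of these names come from Channels or Django.

```python
import io


class ChunkStore:
    """Hypothetical in-memory stand-in for storage shared by all workers."""

    def __init__(self):
        self._chunks = {}  # (request_id, index) -> bytes
        self._totals = {}  # request_id -> expected number of chunks

    def write_chunk(self, request_id, index, data, last=False):
        # Any worker may call this, and chunks may arrive in any order.
        self._chunks[(request_id, index)] = data
        if last:
            self._totals[request_id] = index + 1

    def is_complete(self, request_id):
        total = self._totals.get(request_id)
        if total is None:
            return False
        return all((request_id, i) in self._chunks for i in range(total))

    def assemble(self, request_id):
        # Join chunks by index, so arrival order does not matter.
        if not self.is_complete(request_id):
            raise ValueError("request body is not complete yet")
        total = self._totals.pop(request_id)
        body = b"".join(self._chunks.pop((request_id, i)) for i in range(total))
        return io.BytesIO(body)  # file-like object for the request


store = ChunkStore()
store.write_chunk("req-1", 1, b"world", last=True)  # last chunk arrives first
store.write_chunk("req-1", 0, b"hello ")
upload = store.assemble("req-1")
```

Indexing each chunk is what removes the need for perfect ordering on the channel itself; the cost is that something has to hold the chunks until the last one arrives.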

agateblue commented 8 years ago

I'm concerned about this issue, since it has already blocked me from testing django-channels in a project of mine.

The chunk-splitting part is totally fine to me. However, writing the chunks to media storage seems a bit hackish: media files are usually publicly available (because in production they are usually served by the load balancer and are not handled by the Django application), which means somebody with the correct URL could access request chunks directly in a web browser (or I am not understanding the solution correctly, which is possible).

Another point to consider is that not every Django project uses media files, and that a file-based solution may not scale as well as an in-memory solution such as Redis.

For these reasons, I find it more reasonable and consistent to store everything that's request/response-related in the same backend (memory, Redis, etc.). I may be totally wrong here, but since you'll be splitting requests that are too large into chunks, you'll have to store the chunk order and relations somewhere.

I don't yet have the full architecture of channels in mind, but what if:

  1. You store chunks in a different queue so they are not consumed by classic workers
  2. When all chunks are created, you send a message (containing a reference to all chunks and their order) to a dedicated channel
  3. This channel's consumer job is to reassemble the request and send it as a message to the classic, already defined HTTP channel

I don't know if it can be implemented, but I thought it could help.

agateblue commented 8 years ago

Since the 1MB limit is on the message itself, my solution obviously does not work; I'm sorry about that. Maybe in step 3, sending the whole content directly to the HTTP consumer (and not as a message) could do the trick?

andrewgodwin commented 8 years ago

Well, it may be that we need to extend either the maximum message size (a Redis key can reach much larger sizes, but other solutions may not), or perhaps specify that some kind of chunk/blob storage is required as part of a channel backend and extend the spec that way.

Message backends already have to have both channel support and locking support, so while I'm not super keen on making them need a third thing it seems like it might be needed in this case. The alternative is finding some way of changing the abstraction so a message can be sent in multiple chunks and all of those route to the same worker.

DasIch commented 8 years ago

Looking at RabbitMQ is interesting in regards to message sizes. The AMQP protocol RabbitMQ uses defines a maximum message size of 2^64 bytes, which is of course irrelevant in practice. The only thing RabbitMQ is constrained by in regards to message size is RAM and bandwidth between nodes (in a cluster setup).

Recommendations on mailing lists and Stack Overflow all tend towards splitting files up over multiple messages or using an out-of-band mechanism like Redis or Memcached. Some people recommend message sizes of even less than 1MB; others report sending messages of several hundred MB without problems. There doesn't seem to be one optimal message size, but 1MB does appear to be a very reasonable choice.

In regards to splitting up messages, I'm wondering whether it would be possible to send the body split up over several messages, over a separate channel specific to a single request. The name of that channel would be referenced in the HTTP request message, so that the worker knows which channel to listen to for the body. This would introduce a third channel type, but it wouldn't require substantially more of message backends.

andrewgodwin commented 8 years ago

Yeah, I chose 1MB mostly as it seemed like a size most systems could take. The reality is, as long as there's chunking for everything, maximum size can just become a backend parameter and there's no need to worry, so that's why I don't want the solution to be "more size!!!"

The problem with sending a request over a separate channel is that workers cannot consume multiple messages in one go; I presume you're thinking of some kind of sticky channel, where the worker gets the main message, then adds the body channel to its fetch queue, and then when it's all fetched removes it? That could work.

As I think about this, though, part of me favours making the message body size technically infinite from a user point of view, and having the backends deal with chunking it as appropriate. The abstraction for how to do this can then be specialised for each backend's strengths: some will just lump it in one message, some will stream one message at a time, some might do a direct connection (there's conceivably a backend that uses some kind of direct connection and queuing system between workers and interface servers, but I'm definitely not good enough to write it).

DasIch commented 8 years ago

[...] I presume you're thinking of some kind of sticky channel, where the worker gets the main message, then adds the body channel to its fetch queue, and then when it's all fetched removes it? That could work.

Yes, exactly.

I completely agree with you on not exposing chunking to users. They should be able to access the entire content transparently, possibly via a file-like API for larger than memory content.

andrewgodwin commented 8 years ago

Hmm, a file-like API would be tricky given messages are presented as dictionaries, but since they're always meant to be dicts, a per-key "all or stream" kind of thing could work well; you can try and do traditional x['foo'] access, but you can also ask for key sizes and a stream object?
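
A per-key "all or stream" message might look something like the sketch below. `StreamableMessage`, `size()` and `stream()` are hypothetical names invented for this illustration, and storing a large value as a list of byte chunks is an assumption rather than anything from the spec.

```python
class StreamableMessage(dict):
    """Hypothetical dict whose large values are kept as lists of byte chunks."""

    def __getitem__(self, key):
        # Traditional x['foo'] access: join the chunks into a single value.
        value = super().__getitem__(key)
        if isinstance(value, list):
            return b"".join(value)
        return value

    def size(self, key):
        # Total size of the value without joining it in memory.
        value = super().__getitem__(key)
        if isinstance(value, list):
            return sum(len(chunk) for chunk in value)
        return len(value)

    def stream(self, key):
        # Yield the value one chunk at a time.
        value = super().__getitem__(key)
        chunks = value if isinstance(value, list) else [value]
        yield from chunks


message = StreamableMessage(method="POST", body=[b"ab", b"cd", b"e"])
whole_body = message["body"]
body_size = message.size("body")
streamed = list(message.stream("body"))
```

Here the chunks already sit in memory; a real backend would presumably fetch them lazily inside `stream()`, which is the part that makes larger-than-memory content workable.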

andrewgodwin commented 8 years ago

Alright, I've resolved to do this by using a separate channel for the body; the spec for reading channels provides a synchronous, blocking primitive that HTTP handling code can use to do this easily (or an async one could do with more cleverness).
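
As a rough sketch of that approach: the request message carries the name of a per-request body channel, and the handling code blocks on that channel until a chunk says no more content follows. `InMemoryLayer` is a hypothetical stand-in for a channel backend, and the key names `body_channel`, `content` and `more_content`, as well as the channel names, are assumptions of this sketch that should be checked against the linked spec.

```python
import queue
from collections import defaultdict


class InMemoryLayer:
    """Hypothetical stand-in for a channel backend; not the Channels API."""

    def __init__(self):
        self._channels = defaultdict(queue.Queue)

    def send(self, channel, message):
        self._channels[channel].put(message)

    def receive(self, channel, timeout=1.0):
        # Blocks until a message arrives, or raises queue.Empty on timeout.
        return self._channels[channel].get(timeout=timeout)


def read_body(layer, request):
    """Collect the full body, following the body channel if one is named."""
    body = request.get("body", b"")
    body_channel = request.get("body_channel")
    more = body_channel is not None
    while more:
        chunk = layer.receive(body_channel)
        body += chunk.get("content", b"")
        more = chunk.get("more_content", False)
    return body


layer = InMemoryLayer()

# Interface server side: first part inline, the rest on a per-request channel.
layer.send("http.request", {
    "path": "/upload/",
    "body": b"part1-",
    "body_channel": "http.request.body!abc",
})
layer.send("http.request.body!abc", {"content": b"part2-", "more_content": True})
layer.send("http.request.body!abc", {"content": b"part3", "more_content": False})

# Worker side: only the worker that took the request reads its body channel.
request = layer.receive("http.request")
full_body = read_body(layer, request)
```

Because only the worker holding the request message knows the body channel's name, the chunks cannot be picked up by a different worker, and ordering within a single channel is enough.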

Spec here: http://channels.readthedocs.org/en/latest/asgi.html#request

Implementation in AsgiRequest here: https://github.com/andrewgodwin/channels/blob/5348c527780099ecf919543c18841b8cfb230b4f/channels/handler.py#L65

auvipy commented 8 years ago

@andrewgodwin Hi Andrew, at present can I send any type of file through channels? And do you think using channels as the backbone of a messenger platform is a good idea?

andrewgodwin commented 8 years ago

@auvipy Individual messages on channels are limited to a maximum of around 256KB in reality; to send files over it you'd need to chunk them up, and it's not the ideal use case. I'd store files in a centralised file storage service (S3-alike or similar).
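
If you did want to chunk a file up, a minimal sketch could look like this. The 200KB chunk size is just an assumed value chosen to stay under the limit mentioned above, and the `index`, `content` and `more_content` keys are hypothetical names for this example.

```python
import io


def iter_chunks(fileobj, chunk_size=200 * 1024):
    """Yield message dicts that each carry one slice of the file."""
    index = 0
    data = fileobj.read(chunk_size)
    while True:
        # Read one chunk ahead so the last message can be flagged as final.
        next_data = fileobj.read(chunk_size)
        yield {"index": index, "content": data, "more_content": bool(next_data)}
        if not next_data:
            break
        data = next_data
        index += 1


payload = io.BytesIO(b"a" * (500 * 1024))
messages = list(iter_chunks(payload))
```

The receiving side would still need to collect the messages and put them back in `index` order, which is part of why a dedicated file store is the better fit.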

It's fine for the part of a messenger platform that transports events around, but you'll also need a centralised store for messages, since channels doesn't provide offline message storage support (messages on channels expire after a minute).

Think of it more as a way to get different parts of the system talking to each other.