thoughtless opened this issue 10 years ago
I would like firehose to accept a batch of messages for various channels with a single HTTP request. The POST body might look like one of the options below.
We'd also need to figure out how the `ttl` option (AKA `max-age`) should be handled.
Option 1)

```json
{
  "/my/unique/channel": {
    "messages": [{"pay": "load"}, {"would": "have been the body"}],
    "ttl": 90
  },
  "/some/other/channel": {
    "messages": [{"more": "data"}, {"that would": "have been the body"}],
    "ttl": 120
  }
}
```
Option 2)

```json
{
  "/my/unique/channel": [
    {"ttl": 90, "message": {"pay": "load"}},
    {"ttl": 60, "message": {"would": "have been the body"}}
  ],
  "/some/other/channel": [
    {"ttl": 100, "message": {"more": "data"}},
    {"ttl": 120, "message": {"that would": "have been the body"}}
  ]
}
```
I think option 1 is better because:

1. It is simpler.
2. It is less verbose.
3. TTL is implemented as a per-channel option anyway, not per message. (The TTL from the most recently published message overwrites the previous one.)
It might also be reasonable to return a 500 error in the case of an error, since we would generally expect all messages to be processed. If any of the POSTed JSON is unexpectedly incorrect (for example, if one element in the array isn't a valid message to publish to a firehose channel), then we could return a 400.

The main use case for this is reducing HTTP overhead when publishing many messages at once.
To dive right into the options: Option 1 is correct in that the `ttl` is defined per channel. It's also future-proof in that we can add the `buffer-size` attribute to the channel when we implement that feature.
As you correctly point out, Option 2 doesn't make sense since the `ttl` is a channel-level setting.
I'd like to propose a third option that I think would result in a simpler publisher/consumer:
```json
[
  {
    "channel": "/my/unique/channel",
    "payload": {"pay": "load"},
    "ttl": 90
  },
  {
    "channel": "/my/unique/channel",
    "payload": {"would": "have been the body"},
    "ttl": 90
  },
  {
    "channel": "/some/other/channel",
    "payload": {"more": "data"},
    "ttl": 120,
    "buffer_size": 1
  },
  {
    "channel": "/some/other/channel",
    "payload": {"that would": "have been the body"},
    "ttl": 120,
    "buffer_size": 1
  }
]
```
The client can be less stateful when putting together a batch to send to the server (e.g. you don't need to check whether a channel already exists before adding messages to it), and this lends itself better to being a "stream" of events.
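To make that concrete, here is a minimal sketch of what an Option 3 publisher could look like in Ruby. The `/batch` endpoint URL is an assumption for illustration, not an existing Firehose API:

```ruby
require "json"
require "net/http"

# Building a batch is just appending to an array; no per-channel
# bookkeeping is needed before adding a message.
batch = []
batch << { channel: "/my/unique/channel",  payload: { "pay" => "load" },  ttl: 90 }
batch << { channel: "/some/other/channel", payload: { "more" => "data" }, ttl: 120 }

# Hypothetical endpoint; Firehose would define the real path.
uri = URI("http://firehose.example.com/batch")
Net::HTTP.post(uri, batch.to_json, "Content-Type" => "application/json")
```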
This protocol would also be useful for connection multiplexing in the Firehose.js library on the client side so that we don't have to open up 10 different WS connections for 10 different channels:
```json
[
  {
    "channel": "/my/unique/channel",
    "payload": {"pay": "load"},
    "sequence": 1
  },
  {
    "channel": "/some/other/channel",
    "payload": {"that would": "have been the body"},
    "sequence": 3
  }
]
```
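As a rough illustration of the consuming side (sketched in Ruby here, though Firehose.js would do this in JavaScript), demultiplexing such frames is just routing each payload to a per-channel handler. The handlers below are hypothetical:

```ruby
require "json"

# One handler per logical channel, all fed from a single connection.
handlers = {
  "/my/unique/channel"  => ->(payload) { puts "unique: #{payload}" },
  "/some/other/channel" => ->(payload) { puts "other: #{payload}" },
}

frames = JSON.parse('[
  {"channel": "/my/unique/channel",  "payload": {"pay": "load"}, "sequence": 1},
  {"channel": "/some/other/channel", "payload": {"more": "data"}, "sequence": 3}
]')

# Route each payload to its channel. The per-channel sequence number is
# what would let a subscriber detect dropped or out-of-order messages.
frames.each do |frame|
  handlers[frame["channel"]]&.call(frame["payload"])
end
```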
Taking a step back, have you considered that HTTP/HTTPS is the wrong protocol for publishing to firehose? Perhaps a direct connection to the Redis pub/sub instances would perform well by avoiding the HTTP connection overhead. There are also other wire formats, like protobuf, that could speed things up.
Those are some great ideas. I totally agree that the third syntax option you have proposed is much more stream-based and simpler for the client.
However, with the use case I have in mind, I actually want my client to be more stateful. My goal is to address https://github.com/polleverywhere/firehose/issues/35 in the publisher, so that unneeded messages are never sent to the server at all. That is much more efficient than sending them to the server and having the server drop them.
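For example, a stateful publisher along these lines could coalesce messages per channel before anything touches the network. The class below is a hypothetical sketch, not Firehose's actual publisher:

```ruby
# For channels where only the latest value matters (the issue #35 case),
# newer payloads overwrite older ones locally, so dropped messages never
# cost an HTTP request.
class CoalescingBatch
  def initialize
    @pending = {} # channel => the single most recent message
  end

  def add(channel, payload, ttl:)
    @pending[channel] = { payload: payload, ttl: ttl }
  end

  # Emit the Option 1 shape: channels mapped to their messages and TTL.
  def flush
    batch = @pending.transform_values { |m| { messages: [m[:payload]], ttl: m[:ttl] } }
    @pending = {}
    batch
  end
end
```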
The other reason I like option 1 better than option 3 is that it would allow the server to batch-update Redis. The current pull request doesn't do this; it just publishes one message at a time to Redis. In theory, though, we could have a single Redis Lua script that batch-updates each channel, so we'd only need one Redis command per channel in the batch update.
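As a sketch of that idea, something like the following Lua script could apply one channel's whole batch in a single Redis round trip. The key layout and script shape here are assumptions for illustration and don't match Firehose's actual Redis schema:

```ruby
require "redis"

# Pushes every payload for one channel, trims the backlog, refreshes the
# TTL, and publishes -- all atomically, one EVAL per channel in the batch.
BATCH_PUBLISH = <<~LUA
  local list_key    = KEYS[1]
  local channel_key = KEYS[2]
  local ttl         = tonumber(ARGV[1])
  local buffer_size = tonumber(ARGV[2])
  for i = 3, #ARGV do
    redis.call("LPUSH", list_key, ARGV[i])
    redis.call("PUBLISH", channel_key, ARGV[i])
  end
  redis.call("LTRIM", list_key, 0, buffer_size - 1)
  redis.call("EXPIRE", list_key, ttl)
  return #ARGV - 2
LUA

redis = Redis.new
redis.eval(BATCH_PUBLISH,
           keys: ["firehose:/my/unique/channel:list", "firehose:/my/unique/channel"],
           argv: [90, 100, '{"pay":"load"}', '{"would":"have been the body"}'])
```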
The idea of using other protocols is very interesting. I think it requires more sysadmin overhead though, so I'm not personally as interested in it at the moment.
edit: I like Option 1 not 2
For Option 3, the client can still be more stateful, but that additional state doesn't have to be imposed by the protocol. If you impose state in the protocol, then at a minimum every client also needs to be stateful. If you don't, you can have both simple and more complex clients.
I'm going to do some performance tests to see how much benefit I get from batching requests client-side vs. both batching and dropping unneeded requests client-side.
Regarding stateful vs. stateless, I'm not sure that is the best comparison between the two options. Any client that deals with batching is stateful; one uses an array for state, the other a dictionary/hash. The array is definitely simpler state, and it lends itself toward streaming. For example, the client could continue adding to the array while it is in the process of uploading to the server.
The hash state requires a bit more logic client-side. But if the server can benefit from that logic (such as by batching Redis updates), it makes the system more scalable. Firehose clients (including publishers) are more horizontally scalable than Firehose servers.
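To illustrate the streaming property of the array form, a publisher could swap in a fresh array before each upload so producers never block on the network. This is only a sketch; `upload` stands in for whatever HTTP call the publisher uses:

```ruby
class StreamingBatch
  def initialize
    @mutex  = Mutex.new
    @buffer = []
  end

  # Producers keep appending, even while a flush is in flight.
  def push(message)
    @mutex.synchronize { @buffer << message }
  end

  def flush
    # Swap buffers under the lock, then upload outside it so new
    # messages can keep arriving while the batch is on the wire.
    outgoing = @mutex.synchronize do
      batch = @buffer
      @buffer = []
      batch
    end
    upload(outgoing) unless outgoing.empty?
  end

  private

  # Stand-in for the real HTTP POST of the JSON array.
  def upload(batch); end
end
```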
I've started to see the stream format emerge on the server in the `MessageBuffer::Message` object. It resembles:
```json
[
  {
    "message": "Hi there",
    "channel": "/greetings/from/earth",
    "ttl": 90
  },
  {
    "message": "Bee boop",
    "channel": "/greetings/from/mars",
    "ttl": 60
  },
  {
    "message": "Hi there again",
    "channel": "/greetings/from/earth",
    "ttl": 30
  }
]
```
The Firehose server would process these messages from top to bottom, preserving the client's intended publishing order (we can't make guarantees about the order messages are published in, but we should at least make a best effort).
I also want to attempt to define a more consistent message format for subscribing and publishing. Our clients are already consuming a stream of sequential messages, which an array best approximates.
The service that's publishing to Firehose could batch up a bunch of messages with a `Batch` publisher. Consider a struct of Firehose messages in Ruby:
```ruby
batch = Firehose::Publisher::Batch.new
batch.messages << Firehose::Message.new("Hi there", channel: "/greetings/from/earth", ttl: 90, buffer_size: 1)
batch.messages << Firehose::Message.new("Bee boop", channel: "/greetings/from/mars", ttl: 60)
batch.messages << Firehose::Message.new("Hi there again", channel: "/greetings/from/earth", ttl: 30, buffer_size: 1)
batch.publish
```
Those messages would be converted into the JSON shown above and sent to the Firehose server for processing. The Firehose client would be free to implement rate limiting, etc., based on the contents of the messages (e.g. messages with a `buffer_size` of 1 could drop older messages, determined by order, on publish).
We could guarantee that messages are published in the order they appear in the JSON payload via the Lua script, since Redis executes Lua scripts atomically.