apa512 / clj-rethinkdb

Eclipse Public License 1.0
205 stars 42 forks source link

callback to run? #20

Closed sritchie closed 8 years ago

sritchie commented 9 years ago

Is it possible to run async queries with a callback as a second arg? (Of course it's not implemented yet, just trying to figure out how I might hook in to this.)

Happy to help implement. Thanks!

apa512 commented 9 years ago

Async querying sounds like a nice addition. Have you looked at how https://github.com/bitemyapp/revise does it?

sritchie commented 9 years ago

Looks like he dispatches a future to do the work then dumps the results into a core.async channel. If you're down for that I can add that. Do you want run to return a channel by default? Or how about we add a run-async function that returns a channel if it takes one arg, or accepts a callback as a second arg? That'd match the javascript API.

sritchie commented 9 years ago

I probably can't get to this in the next few days, but I definitely want this feature and can make it happen.

apa512 commented 9 years ago

Adding run-async is probably safest. I wouldn't mind having it be the default personally, but maybe that's too confusing for someone who hasn't used core.async?

danielytics commented 9 years ago

I agree with @apa512 - I would also be happy with it as the default, but I think it raises the barrier to entry.

jwr commented 9 years ago

Two observations:

Implementing asynchronicity is something to be considered in a wider context.

danielytics commented 9 years ago

On @jwr's second point - this is something I've been wondering about: how to handle connection failures (eg the RethinkDB server you were connected to goes down). Currently I have a loop that attempts to reconnect a few times before failing, but it only works with one server. Since RethinkDB is clustered, it would be great to somehow maintain a list of nodes and try to connect to a different one if there is a failure. It would be great if this could somehow be handled behind the scenes.
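A minimal sketch of that multi-node fallback idea, with the connect fn passed in as a parameter (it would stand in for something like r/connect); the node-list shape and the give-up-after-one-pass policy are illustrative assumptions, not driver behavior:

```clojure
(defn connect-any
  "Try each {:host .. :port ..} map in `nodes` with the supplied
  `connect` fn, returning the first successful connection, or nil
  if every node fails."
  [connect nodes]
  (some (fn [{:keys [host port]}]
          (try
            (connect :host host :port port)
            (catch Exception _ nil)))   ; node down: move on to the next
        nodes))
```

A retry loop like the one described above could then call connect-any repeatedly with a backoff instead of hammering a single host.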

apa512 commented 9 years ago
danielytics commented 9 years ago

@apa512 I'm a big fan of core.async too, but I think perhaps it should be a layer above the core rethinkdb driver? Perhaps a wrapper around run could be added in a separate namespace, say rethinkdb.async?

I've added an issue for connection pooling #23. I'm happy to help out on this (time permitting), so we can flesh out how it should work there.

@jwr I can answer some of your questions:

jwr commented 9 years ago

@apa512 Just to clarify, I use core.async a lot and I'm a big fan of it. But I try not to make components dependent on it — if a callback will do, I use a callback. This means that the caller has the freedom to choose what to use, and even if it's eventually a channel — the protocol to use on that channel.

As an example, I have apps with a pub/sub-based dispatcher. It can take input from channels, in a [key value] format. If you forced the use of core.async, you likely would use a different format, so I would need to run a go loop reading the response, convert it to my format, and put it on another channel. So even though everyone seemingly uses core.async, there is no simple way to connect components together.

In contrast to that, a callback-based interface lets me directly supply a fn that puts responses onto my channel in my particular format.
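That adapter could be as small as this sketch (assuming core.async is on the classpath; :db-result is a made-up dispatch key for illustration):

```clojure
(require '[clojure.core.async :as async])

(defn ->dispatcher
  "Returns a callback that puts each response onto `ch` in the
  [key value] format a pub/sub dispatcher expects."
  [ch]
  (fn [response]
    (async/put! ch [:db-result response])))

;; Usage with a hypothetical callback-accepting run:
;; (run-async query conn (->dispatcher dispatch-ch))
```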

I guess many Clojurians would mention "complecting things" at this point :-)

sritchie commented 9 years ago

@jwr well, I'd argue that the callback would have a very similar form to the function you'd pass to map over a channel, yeah?

(async/map callback-fn [c]) is shorthand for creating that little go loop, of course.

The problem I have with layering an async abstraction on later is that the user then has to get into async land by wrapping calls in thread or future. If the underlying RethinkDB connection already supports async via streaming responses, then you're just missing out on an opportunity to help the user out.

danielytics commented 9 years ago

As I'm looking into the connection code right now in order to fix #27, I can look at fixing this issue too, so now would be a good time to decide what to do here.

Callback, core.async channel, both or something else?

apa512 commented 9 years ago

I like channels and can't really see a situation where callbacks would be needed.

danielcompton commented 9 years ago

Callbacks are a pretty common pattern for most asynchronous operations, even in Clojure. Having both would be handy as people may not want or need to use channels for this.

sritchie commented 9 years ago

I like channels, but they're so damned easy to wrap on top of a callback design with

(fn [x] (a/put! channel x))

I'd vote for callbacks to stay close to the JS library design.

danielytics commented 9 years ago

I'm currently looking into this, but I'm not yet sure what effect it should have on the current cursor implementation: https://github.com/apa512/clj-rethinkdb/blob/master/src/rethinkdb/net.clj#L15-L21

Currently, queries return a lazy sequence that performs additional (continue and stop) queries to fetch more data as you consume it. If results are instead consumed through a callback or core.async channel, should they be delivered that way instead of through a lazy sequence?

danielytics commented 9 years ago

I've been playing around with this some more and I'm really unsure about how to approach the lazy sequence:

  1. Do we want to keep it as is? That is, a query returns a single result: a lazy sequence containing all of the actual results, where consuming the sequence lazily queries the database for more. This means that if you supply a callback or channel, it receives a single lazy sequence
  2. Alternatively, the query could stream each result one by one, so if you supply a channel, it would receive a value for every result and if you supply a callback, it would get called for every result
  3. Some combination of the above (eg, channels receive one result at a time, callbacks receive a lazy sequence)
  4. Something else?

For synchronous use, I think the current approach is the most seamless: query the database, get back a sequence, consume the sequence like any other and if more results need to be realized then more queries can be made in the background. (At the risk of unrealized sequences outliving the connection they use to query for more, but connection pools can fix this)

For asynchronous use, I'm not really sure what the best way is.

I also just realised that even the synchronous call could actually be asynchronous and lazy by returning a completely unrealized lazyseq that reads its data from the connection as it gets realized. Eg:

(let [my-results (r/run some-query connection)] ; Does not block, my-results not yet realized
   (println (first my-results)) ; Will block if query response not received yet, otherwise data immediately available
   (println (count my-results))) ; May perform additional queries to realize complete sequence

One solution could be to keep the current behavior for the existing API (if no callback is supplied, return a single lazy sequence), add an optional callback param which, if supplied, receives that single lazy sequence, and provide a second "streaming" API that delivers each result on a channel.
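A sketch of that backwards-compatible shape, with run* as a stub standing in for the existing synchronous run (all names here are illustrative, not driver API):

```clojure
(defn run*
  "Stub standing in for the existing synchronous run."
  [query conn]
  [:row1 :row2])

(defn run
  "Two args behaves as today; a third callback arg makes the call
  asynchronous, delivering the whole (possibly lazy) result to it."
  ([query conn] (run* query conn))
  ([query conn callback]
   (future (callback (run* query conn)))  ; run off the caller's thread
   nil))
```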

The "streaming" API could simply be a layer on the non-streaming API. Something like this:

(defn run-stream [query conn]
  (let [chan (async/chan)]
    (async/onto-chan chan (run query conn)) ;; If run returns immediately, then this call won't block as onto-chan runs async in its own go block
    chan))

Or maybe do as @sritchie said above and only support lazy (but async) seq and callbacks (which themselves receive a single lazyseq), since connecting to core.async channels is easy (onto-chan like above, or passing in put!/>!! as the callback as @sritchie did above).

Basically, I'm looking for some more thoughts on how this should be handled.

EDIT: correction: The current implementation only returns a lazy seq if the response is a partial success. If it is a RQL sequence, it is fully available as soon as the response comes in, and if it is a single RQL value, then that value is returned. I guess that means the always-async approach won't work, because it would break the RQL value case, which would have to block until the value is available in order not to break backwards compatibility.

jwr commented 9 years ago

Well, you asked for thoughts, so here goes: I don't think I'd ever rely on laziness when working with RethinkDB. I don't even see a reasonable use case: I should know in advance what to expect and prepare my queries accordingly.

Lazy sequences introduce all kinds of complications in a reasonably-sized system, even if we're talking about Clojure's lazy sequences. You have to worry about unrealized sequences holding resources, their memory usage is significant, they make debugging harder (because things blow up in unexpected places, often not even pointing to the actual problem). The only real benefit I get from them is transducer operations.

The systems I know have plenty of doall, mapv, vec or into, introduced specifically to force evaluation and realize the full sequence. Some are there because they have to be there, some just make reasoning about the system or debugging easier.

I don't think I'd want to keep unrealized lazy sequences around in a multi-threaded web app system, knowing that they can hold on to resources I don't even know about (in RethinkDB). I'd much rather know exactly when the db is being accessed, and when resources are no longer needed.

If you consider lazy sequences of db results in a wider context of a distributed system where your db nodes can fail and you want to be able to reconnect, things get even hairier and I simply don't think I want to deal with that kind of complexity.

danielytics commented 9 years ago

Fair points. I think you are right then that it is best not to rely on lazy seqs.

Should partial success then be rewritten not to return a lazy-seq (since a partial success holds onto the connection and may perform further queries later)?

danielcompton commented 9 years ago

One thing to think of in this context is backpressure. We may make a very large query, but not want to realise or handle it all at once. The ability to receive a small number of results and send a CONTINUE to the server is great because we get backpressure there. Any async option would need to have the ability to apply backpressure, even if that was just giving the user directions on using a buffered core.async channel in their callback.
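A sketch of that buffered-channel suggestion, assuming core.async: once the buffer fills, >!! blocks the thread delivering rows, which would naturally delay the next CONTINUE. The buffer size of 8 is an arbitrary illustration:

```clojure
(require '[clojure.core.async :as async])

(def results (async/chan 8))  ; bounded buffer: the backpressure point

(defn backpressured-callback
  "Blocks on >!! once `results` is full, so the thread delivering rows
  (which would otherwise keep sending CONTINUE to the server) waits
  until the consumer catches up."
  [row]
  (async/>!! results row))
```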

apa512 commented 9 years ago

Agreed. Forcing the result set to be realized immediately does not sound like a good solution, if that's what's being proposed here.

danielcompton commented 9 years ago

Also, there's a very timely discussion on this very topic at https://groups.google.com/forum/#!topic/clojure/nuy2CAA89sI. At first I assumed it must have been asked by someone in this thread, but it was from someone completely unrelated.

danielytics commented 9 years ago

Hmm. Zach seems to suggest that a lazy seq is OK (I guess because you can exert backpressure by not consuming the sequence) in the absence of something better, and he doesn't like callbacks because there is no way to exert backpressure.

So going by this, we should probably leave the lazy seqs the way they are (but perhaps look at unifying the seq/cursor interface as mentioned in #30).

As for an async interface, I suppose introducing a dependency on manifold is a bit heavy handed (and manifold isn't exactly the simplest to learn either), but perhaps callbacks can be extended somehow - eg (haven't put much thought into this, just OTOH) what if a second parameter is passed to the callback that can be used to pull more data/prevent more data from being pulled? Just throwing ideas out there...
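One way that two-parameter callback could look, as a rough sketch (deliver-batches and fetch-batch are hypothetical names, not driver API): each batch arrives together with a continue! thunk, and the next batch is only fetched when the consumer invokes it, giving callback users a pull-based form of backpressure.

```clojure
(defn deliver-batches
  "Calls (callback batch continue!) for each batch. Fetching the next
  batch only happens when the callback invokes continue!, so a callback
  that never continues stops all further fetching."
  [fetch-batch callback]
  (letfn [(step []
            (when-let [batch (fetch-batch)]  ; nil means no more batches
              (callback batch step)))]
    (step)))
```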

apa512 commented 9 years ago

lazy-seqs work well in my opinion (they enable looping over large result sets without consuming lots of memory, querying the server only as much as needed).

The main confusion seems to be that queries can return either a Cursor or a normal collection in a non-deterministic way.

I only added Cursors to be able to close unfinished queries (like cursor.close() in the official JS driver). Perhaps there's a better way to do this? The problem is that I need to associate the query with a token. with-meta didn't work because that forced the lazy-seq to be fully realized.
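One possible alternative to with-meta, sketched as a deftype that carries the token alongside the (still lazy) results; close-fn stands in for whatever would send STOP for that token, and nothing here is existing driver code:

```clojure
(deftype Cursor [token results close-fn]
  clojure.lang.Seqable
  (seq [_] (seq results))        ; delegates to the lazy seq; forces nothing
  java.io.Closeable
  (close [_] (close-fn token)))  ; e.g. send STOP for this query's token

;; Usage: (with-open [c (->Cursor 42 lazy-results stop-query!)]
;;          (doall (seq c)))
```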

bhurlow commented 9 years ago

Wow lots of good ideas going on in this thread!

I'll add:

Forcing a library user to use core.async seems like a bummer. Sometimes you want synchronous code. An opt-in async option seems far preferable, whether that's a callback or a channel. Perhaps an alternative to r/run would let a library user make that call: imagine an r/run-async that returns a channel. Plus, since @danielytics' work in #27, core.async is already a library dependency, so it's here to stay.

With regards to @jwr thoughts on laziness:

You have to worry about unrealized sequences holding resources, their memory usage is significant

As @danielcompton points out, in the specific case of rethinkdb, the use of lazy sequences actually help to not store more data in memory than we might need. Requiring large queries to fully realize immediately would both take up resources and go against the grain of the db design.

So I guess I think the lazy Cursor is actually a pretty good way to model this problem. Datomic does this with its version of a 'changes' feed here. Apart from #30, I've been able to build asynchronous abstractions around the lazy seqs without too much difficulty.

bhurlow commented 9 years ago

@apa512 I agree. Maybe a more consistent approach to the cursors coupled with more detailed documentation would solve the async problem

danielytics commented 9 years ago

Well, re: core.async, I'm not against looking at reimplementing it with agents or futures, so it doesn't have to stay a dependency (it was just the easiest way to fix the thread safety issue). Plus, send-query will need to be rewritten to support asynchronous queries, since it would need to change so that it's called in the callback/go-block at least.

Re: Cursor, what if queries always returned a cursor? (Well, for response types 3+; response types 1 and 2 can still return values [1].) If needed, the cursor type could implement whatever other protocols are necessary to make it look like a seq: rather than implementing clojure.lang.Seqable to convert it into a seq, as it currently does, make the cursor itself act as a seq (and still implement Seqable, but have it return itself or something).

Forcing a library user to use core.async seems like a bummer.

I agree - I think whatever is implemented should keep backwards compatibility with the current synchronous/lazy-seq API. I was thinking that r/run takes a third argument (a callback or channel, depending on what is decided) although I'm personally also not against making it a separate function.

So I guess I think the lazy Cursor is actually a pretty good way to model this problem.

After thinking about this a bit, I think you're right. That's what database cursors are for, and I think it's a good fit here, which is why I'm now thinking perhaps queries should always return a cursor, and a cursor just happens to look and act like a lazy-seq.

EDIT: [1] Maybe it should always return a cursor, regardless of response type? For consistency, since it can be tricky for those new to RethinkDB to predict when you get one over the other. It would break backwards compatibility though....

bhurlow commented 9 years ago

Cursors vs. Full responses:

Maybe it should always return a cursor, regardless of response type? For consistency, since it can be tricky for those new to RethinkDB to predict when you get one over the other

Definitely seems to trip people up: https://github.com/apa512/clj-rethinkdb/issues/16

Third arg to run

r/run conn (chan) could be very slick! You could also supply your own domain specific channel however you like to use it

Cursors:

Right now cursors are just Closeable and Seqable, but they could implement other protocols as necessary. Note that #30 is probably still relevant here if cursors become more first-class citizens

danielytics commented 9 years ago

#16 makes a good point that I hadn't considered:

That cursor additionally explodes in your face once it goes out of (with-open) connection scope.

Looks like there's a lot involved in resolving all of this.

bhurlow commented 9 years ago

Maybe the way to move forward here is to:

That way the interface is opt-in and library users can pick their abstraction. Just a thought :)

danielcompton commented 9 years ago

@bhurlow I like your thoughts.

danielcompton commented 9 years ago

I've made a PR for this at #37, discussion welcome :)