eigengo / scalad

Scala Data access for NoSQL databases

Go MAD #76

Closed ghost closed 7 years ago

ghost commented 11 years ago

Not really a bug, but more a question.

Isn't the official Mongo driver blocking? Why not use an async API instead (like MAD)?

[updated title: fommil]

janm399 commented 11 years ago

@bwmcadams was supposed to be bringing MongoDB support for Slick using Hammersmith. Unfortunately, Hammersmith is still a bit too young to use in production. Wrapping the MongoDB calls in a Future or pushing them to an actor, in both cases on a different Dispatcher, only shifts the problem to another place.

I think it's time to see how Hammersmith would fit into Scalad, though.

fommil commented 11 years ago

scalad's API is non-blocking in the sense that a call returns immediately and you can apply foreach, map, etc., which are only evaluated when an element is available. The response type is Iterator, and the underlying producer/consumer can be configured to use various blocking/caching strategies (to avoid OOMs). Iterating through the iterator may block, depending on how fast you're processing the results.
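To make the producer/consumer idea concrete, here is a minimal sketch of an Iterator fed by a background thread through a bounded queue. It is not scalad's implementation: `BoundedIteratorSketch`, `boundedIterator`, and the `capacity` parameter are hypothetical names chosen for illustration, and the sketch assumes elements are never null.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical illustration; none of these names come from scalad.
object BoundedIteratorSketch {
  // Sentinel that marks the end of the stream.
  private case object Done

  /** Runs `produce` on a background thread and exposes its output as an Iterator.
    * The queue holds at most `capacity` elements, so a slow consumer makes the
    * producer block instead of filling the heap. Elements must be non-null. */
  def boundedIterator[A](capacity: Int)(produce: (A => Unit) => Unit): Iterator[A] = {
    val queue = new ArrayBlockingQueue[Any](capacity)
    val producer = new Thread(() => {
      try produce(a => queue.put(a)) // put() blocks while the queue is full
      finally queue.put(Done)        // always signal the end, even on failure
    })
    producer.setDaemon(true)
    producer.start()

    new Iterator[A] {
      private var nextItem: Any = null
      private var finished = false

      def hasNext: Boolean = {
        if (!finished && nextItem == null) {
          val item = queue.take() // take() blocks while the queue is empty
          if (item == Done) finished = true else nextItem = item
        }
        !finished
      }

      def next(): A = {
        if (!hasNext) throw new NoSuchElementException("iterator exhausted")
        val out = nextItem.asInstanceOf[A]
        nextItem = null
        out
      }
    }
  }
}
```

Creating the iterator returns immediately; the blocking happens inside `hasNext` (consumer faster than producer) or inside `put` (producer faster than consumer). In this sketch a producer exception is swallowed by the background thread, and an abandoned iterator leaves the producer parked on `put`, so a real implementation would need error propagation and cancellation.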

There is no such thing as non-blocking I/O. There is always blocking at some point. Claims of non-blocking I/O in any library are just smoke and mirrors.

What non-blocking middleware really means is "an API that allows the end user to define callbacks that will be called when the data is available": i.e. asynchronous. If you want that, then just do your processing in a foreach or map and like Jan says, shove anything else into a Future... that's pretty much what every single "non-blocking" library does anyway, and I have no time for them: give me good old paged results any day, so I can control the memory usage.

fommil commented 11 years ago

btw, @janm399 check out https://groups.google.com/d/msg/scala-user/2wprKWyHAUo/3n5vInjVadAJ: it is easily converted to a ParIterator for doing parallel processing of responses as they come in from scalad.

fommil commented 11 years ago

"async I/O" is exactly the point I am making: calling it "non-blocking" is a complete misnomer because "non-blocking" really shouldn't block but "async I/O" will always block at some layer. ScalaD is partially asynchronous in this regard:

mongo.find(query).map{_.thing}

will return instantly and the map will be called asynchronously (but one at a time). Only when you try to obtain all the results will you experience blocking in your code, e.g. with a .reduce or .toList. You can use my code (above) to get a parallel iterator which will run map in parallel, just like a List.par. (NOTE: I am not sure about foreach, it might block... depends on the Scala implementation)
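The "returns instantly" behaviour can be checked against a plain standard-library Iterator, used here as a stand-in for the result of mongo.find(query). This is a sketch under that assumption; scalad's own iterator may schedule work differently. `LazyMapSketch` is a hypothetical name.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical illustration using only the Scala standard library.
object LazyMapSketch {
  /** Returns the order in which things happened, plus the mapped results. */
  def run(): (List[String], List[Int]) = {
    val log = ListBuffer.empty[String]
    val source = Iterator(1, 2, 3) // stand-in for mongo.find(query)

    // map on an Iterator is lazy: the function has not run yet when map returns.
    val mapped = source.map { x =>
      log += s"map($x)"
      x * 10
    }
    log += "map returned"

    // toList forces every element, one at a time, on the calling thread.
    val all = mapped.toList
    log += "toList returned"

    (log.toList, all)
  }
}
```

With a standard Iterator, the mapping function runs on whichever thread pulls the elements, so "map returned" is logged before any "map(...)" entry. On the foreach question: on a standard Scala Iterator, foreach is eager and consumes the whole iterator, so it blocks until the source is exhausted, just like .toList.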

Like Jan says, if you want a purely async API, then just do

Future {actor ! mongo.find(query).toList}
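A slightly fuller sketch of that one-liner, assuming a hypothetical blocking `blockingFind` in place of mongo.find(query).toList and a dedicated thread pool for blocking work. The pool size and all names are illustrative only; the actor send is left out to keep the example self-contained.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical illustration; blockingFind stands in for a blocking driver call.
object BlockingFutureSketch {
  def blockingFind(query: String): List[String] = {
    Thread.sleep(50) // simulate waiting on the network
    List(s"$query-1", s"$query-2")
  }

  def run(): List[String] = {
    // A separate pool, so blocking calls do not starve the default dispatcher.
    val pool = Executors.newFixedThreadPool(4)
    val blockingEc = ExecutionContext.fromExecutorService(pool)
    try {
      // The caller gets a Future back immediately; a pool thread does the waiting.
      val results: Future[List[String]] = Future(blockingFind("q"))(blockingEc)
      // Await is used only so this demo can return a value.
      Await.result(results, 5.seconds)
    } finally blockingEc.shutdown()
  }
}
```

The caller's thread is freed, but a pool thread still sits blocked for the duration of the call, which is the "shifts the problem to another place" point made earlier in the thread.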

Back to "non-blocking": so you want libuv to block waiting for the network results to arrive, using up a pthread instead of a Java Thread? That requires JNI, since this isn't provided by the JVM, and OS-specific natives, since TCP/IP is not part of the C or C++ spec (although threads, via std::thread, are now part of C++). It will also incur rather a lot of data array copying unless you want to use PrimitiveArrayCritical (and I'm not sure it works in that direction).

I'd love to see performance comparisons: that's all that matters. Beyond that, it's pure coding style. If there is a performance advantage, I'm very interested, but I'd like to see the experiment and run it myself. In order to justify JNI, the performance advantage needs to be incredible.

The other fundamental flaw in Async I/O is that it throws data at you: best way to get an OOM. You need a pull based data source (like MongoDB, or paged JdbcTemplate) in order to avoid that problem, or be damn sure that you never ask for more than a few rows.
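A pull-based source of the kind described can be sketched as an Iterator that requests one page at a time, so at most one page is held in memory by the iterator itself. `PagedPullSketch`, `FetchPage`, and `paged` are hypothetical names; the page fetcher is assumed to return an empty page once the offset passes the end of the data.

```scala
// Hypothetical illustration of paged, pull-based reading.
object PagedPullSketch {
  /** Given (offset, limit), returns at most `limit` rows starting at `offset`. */
  type FetchPage[A] = (Int, Int) => Seq[A]

  /** Lazily walks the pages; the next page is fetched only when the consumer
    * has finished with the current one. Stops at the first empty page. */
  def paged[A](pageSize: Int)(fetch: FetchPage[A]): Iterator[A] =
    Iterator
      .from(0, pageSize)                      // offsets 0, pageSize, 2 * pageSize, ...
      .map(offset => fetch(offset, pageSize)) // lazy: runs only on demand
      .takeWhile(_.nonEmpty)
      .flatMap(page => page)
}
```

For example, with an in-memory range standing in for the database:

```scala
val data = 0 until 10
val rows = PagedPullSketch.paged[Int](3)((offset, limit) => data.slice(offset, offset + limit))
rows.foreach(println) // fetches pages of 3, 3, 3, 1, then one empty page
```

Because the consumer drives the fetching, a slow consumer simply delays the next request rather than accumulating unread results.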

(you hit a nerve :-P... I'm sick of the "non-blocking" I/O hype)

fommil commented 11 years ago

hmm, I appear to stand corrected! :-) Although I don't understand how the response time for a SAFE INSERT can be consistently shorter than the time it takes for a PING: 50 micros vs 250 micros.

@janm399 it would appear that MAD is not so mad after all

fommil commented 11 years ago

@partycoder this ping thing is sticking with me as concerning... something stinks with the test.

Also, prompted by your recommendation to look into Java 7 NIO Async further (which I totally missed, btw, so thanks for pointing it out!), I'm not seeing any great performance benefits on the scale you're talking about. This is the highest hitting google search on the subject: http://vanillajava.blogspot.co.uk/2011/08/comparing-java-7-async-nio-with-nio.html

And IBM also point out that magic operating system support for "non-blocking IO" is not always used: http://www.ibm.com/developerworks/java/library/j-nio2-1/

"Each asynchronous channel constructed belongs to a channel group that shares a pool of Java threads, which are used for handling the completion of initiated asynchronous I/O operations. This might sound like a bit of a cheat, because you could implement most of the asynchronous functionality yourself in Java threads to get the same behaviour, and you'd hope that NIO.2 could be implemented purely using the operating system's asynchronous I/O capabilities for better performance. However, in some cases, it's necessary to use Java threads: for instance, the completion-handler methods are guaranteed to be executed on threads from the pool."

So, basically IBM are admitting that sometimes Thread is used and effectively pthread is used when it leaves the JVM. So there is still blocking somewhere :-P
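The completion-handler behaviour in that quote can be observed with a small sketch that reads a temp file through Java 7's AsynchronousFileChannel and records which thread ran the handler. `Nio2Sketch` and the temp-file name are illustrative; which thread runs the handler depends on the platform and the channel's thread pool, so the sketch only reports it.

```scala
import java.nio.ByteBuffer
import java.nio.channels.{AsynchronousFileChannel, CompletionHandler}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, StandardOpenOption}
import java.util.concurrent.{CountDownLatch, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

// Hypothetical illustration of a Java 7 NIO.2 asynchronous read.
object Nio2Sketch {
  /** Returns (file content, name of the thread that ran the completion handler). */
  def run(): (String, String) = {
    val path = Files.createTempFile("nio2-sketch", ".txt")
    Files.write(path, "hello".getBytes(StandardCharsets.UTF_8))
    val channel = AsynchronousFileChannel.open(path, StandardOpenOption.READ)
    val buffer = ByteBuffer.allocate(16)
    val done = new CountDownLatch(1)
    val handlerThread = new AtomicReference[String]("")

    val handler = new CompletionHandler[Integer, AnyRef] {
      def completed(bytesRead: Integer, attachment: AnyRef): Unit = {
        handlerThread.set(Thread.currentThread().getName)
        done.countDown()
      }
      def failed(error: Throwable, attachment: AnyRef): Unit = done.countDown()
    }

    try {
      channel.read[AnyRef](buffer, 0L, null, handler) // returns without waiting
      done.await(5, TimeUnit.SECONDS)                 // the caller waits only for the demo
      buffer.flip()
      val content = new String(buffer.array(), 0, buffer.limit(), StandardCharsets.UTF_8)
      (content, handlerThread.get)
    } finally {
      channel.close()
      Files.delete(path)
    }
  }
}
```

Printing the second element of the result next to Thread.currentThread().getName shows whether the handler ran on a different thread from the caller on a given JVM and OS.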

Also, I've seen these sorts of things before: http://www.techempower.com/benchmarks/ but this is more a framework comparison than a "sync vs async" performance test. Ideally, one wants to test a clean PING without the framework (as the author does above).

The async hype does appear to still be mostly hype, with a little bit of promise, but if your performance charts are anything to go by: the MongoDB driver just really stinks! (we had our suspicions from the code quality of the BSON layer, if we're being honest).

fommil commented 11 years ago

@partycoder I know how to set the durability level, what is concerning is that the response time for a high durability level is much quicker than PING, so can these results really be believed?

fommil commented 11 years ago

@partycoder I think this should remain open as an RFE to swap to MAD.

But I still don't trust these perf tests: 1/5 the speed of PING? Something is wrong there: maybe he meant micro for ping, not milli.

I'll certainly have to investigate further to understand the real benefits of the new Java 7 async IO. Thanks for bringing it up!

fommil commented 7 years ago

I had completely forgotten that this library existed, so you're almost certainly better off going with whatever the latest and greatest is. scalad was always just a wrapper layer anyway, not a driver replacement, and I've heard ReactiveMongo is very good. I never did believe those MAD numbers...