Closed mathieuancelin closed 10 years ago
@kieranbenton I've just published a new version of 0.2 SNAPSHOT that include several config flags so you can play with it
first I noticed in your application.conf that you still use an execution-context
config block.
Please use something like :
couchbase {
actorctx {
timeout=60000 # that's important for tests, it's the timeout for couchbase operations
...
}
}
then you can play with the following flags :
couchbase {
driver {
useec=true # use RC execution context inside the Java driver to execute future callbacks
blockinfutures=true # block operations inside scala futures
block=true # block the current thread until operation is done. Works only if blockinfutures is true
enableoperationtimeout=true # fail scala future and print a message on operations when the actorctx.timeout is reached
}
}
You can try multiple configuration multiple time (with play start
) to see if you always reproduce the issue.
I didn't manage to reproduce the issue but, given what you describe I guess that sometimes, java future callbacks are never triggered for some reason
Hi Mathieu, Thanks - I'll try with those tweaks and let you know. For my own sanity what kind of environment were you running in? Windows version, x86/64, # cores and JVM version etc?
Cheers, Kieran
On 18 January 2014 18:35, Mathieu ANCELIN notifications@github.com wrote:
@kieranbenton https://github.com/kieranbenton I've just published a new version of 0.2 SNAPSHOT that include several config flags so you can play with it
first I noticed in your application.conf that you still use an execution-contextconfig block. Please use something like :
couchbase { actorctx { timeout=60000 # that's important for tests, it's the timeout for couchbase operations ... }}
then you can play with the following flags :
couchbase { driver { useec=true # use RC execution context inside the Java driver to execute future callbacks blockinfutures=true # block operations inside scala futures block=true # block the current thread until operation is done. Works only if blockinfutures is true enableoperationtimeout=true # fail scala future and print a message on operations when the actorctx.timeout is reached }}
You can try multiple configuration multiple time (with play start) to see if you always reproduce the issue.
I didn't manage to reproduce the issue but, given what you describe I guess that sometimes, java future callbacks are never triggered for some reason
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-32689058 .
Hi,
I'm running OS X 10.8.4 (no Windows @ home), on a core i5 processor (4 cores) with JDK 1.6.0_51 and JDK 1.7.0_25
Hi Mathieu, That doesn't surprise me sadly - a member of our team is running OS X (older version than that) and we see no problems. Its just on Windows which is a bit bonkers. Do you want me to get a windows EC2 instance setup for you so you can see it breaking?
Cheers, Kieran
On 18 January 2014 19:54, Mathieu ANCELIN notifications@github.com wrote:
Hi,
I'm running OS X 10.8.4 (no Windows @ home), on a core i5 processor (4 cores) with JDK 1.6.0_51 and JDK 1.7.0_25
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-32691263 .
Hi,
no need for an EC2 instance right now, I'll try tomorrow on a Windows machine at work.
If you test the last version, the most important flag is enableoperationtimeout=true
.
If with that flag you got error messages, then I guess we can say that the issue is located.
Ok great - will give that a go.
On 19 January 2014 13:00, Mathieu ANCELIN notifications@github.com wrote:
Hi,
no need for an EC2 instance right now, I'll try tomorrow on a Windows machine at work.
If you test the last version, the most important flag is enableoperationtimeout=true. If with that flag you got error messages, then I guess we can say that the issue is located.
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-32707709 .
Ok - so I've updated to the new snapshot, and modified my configuration a couple of times and rerun my tests.
1) with the exact config you specified:
couchbase {
# fail Future[T] if an operation fails
failfutures=true
actorctx {
timeout=1000
execution-context {
fork-join-executor {
parallelism-factor = 20.0
parallelism-max = 200
}
}
}
driver {
useec=true # use RC execution context inside the Java driver to execute future callbacks
blockinfutures=true # block operations inside scala futures
block=true # block the current thread until operation is done. Works only if blockinfutures is true
enableoperationtimeout=true # fail scala future and print a message on operations when the actorctx.timeout is reached
}
evolutions {
disabled = false
documents = db/couchbase
use.locks = true
}
}
This gives me a lot of:
2014-01-19 14:10:35,863 WARN [ReactiveCouchbase] Current thread blocking operation
And much slower execution - but no sign of the 'sticking' problem. Ab tests (after running them 5x) don't seem to exhibit the problem.
2) turning off both 'blocking' options:
driver {
useec=true # use RC execution context inside the Java driver to execute future callbacks
blockinfutures=false # block operations inside scala futures
block=false # block the current thread until operation is done. Works only if blockinfutures is true
enableoperationtimeout=true # fail scala future and print a message on operations when the actorctx.timeout is reached
}
Gives me no outputted logs per request - and very good performance, but DOES create timeouts in the logs - is this essentially the problem? Ab completes within normal time limits but I suppose thats now because the library is throwing exceptions on timeouts and finishing the request?
Stack traces all look the same:
play.api.Application$$anon$1: Execution exception[[TimeoutException: null]]
at play.api.Application$class.handleError(Application.scala:293) ~[play_2.10-2.2.1.jar:2.2.1]
at play.api.DefaultApplication.handleError(Application.scala:399) [play_2.10-2.2.1.jar:2.2.1]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$12$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:165) [play_2.10-2.2.1.jar:2.2.1]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$12$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:162) [play_2.10-2.2.1.jar:2.2.1]
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) [scala-library.jar:na]
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) [scala-library.jar:na]
at scala.util.Try$.apply(Try.scala:161) [scala-library.jar:na]
at scala.util.Failure.recover(Try.scala:185) [scala-library.jar:na]
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:387) [scala-library.jar:na]
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:387) [scala-library.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:29) [scala-library.jar:na]
at play.api.libs.iteratee.Execution$$anon$1.execute(Execution.scala:43) [play-iteratees_2.10-2.2.1.jar:2.2.1]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:37) [scala-library.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:133) [scala-library.jar:na]
at scala.concurrent.Promise$class.complete(Promise.scala:55) [scala-library.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:58) [scala-library.jar:na]
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:254) [scala-library.jar:na]
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:249) [scala-library.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:29) [scala-library.jar:na]
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) [akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) [akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) [akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) [akka-actor_2.10-2.2.0.jar:2.2.0]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) [scala-library.jar:na]
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) [akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) [akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) [akka-actor_2.10-2.2.0.jar:2.2.0]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library.jar:na]
Caused by: java.util.concurrent.TimeoutException: null
at org.reactivecouchbase.client.CouchbaseFutures$$anonfun$timeout$1.apply$mcV$sp(CouchbaseFutures.scala:22) ~[reactivecouchbase-core_2.10-0.2-SNAPSHOT.jar:0.2-SNAPSHOT]
at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) ~[akka-actor_2.10-2.2.0.jar:2.2.0]
3) then upping the timeout to 5000ms
Pretty much the same as for (2) as you'd expect.
Seem to be getting closer here? :)
Ok,
your actorctx
configuration is a bit rough, and 1000 milliseconds for timeout a bit low, but if when you set the blocking flags to true, then the issue is clearly either on the ReactiveCouchbase future callbacks or in the future callback mechanism of the Java driver.
Can you re-run your tests (without blocking, and with proper configuration) with the latest SNAPSHOT (just a few minutes old), and if you don't get messages like
xxxFuture didn't complete the promise, that's weird !!!'
then the issue is located in the future callback mechanism from the java driver.
Hello,
I have used CbFutureAsScala in production at my company for some time.
Recently, I found the same errors (as in this issue) with client.asyncGetBulk(..).asScala I also found an error when AbstractListenableFuture#notifyListener call executor.submit(...)
So, I try to convert some methods of MemcachedClient from using java's Future to scala's Future.
The result is class MemcachedClientScala and a test spec MemcachedClientScalaSpec.
I have not done migrating all methods of MemcachedClient. But asyncGet & asyncGetBulk is working now.
Can any of you change your test code to use MemcachedClientScala.scalaGetBulk and let me know if it is passed or not.
I am too busy now. So if anyone can fork my working and continue migrating other MemcachedClient's methods => I will very wellcome :D
Thank you.
You mean that sometimes you have request hanging forever ?
Or the double completion ?
Ok, so I've:
I've executed my gatling test I sent to you - and think I've still got the problem (ab also does the same thing).
Gatling shows the same weird results, 9994 successful requests out of 10,000: https://www.dropbox.com/s/nmosbcdl0ma8yij/couchsimulation-20140121174457.zip
Which tallies with the number (6) of these that I get in the logs (https://www.dropbox.com/s/2l3tpggg4sgtg8j/application.log).
Which from your message you think is a problem in the driver itself?
2014-01-21 17:45:28,622 ERROR [ReactiveCouchbase]
Timeout there !!!!
2014-01-21 17:45:28,623 ERROR [application]
! @6h2h78kab - Internal server error, for (GET) [/test] ->
play.api.Application$$anon$1: Execution exception[[TimeoutException: null]]
at play.api.Application$class.handleError(Application.scala:293) ~[play_2.10-2.2.1.jar:2.2.1]
at play.api.DefaultApplication.handleError(Application.scala:399) [play_2.10-2.2.1.jar:2.2.1]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$12$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:165) [play_2.10-2.2.1.jar:2.2.1]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$12$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:162) [play_2.10-2.2.1.jar:2.2.1]
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) [scala-library.jar:na]
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) [scala-library.jar:na]
at scala.util.Try$.apply(Try.scala:161) [scala-library.jar:na]
at scala.util.Failure.recover(Try.scala:185) [scala-library.jar:na]
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:387) [scala-library.jar:na]
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:387) [scala-library.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:29) [scala-library.jar:na]
at play.api.libs.iteratee.Execution$$anon$1.execute(Execution.scala:43) [play-iteratees_2.10-2.2.1.jar:2.2.1]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:37) [scala-library.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:133) [scala-library.jar:na]
at scala.concurrent.Promise$class.complete(Promise.scala:55) [scala-library.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:58) [scala-library.jar:na]
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:254) [scala-library.jar:na]
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:249) [scala-library.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:29) [scala-library.jar:na]
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) [akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) [akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) [akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) [akka-actor_2.10-2.2.0.jar:2.2.0]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) [scala-library.jar:na]
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) [akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) [akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) [akka-actor_2.10-2.2.0.jar:2.2.0]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library.jar:na]
Caused by: java.util.concurrent.TimeoutException: null
at org.reactivecouchbase.client.CouchbaseFutures$$anonfun$timeout$1.apply$mcV$sp(CouchbaseFutures.scala:22) ~[reactivecouchbase-core_2.10-0.2-SNAPSHOT.jar:0.2-SNAPSHOT]
at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) ~[akka-actor_2.10-2.2.0.jar:2.2.0]
... 6 common frames omitted
Hi @kieranbenton
that's kind of a good news actually, the Timeout there !!!!
is something I added just for you, and basically it tells me that for those 4 request (and not for the 9994 others), the Java Couchbase driver never notified me that the concerned operations were success of failures.
Now I guess I can start to talk with @daschl to see what is happening and how to fix it.
Thanks for the log, it's been really helpful
@mathieuancelin let's sync up on that today so we can look for a fix in the february release. I really want to track this down if its in the java sdk!
@dashl yep, no problem
Great (sort of!) - let me know if you need me to re-run any tests or need more info.
On 22 January 2014 09:03, Mathieu ANCELIN notifications@github.com wrote:
@dashl https://github.com/dashl yep, no problem
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33003614 .
@kieranbenton I also couldn't repro it locally as of now. It would be awesome if you could provide a debug logging dump of a test run which shows the issue.
I don't know which logger you've configured, but here is the guide on how to enable them all :) http://nitschinger.at/Logging-with-the-Couchbase-Java-Client
Let me know if you need any help, DEBUG (or FINEST) logging may help us get forward on this thanks very much!
Hi Michael, No problem - I'll try and get that done tonight.
Cheers, Kieran
On 22 January 2014 16:36, Michael Nitschinger notifications@github.comwrote:
@kieranbenton https://github.com/kieranbenton I also couldn't repro it locally as of now. It would be awesome if you could provide a debug logging dump of a test run which shows the issue.
I don't know which logger you've configured, but here is the guide on how to enable them all :) http://nitschinger.at/Logging-with-the-Couchbase-Java-Client
Let me know if you need any help, DEBUG (or FINEST) logging may help us get forward on this thanks very much!
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33040576 .
Well - this is annoying! Although probably unsurprising...
I've turned all the logging up to DEBUG (and turned off the console logging, just to disk) and hammered away with gatling. I increased the number of gatling runs for as well to try and 'soak' it - I used 100,000 requests in multiple runs.
Sadly - I now do not get any failures. So the very act of turning the logging up has hidden the problem. To confirm - I turned logging back down to WARN again and can still reproduce it at will now, 9 out of 10000 failures on my last run.
So this is going to be pretty tricky to run down!
What do you want to try next?
Cheers, Kieran
On 22 January 2014 17:39, Kieran Benton me@kieranbenton.com wrote:
Hi Michael, No problem - I'll try and get that done tonight.
Cheers, Kieran
On 22 January 2014 16:36, Michael Nitschinger notifications@github.comwrote:
@kieranbenton https://github.com/kieranbenton I also couldn't repro it locally as of now. It would be awesome if you could provide a debug logging dump of a test run which shows the issue.
I don't know which logger you've configured, but here is the guide on how to enable them all :) http://nitschinger.at/Logging-with-the-Couchbase-Java-Client
Let me know if you need any help, DEBUG (or FINEST) logging may help us get forward on this thanks very much!
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33040576 .
@kieranbenton alright :) .. so my next guess would be to try to repro this with the java client only to narrow down the pieces floating around.
Could you try to whip up a small bare-bone example with the same workload against the java sdk? You can use scala, I dont mind :) .. just eliminate the possibility of errors with reactivecouchbase, akka, play,...
If you need any help getting this up and running let me know!
Lol - I'll do my best, I'm actually a C# dev at heart so you'll have to bear with me :)
Just to confirm - we're talking about using: "com.couchbase.client" %% "couchbase-client" % "1.3.1"
Correct?
On 23 January 2014 14:54, Michael Nitschinger notifications@github.comwrote:
@kieranbenton https://github.com/kieranbenton alright :) .. so my next guess would be to try to repro this with the java client only to narrow down the pieces floating around.
Could you try to whip up a small bare-bone example with the same workload against the java sdk? You can use scala, I dont mind :) .. just eliminate the possibility of errors with reactivecouchbase, akka, play,...
If you need any help getting this up and running let me know!
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33130113 .
Just one thing, to be accurate, the bare-bone example should use callbacks on Couchbase Futures and Bulk Get operations
@kieranbenton yes that's the 1.3.1 version
@kieranbenton something like that should be enough
package controllers
import play.api.mvc._
import java.net.URI
import com.couchbase.client.{CouchbaseClient, CouchbaseConnectionFactoryBuilder}
import scala.collection.mutable.ArrayBuffer
import collection.JavaConversions._
import scala.concurrent.Promise
import net.spy.memcached.internal.{BulkGetFuture, BulkGetCompletionListener}
object Application extends Controller {
lazy val bucket = connectBucket()
def connectBucket() = {
val uris = ArrayBuffer(URI.create(s"http://localhost:8091/pools"))
val cfb = new CouchbaseConnectionFactoryBuilder()
val cf = cfb.buildCouchbaseConnection(uris, "default", "", "")
new CouchbaseClient(cf)
}
def index = Action.async {
val promise = Promise[String]()
val future = bucket.asyncGetBulk("john-doe", "jane-doe")
future.addListener(new BulkGetCompletionListener {
def onComplete(p1: BulkGetFuture[_]) {
if (p1.isDone) {
if (p1.getStatus.isSuccess) {
promise.trySuccess(p1.getStatus.getMessage)
} else {
promise.tryFailure(new RuntimeException(p1.getStatus.getMessage))
}
} else {
promise.tryFailure(new RuntimeException("BulkGetFuture is not done"))
}
}
})
promise.future.map(Ok(_))
}
}
@kieranbenton or maybe something like the following to be more aggressive
package controllers
import play.api.mvc._
import java.net.URI
import com.couchbase.client.{CouchbaseClient, CouchbaseConnectionFactoryBuilder}
import scala.collection.mutable.ArrayBuffer
import collection.JavaConversions._
import scala.concurrent.{Future, Promise}
import net.spy.memcached.internal.{BulkFuture, BulkGetFuture, BulkGetCompletionListener}
import play.api.libs.concurrent.Execution.Implicits._
object Application extends Controller {
lazy val bucket = connectBucket()
def connectBucket() = {
val uris = ArrayBuffer(URI.create(s"http://localhost:8091/pools"))
val cfb = new CouchbaseConnectionFactoryBuilder()
val cf = cfb.buildCouchbaseConnection(uris, "default", "", "")
new CouchbaseClient(cf)
}
def toFuture(future: BulkFuture[java.util.Map[String, AnyRef]]): Future[String] = {
val promise = Promise[String]()
future.addListener(new BulkGetCompletionListener {
def onComplete(p1: BulkGetFuture[_]) {
if (p1.isDone) {
if (p1.getStatus.isSuccess) {
promise.trySuccess(p1.getStatus.getMessage)
} else {
promise.tryFailure(new RuntimeException(p1.getStatus.getMessage))
}
} else {
promise.tryFailure(new RuntimeException("BulkGetFuture is not done"))
}
}
})
promise.future
}
def index = Action.async {
val future: Future[String] = for {
result1 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe"))
result2 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe"))
result3 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe"))
result4 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe"))
} yield Seq(result1, result2, result3, result4).mkString(", ")
future.map(Ok(_))
}
}
Thanks Mathieu - it would have taken me a lot longer to get there :)
https://www.dropbox.com/sh/2hgoie49jhom2pg/KCu0-YJSZr
New test case is in the zip "play2-couchbase-test.zip".
Ok, so I've managed to eliminate reactive couchbase as a dependency and its still reproducible with logging on WARN - see "1 - with logging on WARN" in that link for gatlink report and empty logs.
After a LOT of gatling test runs with DEBUG logging turned on for "com.couchbase.client" and "net.spy.memcached" - I think I managed to catch ONE failed request. I've tried since and not been able to reproduce it again sadly :(. The gatlink report and chunky logs for this run are in "2 - with logging on DEBUG".
Any clues in those logs?
Cheers, Kieran
On 23 January 2014 15:53, Mathieu ANCELIN notifications@github.com wrote:
@kieranbenton https://github.com/kieranbenton or maybe something like the following to be more aggressive
package controllers import play.api.mvc.import java.net.URIimport com.couchbase.client.{CouchbaseClient, CouchbaseConnectionFactoryBuilder}import scala.collection.mutable.ArrayBufferimport collection.JavaConversions. import scala.concurrent.{Future, Promise}import net.spy.memcached.internal.{BulkFuture, BulkGetFuture, BulkGetCompletionListener}import play.api.libs.concurrent.Execution.Implicits._
object Application extends Controller {
lazy val bucket = connectBucket()
def connectBucket() = { val uris = ArrayBuffer(URI.create(s"http://localhost:8091/pools")) val cfb = new CouchbaseConnectionFactoryBuilder() val cf = cfb.buildCouchbaseConnection(uris, "default", "", "") new CouchbaseClient(cf) }
def toFuture(future: BulkFuture[java.util.Map[String, AnyRef]]): Future[String] = {
val promise = Promise[String]() future.addListener(new BulkGetCompletionListener { def onComplete(p1: BulkGetFuture[_]) { if (p1.isDone) { if (p1.getStatus.isSuccess) { promise.trySuccess(p1.getStatus.getMessage) } else { promise.tryFailure(new RuntimeException(p1.getStatus.getMessage)) } } else { promise.tryFailure(new RuntimeException("BulkGetFuture is not done")) } } }) promise.future
}
def index = Action.async { val future: Future[String] = for { result1 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe")) result2 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe")) result3 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe")) result4 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe")) } yield Seq(result1, result2, result3, result4).mkString(", ") future.map(Ok(_)) }}
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33135963 .
@kieranbenton thanks, I'll look at them.. one thing that comes to my mind while doing it.. can you do a simple change to the code and instead of doing the bulks (just go with warning so its easier to repro) and change it to regular 1-key get()'s with listeners?
I want to track it down if its our bulk get listeners of if there is a general bug with the listeners - that would help a lot.. thanks so much for your effort in helping tracking this down!
Do you mean "asyncGet(final String key)" or one of the other methods Michael? No problem.
On 23 January 2014 17:46, Michael Nitschinger notifications@github.comwrote:
@kieranbenton https://github.com/kieranbenton thanks, I'll look at them.. one thing that comes to my mind while doing it.. can you do a simple change to the code and instead of doing the bulks (just go with warning so its easier to repro) and change it to regular 1-key get()'s with listeners?
I want to track it down if its our bulk get listeners of if there is a general bug with the listeners - that would help a lot.. thanks so much for your effort in helping tracking this down!
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33148746 .
yes this one! its handled differently than the bulk..
from the debug logs I can tell that its not a bug in reading the data (we have as much reads as writes).. so the issue must be somewhere in between..
Ok - I've done that:
def testGet() = Action.async {
request => { val seenKey = seen.getKey() val statusCreatedKey = status.getKey(SessionStatusType.CREATED) val statusKey = status.getKey(SessionStatusType.STATUS)
val res = for {
r1 <- toFuture(bucket.asyncGet(seenKey))
r2 <- toFuture(bucket.asyncGet(statusCreatedKey))
//r3 <- toFuture(bucket.asyncGet(statusKey))
} yield Seq(r1, r2/*, r3*/)
res.map(x => Ok(x.toString))
} }
And I can still reliably reproduce it (in WARN at least). I've had to comment r3 out as that key doesn't actually exist and I was getting the error condition hit in 'toFuture'. Latest run had 8 failures out of 10000.
I thought that fetching the non-existent key might be part of the reproducible case - but it looks unrelated now!
Cheers.
On 23 January 2014 17:55, Michael Nitschinger notifications@github.comwrote:
yes this one! its handled differently than the bulk..
from the debug logs I can tell that its not a bug in reading the data (we have as much reads as writes).. so the issue must be somewhere in between..
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33149700 .
wow, thats really unexpected. thanks for now, need to think about it :)
Yes - very strange. I wonder why it doesn't show up on other platforms? I'm still a bit worried that I'm somehow calling the library wrong, so a quick review of my controller test code would be useful too.
Cheers.
On 23 January 2014 18:10, Michael Nitschinger notifications@github.comwrote:
wow, thats really unexpected. thanks for now, need to think about it :)
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33151266 .
I tried to repro the issue (on mac), but wasn't able to do so yet.
I "play start" ed the application and then used wrk to get it under heavy load. I can see around 45k ops/s on my node itself, and the output is:
$ wrk -t10 -c100 -d1m --latency -s count.lua http://localhost:9000/test
Running 1m test @ http://localhost:9000/test
10 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.42ms 376.72us 13.37ms 88.41%
Req/Sec 1.62k 103.42 2.50k 78.99%
Latency Distribution
50% 6.49ms
75% 6.62ms
90% 6.71ms
99% 7.07ms
933803 requests in 1.00m, 317.03MB read
Socket errors: connect 0, read 20, write 0, timeout 0
Requests/sec: 15563.61
Transfer/sec: 5.28MB
0
Where the "0" at the end identifies the number of http resonses > 399. Is this correct? In your test, what response code is returned for the failing ops? When I crank up the number of concurrent connections I can see read errors (but those are socket read errors, not app read errors).
I don't have a windows box handy right now, I'll update you if I have more findings.
I get generalised timeouts in both ab and gatling:
18:05:47.121 [WARN ] c.e.e.g.h.a.GatlingAsyncHandler - Request 'TestRequest' failed
java.util.concurrent.TimeoutException: Request reached time out of 60000 ms after 60001 ms
at com.ning.http.client.providers.netty.NettyAsyncHttpProvider$ReaperFuture.expire(NettyAsyncHttpProvider.java:1755) [async-http-client-1.7.21.jar:na]
at com.ning.http.client.providers.netty.NettyAsyncHttpProvider$ReaperFuture.run(NettyAsyncHttpProvider.java:1776) [async-http-client-1.7.21.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_40]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_40]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_40]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_40]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_40]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_40]
at java.lang.Thread.run(Thread.java:724) [na:1.7.0_40]
18:05:47.122 [WARN ] c.e.e.g.h.a.GatlingAsyncHandlerActor - Request 'TestRequest' failed : Request reached time out of 60000 ms after 60001 ms
I could fiddle with play timeouts etc to see what play would return but I doubt that will show us anything? I could try and take a dump of my app whilst its 'stuck' - if you guys could lead me through how to do that? JVM perf debugging isn't my strong point yet sadly.
Cheers, Kieran
@kieranbenton can you help how I can setup gatling with your exact run? I never used it, would be awesome thanks :)
Sure - theres a powershell script, test\run gatling.ps1
which will work on Windows, but the general commands should work x-platform:
$gatling = "C:\Program Files (x86)\Apache Software Foundation\gatling-charts-highcharts-1.5.3"
$env:GATLING_HOME = "$($gatling)"
mkdir "$($gatling)\user-files\simulations\play2couchbase"
cp play2couchbase-gatling.scala "$($gatling)\user-files\simulations\play2couchbase"
& "$($gatling)\bin\gatling.bat"
So that copies the test case test\play2couchbase-gatling.scala
into my installed gatling path - then runs gatling itself (its a command line scala app in its own right). From there you pick which simulation you want to run - so just select the one that you've just copied into the dir.
It will then run and output a directory with html based results in it.
If you want to be able to change the endpoint etc, edit the test\play2couchbase-gatling.scala
file - its fairly self-explanatory but let me know if you need a pointer :)
Weird (i cranked up the 50 a bit to magnify it, but still nothing)
================================================================================
2014-01-27 15:35:45 10s elapsed
---- NewNudge ------------------------------------------------------------------
Users : [#################################################################]100%
waiting:0 / running:0 / done:200
---- Requests ------------------------------------------------------------------
> Global OK=40000 KO=0
> TestRequest OK=40000 KO=0
================================================================================
Maybe this is really a windows issue, looks like I need to find a windows box somewhere :)
I can help with that if needs be Michael? We're going to try to narrow the test down a bit further as well (if we get time in our evenings) - as we now know its not a ReactiveCouchbase issue, going to try getting it into a command line app that you don't need gatling for.
We've got suspicions that we're starting to see it in our real app (not just the test case I've provided) on OS X too so my colleague is going to check that out.
On 27 January 2014 14:39, Michael Nitschinger notifications@github.comwrote:
Maybe this is really a windows issue, looks like I need to find a windows box somewhere :)
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33372485 .
@kieranbenton jup I'll ping you if needed, thanks!
@kieranbenton it's a wild guess, but can you try this spy jar instead of the regular one? https://dl.dropboxusercontent.com/u/10007675/spymemcached-2.10.4.jar
I might fixed a race condition here, but I can't repro so please try yourself :)
Hi Michael, Can you give me a bit of a pointer? Where should I be dropping that JAR to be picked up in my ivy cache or somewhere else in the play project (do you know Mathieu?). I want to make sure I'm definitely using your updated version!
Cheers, Kieran
On 27 January 2014 16:19, Michael Nitschinger notifications@github.comwrote:
@kieranbenton https://github.com/kieranbenton it's a wild guess, but can you try this spy jar instead of the regular one? https://dl.dropboxusercontent.com/u/10007675/spymemcached-2.10.4.jar
I might fixed a race condition here, but I can't repro so please try yourself :)
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33381870 .
Hi Kieran
Yes you should copy it inside your Ivy cache and re-run your tests
Sorry guys - been a busy couple of days. Should get chance to try this later.
Is there anything I can look for in the logs that will let me know im DEFINITELY using the new JAR? I don't want to be feeding you false information.
On 28 January 2014 11:50, Mathieu ANCELIN notifications@github.com wrote:
Hi Kieran
Yes you should copy it inside your Ivy cache and re-run your tests
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33471672 .
@kieranbenton hmm not really, no. But if you change it in the ivy cache it should work I hope :)
Damn, I'm confident that I'm using your JAR now - but no dice, I can still reliably reproduce the problem using both ab and gatling.
On 29 January 2014 16:32, Michael Nitschinger notifications@github.comwrote:
@kieranbenton https://github.com/kieranbenton hmm not really, no. But if you change it in the ivy cache it should work I hope :)
— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33601465 .
Hi Kieran,
I've just updated the current 0.2-SNAPSHOT to include a 'future double checking' feature. It should fix the issue (wihtout fixing it at the java driver level of course).
Can you test it ?
you just have to update your sbt deps and add couchbase.driver.doublecheck=true
to your configuration.
Sorry guys, I've had a tough couple of weeks - but I'm back now to help finish off tracking this down!
@mathieuancelin with the latest snapshot and the config option set - I am getting "successful" runs using my tests where individual requests almost immediately timeout instead of hanging at the end of the run. I get the following in the logs (this is against my 'main' app since the cutdown version no longer has reactivecouchbase as a dependency):
2014-02-10 15:55:14,565 ERROR (ForkJoinPool-3-worker-7) [c.t.h.a.ExceptionHandlingAction$] Unhandled exception: request = 'GET /nudge/tm'
java.util.concurrent.TimeoutException: null
at org.reactivecouchbase.client.CouchbaseFutures$$anonfun$timeout$1.apply$mcV$sp(CouchbaseFutures.scala:22) ~[reactivecouchbase-core_2.10-0.2-SNAPSHOT.jar:0.2-SNAPSHOT]
at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) ~[akka-actor_2.10-2.2.0.jar:2.2.0]
at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library.jar:na]
Is that what you're expecting?
@daschl are there any next steps you'd like me to try?
See https://github.com/mathieuancelin/play2-couchbase/issues/33 for background by @kieranbenton