ReactiveCouchbase / ReactiveCouchbase-core

Core library for ReactiveCouchbase
Apache License 2.0
64 stars 28 forks source link

Uncomplete requests for BulkGet under heavy load (and maybe Windows) #5

Closed mathieuancelin closed 9 years ago

mathieuancelin commented 10 years ago

See https://github.com/mathieuancelin/play2-couchbase/issues/33 for background by @kieranbenton

mathieuancelin commented 10 years ago

@kieranbenton I've just published a new version of 0.2 SNAPSHOT that include several config flags so you can play with it

first I noticed in your application.conf that you still use an execution-contextconfig block. Please use something like :

couchbase {  
  actorctx {
    timeout=60000 # that's important for tests, it's the timeout for couchbase operations
    ... 
  }
}

then you can play with the following flags :

couchbase {
  driver {
    useec=true # use RC execution context inside the Java driver to execute future callbacks
    blockinfutures=true # block operations inside scala futures
    block=true # block the current thread until operation is done. Works only if blockinfutures is true
    enableoperationtimeout=true # fail scala future and print a message on operations when the actorctx.timeout is reached
  }
}

You can try multiple configuration multiple time (with play start) to see if you always reproduce the issue.

I didn't manage to reproduce the issue but, given what you describe I guess that sometimes, java future callbacks are never triggered for some reason

kieranbenton commented 10 years ago

Hi Mathieu, Thanks - I'll try with those tweaks and let you know. For my own sanity what kind of environment were you running in? Windows version, x86/64, # cores and JVM version etc?

Cheers, Kieran

On 18 January 2014 18:35, Mathieu ANCELIN notifications@github.com wrote:

@kieranbenton https://github.com/kieranbenton I've just published a new version of 0.2 SNAPSHOT that include several config flags so you can play with it

first I noticed in your application.conf that you still use an execution-contextconfig block. Please use something like :

couchbase { actorctx { timeout=60000 # that's important for tests, it's the timeout for couchbase operations ... }}

then you can play with the following flags :

couchbase { driver { useec=true # use RC execution context inside the Java driver to execute future callbacks blockinfutures=true # block operations inside scala futures block=true # block the current thread until operation is done. Works only if blockinfutures is true enableoperationtimeout=true # fail scala future and print a message on operations when the actorctx.timeout is reached }}

You can try multiple configuration multiple time (with play start) to see if you always reproduce the issue.

I didn't manage to reproduce the issue but, given what you describe I guess that sometimes, java future callbacks are never triggered for some reason

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-32689058 .

mathieuancelin commented 10 years ago

Hi,

I'm running OS X 10.8.4 (no Windows @ home), on a core i5 processor (4 cores) with JDK 1.6.0_51 and JDK 1.7.0_25

kieranbenton commented 10 years ago

Hi Mathieu, That doesn't surprise me sadly - a member of our team is running OS X (older version than that) and we see no problems. Its just on Windows which is a bit bonkers. Do you want me to get a windows EC2 instance setup for you so you can see it breaking?

Cheers, Kieran

On 18 January 2014 19:54, Mathieu ANCELIN notifications@github.com wrote:

Hi,

I'm running OS X 10.8.4 (no Windows @ home), on a core i5 processor (4 cores) with JDK 1.6.0_51 and JDK 1.7.0_25

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-32691263 .

mathieuancelin commented 10 years ago

Hi,

no need for an EC2 instance right now, I'll try tomorrow on a Windows machine at work.

If you test the last version, the most important flag is enableoperationtimeout=true. If with that flag you got error messages, then I guess we can say that the issue is located.

kieranbenton commented 10 years ago

Ok great - will give that a go.

On 19 January 2014 13:00, Mathieu ANCELIN notifications@github.com wrote:

Hi,

no need for an EC2 instance right now, I'll try tomorrow on a Windows machine at work.

If you test the last version, the most important flag is enableoperationtimeout=true. If with that flag you got error messages, then I guess we can say that the issue is located.

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-32707709 .

kieranbenton commented 10 years ago

Ok - so I've updated to the new snapshot, and modified my configuration a couple of times and rerun my tests.

1) with the exact config you specified:

couchbase {
# fail Future[T] if an operation fails
failfutures=true
actorctx {
    timeout=1000
    execution-context {
        fork-join-executor {
            parallelism-factor = 20.0
            parallelism-max = 200
        }
    }
}
driver {
    useec=true # use RC execution context inside the Java driver to execute future callbacks
    blockinfutures=true # block operations inside scala futures
    block=true # block the current thread until operation is done. Works only if blockinfutures is true
    enableoperationtimeout=true # fail scala future and print a message on operations when the actorctx.timeout is reached
}
evolutions {
    disabled = false
    documents = db/couchbase
    use.locks = true
}
}

This gives me a lot of: 2014-01-19 14:10:35,863 WARN [ReactiveCouchbase] Current thread blocking operation

And much slower execution - but no sign of the 'sticking' problem. Ab tests (after running them 5x) don't seem to exhibit the problem.

2) turning off both 'blocking' options:

driver {
    useec=true # use RC execution context inside the Java driver to execute future callbacks
    blockinfutures=false # block operations inside scala futures
    block=false # block the current thread until operation is done. Works only if blockinfutures is true
    enableoperationtimeout=true # fail scala future and print a message on operations when the actorctx.timeout is reached
}

Gives me no outputted logs per request - and very good performance, but DOES create timeouts in the logs - is this essentially the problem? Ab completes within normal time limits but I suppose thats now because the library is throwing exceptions on timeouts and finishing the request?

Stack traces all look the same:

play.api.Application$$anon$1: Execution exception[[TimeoutException: null]]
    at play.api.Application$class.handleError(Application.scala:293) ~[play_2.10-2.2.1.jar:2.2.1]
    at play.api.DefaultApplication.handleError(Application.scala:399) [play_2.10-2.2.1.jar:2.2.1]
    at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$12$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:165) [play_2.10-2.2.1.jar:2.2.1]
    at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$12$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:162) [play_2.10-2.2.1.jar:2.2.1]
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) [scala-library.jar:na]
    at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) [scala-library.jar:na]
    at scala.util.Try$.apply(Try.scala:161) [scala-library.jar:na]
    at scala.util.Failure.recover(Try.scala:185) [scala-library.jar:na]
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:387) [scala-library.jar:na]
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:387) [scala-library.jar:na]
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:29) [scala-library.jar:na]
    at play.api.libs.iteratee.Execution$$anon$1.execute(Execution.scala:43) [play-iteratees_2.10-2.2.1.jar:2.2.1]
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:37) [scala-library.jar:na]
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:133) [scala-library.jar:na]
    at scala.concurrent.Promise$class.complete(Promise.scala:55) [scala-library.jar:na]
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:58) [scala-library.jar:na]
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:254) [scala-library.jar:na]
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:249) [scala-library.jar:na]
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:29) [scala-library.jar:na]
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) [scala-library.jar:na]
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library.jar:na]
Caused by: java.util.concurrent.TimeoutException: null
    at org.reactivecouchbase.client.CouchbaseFutures$$anonfun$timeout$1.apply$mcV$sp(CouchbaseFutures.scala:22) ~[reactivecouchbase-core_2.10-0.2-SNAPSHOT.jar:0.2-SNAPSHOT]
    at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) ~[akka-actor_2.10-2.2.0.jar:2.2.0]

3) then upping the timeout to 5000ms

Pretty much the same as for (2) as you'd expect.

Seem to be getting closer here? :)

mathieuancelin commented 10 years ago

Ok,

your actorctx configuration is a bit rough, and 1000 milliseconds for timeout a bit low, but if when you set the blocking flags to true, then the issue is clearly either on the ReactiveCouchbase future callbacks or in the future callback mechanism of the Java driver.

Can you re-run your tests (without blocking, and with proper configuration) with the latest SNAPSHOT (just a few minutes old), and if you don't get messages like

xxxFuture didn't complete the promise, that's weird !!!'

then the issue is located in the future callback mechanism from the java driver.

giabao commented 10 years ago

Hello,

I have used CbFutureAsScala in production at my company for some time.

Recently, I found the same errors (as in this issue) with client.asyncGetBulk(..).asScala I also found an error when AbstractListenableFuture#notifyListener call executor.submit(...)

So, I try to convert some methods of MemcachedClient from using java's Future to scala's Future.

The result is class MemcachedClientScala and a test spec MemcachedClientScalaSpec.

I have not done migrating all methods of MemcachedClient. But asyncGet & asyncGetBulk is working now.

Can any of you change your test code to use MemcachedClientScala.scalaGetBulk and let me know if it is passed or not.

I am too busy now. So if anyone can fork my working and continue migrating other MemcachedClient's methods => I will very wellcome :D

Thank you.

mathieuancelin commented 10 years ago

You mean that sometimes you have request hanging forever ?

mathieuancelin commented 10 years ago

Or the double completion ?

kieranbenton commented 10 years ago

Ok, so I've:

  1. Updated my actorctx configuration section to the one specified at: https://github.com/ReactiveCouchbase/ReactiveCouchbase-play
  2. Set both of the blocking config options to 'false'
  3. Run the test app using play start and 'warmed it' by hitting the URL a couple of times

I've executed my gatling test I sent to you - and think I've still got the problem (ab also does the same thing).

Gatling shows the same weird results, 9994 successful requests out of 10,000: https://www.dropbox.com/s/nmosbcdl0ma8yij/couchsimulation-20140121174457.zip

Which tallies with the number (6) of these that I get in the logs (https://www.dropbox.com/s/2l3tpggg4sgtg8j/application.log).

Which from your message you think is a problem in the driver itself?

2014-01-21 17:45:28,622 ERROR [ReactiveCouchbase]

Timeout there !!!!

2014-01-21 17:45:28,623 ERROR [application]

! @6h2h78kab - Internal server error, for (GET) [/test] ->

play.api.Application$$anon$1: Execution exception[[TimeoutException: null]]
    at play.api.Application$class.handleError(Application.scala:293) ~[play_2.10-2.2.1.jar:2.2.1]
    at play.api.DefaultApplication.handleError(Application.scala:399) [play_2.10-2.2.1.jar:2.2.1]
    at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$12$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:165) [play_2.10-2.2.1.jar:2.2.1]
    at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$12$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:162) [play_2.10-2.2.1.jar:2.2.1]
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) [scala-library.jar:na]
    at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) [scala-library.jar:na]
    at scala.util.Try$.apply(Try.scala:161) [scala-library.jar:na]
    at scala.util.Failure.recover(Try.scala:185) [scala-library.jar:na]
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:387) [scala-library.jar:na]
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:387) [scala-library.jar:na]
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:29) [scala-library.jar:na]
    at play.api.libs.iteratee.Execution$$anon$1.execute(Execution.scala:43) [play-iteratees_2.10-2.2.1.jar:2.2.1]
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:37) [scala-library.jar:na]
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:133) [scala-library.jar:na]
    at scala.concurrent.Promise$class.complete(Promise.scala:55) [scala-library.jar:na]
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:58) [scala-library.jar:na]
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:254) [scala-library.jar:na]
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:249) [scala-library.jar:na]
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:29) [scala-library.jar:na]
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) [scala-library.jar:na]
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) [akka-actor_2.10-2.2.0.jar:2.2.0]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library.jar:na]
Caused by: java.util.concurrent.TimeoutException: null
    at org.reactivecouchbase.client.CouchbaseFutures$$anonfun$timeout$1.apply$mcV$sp(CouchbaseFutures.scala:22) ~[reactivecouchbase-core_2.10-0.2-SNAPSHOT.jar:0.2-SNAPSHOT]
    at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) ~[akka-actor_2.10-2.2.0.jar:2.2.0]
    ... 6 common frames omitted
mathieuancelin commented 10 years ago

Hi @kieranbenton

that's kind of a good news actually, the Timeout there !!!! is something I added just for you, and basically it tells me that for those 4 request (and not for the 9994 others), the Java Couchbase driver never notified me that the concerned operations were success of failures.

Now I guess I can start to talk with @daschl to see what is happening and how to fix it.

Thanks for the log, it's been really helpful

daschl commented 10 years ago

@mathieuancelin let's sync up on that today so we can look for a fix in the february release. I really want to track this down if its in the java sdk!

mathieuancelin commented 10 years ago

@dashl yep, no problem

kieranbenton commented 10 years ago

Great (sort of!) - let me know if you need me to re-run any tests or need more info.

On 22 January 2014 09:03, Mathieu ANCELIN notifications@github.com wrote:

@dashl https://github.com/dashl yep, no problem

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33003614 .

daschl commented 10 years ago

@kieranbenton I also couldn't repro it locally as of now. It would be awesome if you could provide a debug logging dump of a test run which shows the issue.

I don't know which logger you've configured, but here is the guide on how to enable them all :) http://nitschinger.at/Logging-with-the-Couchbase-Java-Client

Let me know if you need any help, DEBUG (or FINEST) logging may help us get forward on this thanks very much!

kieranbenton commented 10 years ago

Hi Michael, No problem - I'll try and get that done tonight.

Cheers, Kieran

On 22 January 2014 16:36, Michael Nitschinger notifications@github.comwrote:

@kieranbenton https://github.com/kieranbenton I also couldn't repro it locally as of now. It would be awesome if you could provide a debug logging dump of a test run which shows the issue.

I don't know which logger you've configured, but here is the guide on how to enable them all :) http://nitschinger.at/Logging-with-the-Couchbase-Java-Client

Let me know if you need any help, DEBUG (or FINEST) logging may help us get forward on this thanks very much!

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33040576 .

kieranbenton commented 10 years ago

Well - this is annoying! Although probably unsurprising...

I've turned all the logging up to DEBUG (and turned off the console logging, just to disk) and hammered away with gatling. I increased the number of gatling runs for as well to try and 'soak' it - I used 100,000 requests in multiple runs.

Sadly - I now do not get any failures. So the very act of turning the logging up has hidden the problem. To confirm - I turned logging back down to WARN again and can still reproduce it at will now, 9 out of 10000 failures on my last run.

So this is going to be pretty tricky to run down!

What do you want to try next?

Cheers, Kieran

On 22 January 2014 17:39, Kieran Benton me@kieranbenton.com wrote:

Hi Michael, No problem - I'll try and get that done tonight.

Cheers, Kieran

On 22 January 2014 16:36, Michael Nitschinger notifications@github.comwrote:

@kieranbenton https://github.com/kieranbenton I also couldn't repro it locally as of now. It would be awesome if you could provide a debug logging dump of a test run which shows the issue.

I don't know which logger you've configured, but here is the guide on how to enable them all :) http://nitschinger.at/Logging-with-the-Couchbase-Java-Client

Let me know if you need any help, DEBUG (or FINEST) logging may help us get forward on this thanks very much!

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33040576 .

daschl commented 10 years ago

@kieranbenton alright :) .. so my next guess would be to try to repro this with the java client only to narrow down the pieces floating around.

Could you try to whip up a small bare-bone example with the same workload against the java sdk? You can use scala, I dont mind :) .. just eliminate the possibility of errors with reactivecouchbase, akka, play,...

If you need any help getting this up and running let me know!

kieranbenton commented 10 years ago

Lol - I'll do my best, I'm actually a C# dev at heart so you'll have to bear with me :)

Just to confirm - we're talking about using: "com.couchbase.client" %% "couchbase-client" % "1.3.1"

Correct?

On 23 January 2014 14:54, Michael Nitschinger notifications@github.comwrote:

@kieranbenton https://github.com/kieranbenton alright :) .. so my next guess would be to try to repro this with the java client only to narrow down the pieces floating around.

Could you try to whip up a small bare-bone example with the same workload against the java sdk? You can use scala, I dont mind :) .. just eliminate the possibility of errors with reactivecouchbase, akka, play,...

If you need any help getting this up and running let me know!

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33130113 .

mathieuancelin commented 10 years ago

Just one thing, to be accurate, the bare-bone example should use callbacks on Couchbase Futures and Bulk Get operations

mathieuancelin commented 10 years ago

@kieranbenton yes that's the 1.3.1 version

mathieuancelin commented 10 years ago

@kieranbenton something like that should be enough


package controllers

import play.api.mvc._
import java.net.URI
import com.couchbase.client.{CouchbaseClient, CouchbaseConnectionFactoryBuilder}
import scala.collection.mutable.ArrayBuffer
import collection.JavaConversions._
import scala.concurrent.Promise
import net.spy.memcached.internal.{BulkGetFuture, BulkGetCompletionListener}

object Application extends Controller {

  lazy val bucket = connectBucket()

  def connectBucket() = {
    val uris = ArrayBuffer(URI.create(s"http://localhost:8091/pools"))
    val cfb = new CouchbaseConnectionFactoryBuilder()
    val cf = cfb.buildCouchbaseConnection(uris, "default", "", "")
    new CouchbaseClient(cf)
  }

  def index = Action.async {
    val promise = Promise[String]()
    val future = bucket.asyncGetBulk("john-doe", "jane-doe")
    future.addListener(new BulkGetCompletionListener {
      def onComplete(p1: BulkGetFuture[_]) {
        if (p1.isDone) {
          if (p1.getStatus.isSuccess) {
            promise.trySuccess(p1.getStatus.getMessage)
          } else {
            promise.tryFailure(new RuntimeException(p1.getStatus.getMessage))
          }
        } else {
          promise.tryFailure(new RuntimeException("BulkGetFuture is not done"))
        }
      }
    })
    promise.future.map(Ok(_))
  }
}
mathieuancelin commented 10 years ago

@kieranbenton or maybe something like the following to be more aggressive


package controllers

import play.api.mvc._
import java.net.URI
import com.couchbase.client.{CouchbaseClient, CouchbaseConnectionFactoryBuilder}
import scala.collection.mutable.ArrayBuffer
import collection.JavaConversions._
import scala.concurrent.{Future, Promise}
import net.spy.memcached.internal.{BulkFuture, BulkGetFuture, BulkGetCompletionListener}
import play.api.libs.concurrent.Execution.Implicits._

object Application extends Controller {

  lazy val bucket = connectBucket()

  def connectBucket() = {
    val uris = ArrayBuffer(URI.create(s"http://localhost:8091/pools"))
    val cfb = new CouchbaseConnectionFactoryBuilder()
    val cf = cfb.buildCouchbaseConnection(uris, "default", "", "")
    new CouchbaseClient(cf)
  }

  def toFuture(future: BulkFuture[java.util.Map[String, AnyRef]]): Future[String] = {
    val promise = Promise[String]()
    future.addListener(new BulkGetCompletionListener {
      def onComplete(p1: BulkGetFuture[_]) {
        if (p1.isDone) {
          if (p1.getStatus.isSuccess) {
            promise.trySuccess(p1.getStatus.getMessage)
          } else {
            promise.tryFailure(new RuntimeException(p1.getStatus.getMessage))
          }
        } else {
          promise.tryFailure(new RuntimeException("BulkGetFuture is not done"))
        }
      }
    })
    promise.future
  }

  def index = Action.async {
    val future: Future[String] = for {
      result1 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe"))
      result2 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe"))
      result3 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe"))
      result4 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe"))
    } yield Seq(result1, result2, result3, result4).mkString(", ")
    future.map(Ok(_))
  }
}
kieranbenton commented 10 years ago

Thanks Mathieu - it would have taken me a lot longer to get there :)

https://www.dropbox.com/sh/2hgoie49jhom2pg/KCu0-YJSZr

New test case is in the zip "play2-couchbase-test.zip".

Ok, so I've managed to eliminate reactive couchbase as a dependency and its still reproducible with logging on WARN - see "1 - with logging on WARN" in that link for gatlink report and empty logs.

After a LOT of gatling test runs with DEBUG logging turned on for "com.couchbase.client" and "net.spy.memcached" - I think I managed to catch ONE failed request. I've tried since and not been able to reproduce it again sadly :(. The gatlink report and chunky logs for this run are in "2 - with logging on DEBUG".

Any clues in those logs?

Cheers, Kieran

On 23 January 2014 15:53, Mathieu ANCELIN notifications@github.com wrote:

@kieranbenton https://github.com/kieranbenton or maybe something like the following to be more aggressive

package controllers import play.api.mvc.import java.net.URIimport com.couchbase.client.{CouchbaseClient, CouchbaseConnectionFactoryBuilder}import scala.collection.mutable.ArrayBufferimport collection.JavaConversions. import scala.concurrent.{Future, Promise}import net.spy.memcached.internal.{BulkFuture, BulkGetFuture, BulkGetCompletionListener}import play.api.libs.concurrent.Execution.Implicits._

object Application extends Controller {

lazy val bucket = connectBucket()

def connectBucket() = { val uris = ArrayBuffer(URI.create(s"http://localhost:8091/pools")) val cfb = new CouchbaseConnectionFactoryBuilder() val cf = cfb.buildCouchbaseConnection(uris, "default", "", "") new CouchbaseClient(cf) }

def toFuture(future: BulkFuture[java.util.Map[String, AnyRef]]): Future[String] = {

val promise = Promise[String]()

future.addListener(new BulkGetCompletionListener {
  def onComplete(p1: BulkGetFuture[_]) {
    if (p1.isDone) {
      if (p1.getStatus.isSuccess) {
        promise.trySuccess(p1.getStatus.getMessage)
      } else {
        promise.tryFailure(new RuntimeException(p1.getStatus.getMessage))
      }
    } else {
      promise.tryFailure(new RuntimeException("BulkGetFuture is not done"))
    }
  }
})
promise.future

}

def index = Action.async { val future: Future[String] = for { result1 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe")) result2 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe")) result3 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe")) result4 <- toFuture(bucket.asyncGetBulk("john-doe", "jane-doe")) } yield Seq(result1, result2, result3, result4).mkString(", ") future.map(Ok(_)) }}

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33135963 .

daschl commented 10 years ago

@kieranbenton thanks, I'll look at them.. one thing that comes to my mind while doing it.. can you do a simple change to the code and instead of doing the bulks (just go with warning so its easier to repro) and change it to regular 1-key get()'s with listeners?

I want to track it down if its our bulk get listeners of if there is a general bug with the listeners - that would help a lot.. thanks so much for your effort in helping tracking this down!

kieranbenton commented 10 years ago

Do you mean "asyncGet(final String key)" or one of the other methods Michael? No problem.

On 23 January 2014 17:46, Michael Nitschinger notifications@github.comwrote:

@kieranbenton https://github.com/kieranbenton thanks, I'll look at them.. one thing that comes to my mind while doing it.. can you do a simple change to the code and instead of doing the bulks (just go with warning so its easier to repro) and change it to regular 1-key get()'s with listeners?

I want to track it down if its our bulk get listeners of if there is a general bug with the listeners - that would help a lot.. thanks so much for your effort in helping tracking this down!

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33148746 .

daschl commented 10 years ago

yes this one! its handled differently than the bulk..

from the debug logs I can tell that its not a bug in reading the data (we have as much reads as writes).. so the issue must be somewhere in between..

kieranbenton commented 10 years ago

Ok - I've done that:

def testGet() = Action.async {

request => { val seenKey = seen.getKey() val statusCreatedKey = status.getKey(SessionStatusType.CREATED) val statusKey = status.getKey(SessionStatusType.STATUS)

  val res = for {
    r1 <- toFuture(bucket.asyncGet(seenKey))
    r2 <- toFuture(bucket.asyncGet(statusCreatedKey))
    //r3 <- toFuture(bucket.asyncGet(statusKey))
  } yield Seq(r1, r2/*, r3*/)

  res.map(x => Ok(x.toString))

} }

And I can still reliably reproduce it (in WARN at least). I've had to comment r3 out as that key doesn't actually exist and I was getting the error condition hit in 'toFuture'. Latest run had 8 failures out of 10000.

I thought that fetching the non-existent key might be part of the reproducible case - but it looks unrelated now!

Cheers.

On 23 January 2014 17:55, Michael Nitschinger notifications@github.comwrote:

yes this one! its handled differently than the bulk..

from the debug logs I can tell that its not a bug in reading the data (we have as much reads as writes).. so the issue must be somewhere in between..

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33149700 .

daschl commented 10 years ago

wow, thats really unexpected. thanks for now, need to think about it :)

kieranbenton commented 10 years ago

Yes - very strange. I wonder why it doesn't show up on other platforms? I'm still a bit worried that I'm somehow calling the library wrong, so a quick review of my controller test code would be useful too.

Cheers.

On 23 January 2014 18:10, Michael Nitschinger notifications@github.comwrote:

wow, thats really unexpected. thanks for now, need to think about it :)

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33151266 .

daschl commented 10 years ago

I tried to repro the issue (on mac), but wasn't able to do so yet.

I "play start" ed the application and then used wrk to get it under heavy load. I can see around 45k ops/s on my node itself, and the output is:

$ wrk -t10 -c100 -d1m --latency -s count.lua  http://localhost:9000/test
Running 1m test @ http://localhost:9000/test
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.42ms  376.72us  13.37ms   88.41%
    Req/Sec     1.62k   103.42     2.50k    78.99%
  Latency Distribution
     50%    6.49ms
     75%    6.62ms
     90%    6.71ms
     99%    7.07ms
  933803 requests in 1.00m, 317.03MB read
  Socket errors: connect 0, read 20, write 0, timeout 0
Requests/sec:  15563.61
Transfer/sec:      5.28MB
0

Where the "0" at the end identifies the number of http resonses > 399. Is this correct? In your test, what response code is returned for the failing ops? When I crank up the number of concurrent connections I can see read errors (but those are socket read errors, not app read errors).

I don't have a windows box handy right now, I'll update you if I have more findings.

kieranbenton commented 10 years ago

I get generalised timeouts in both ab and gatling:

18:05:47.121 [WARN ] c.e.e.g.h.a.GatlingAsyncHandler - Request 'TestRequest' failed
java.util.concurrent.TimeoutException: Request reached time out of 60000 ms after 60001 ms
    at com.ning.http.client.providers.netty.NettyAsyncHttpProvider$ReaperFuture.expire(NettyAsyncHttpProvider.java:1755) [async-http-client-1.7.21.jar:na]
    at com.ning.http.client.providers.netty.NettyAsyncHttpProvider$ReaperFuture.run(NettyAsyncHttpProvider.java:1776) [async-http-client-1.7.21.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_40]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_40]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_40]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_40]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_40]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_40]
    at java.lang.Thread.run(Thread.java:724) [na:1.7.0_40]
18:05:47.122 [WARN ] c.e.e.g.h.a.GatlingAsyncHandlerActor - Request 'TestRequest' failed : Request reached time out of 60000 ms after 60001 ms

I could fiddle with play timeouts etc to see what play would return but I doubt that will show us anything? I could try and take a dump of my app whilst its 'stuck' - if you guys could lead me through how to do that? JVM perf debugging isn't my strong point yet sadly.

Cheers, Kieran

daschl commented 10 years ago

@kieranbenton can you help how I can setup gatling with your exact run? I never used it, would be awesome thanks :)

kieranbenton commented 10 years ago

Sure - theres a powershell script, test\run gatling.ps1 which will work on Windows, but the general commands should work x-platform:

$gatling = "C:\Program Files (x86)\Apache Software Foundation\gatling-charts-highcharts-1.5.3"
$env:GATLING_HOME = "$($gatling)"
mkdir "$($gatling)\user-files\simulations\play2couchbase"
cp  play2couchbase-gatling.scala "$($gatling)\user-files\simulations\play2couchbase"
& "$($gatling)\bin\gatling.bat"

So that copies the test case test\play2couchbase-gatling.scala into my installed gatling path - then runs gatling itself (its a command line scala app in its own right). From there you pick which simulation you want to run - so just select the one that you've just copied into the dir.

It will then run and output a directory with html based results in it.

kieranbenton commented 10 years ago

If you want to be able to change the endpoint etc, edit the test\play2couchbase-gatling.scala file - its fairly self-explanatory but let me know if you need a pointer :)

daschl commented 10 years ago

Weird (i cranked up the 50 a bit to magnify it, but still nothing)

================================================================================
2014-01-27 15:35:45                                                  10s elapsed
---- NewNudge ------------------------------------------------------------------
Users  : [#################################################################]100%
          waiting:0     / running:0     / done:200  
---- Requests ------------------------------------------------------------------
> Global                                                     OK=40000  KO=0     
> TestRequest                                                OK=40000  KO=0     
================================================================================
daschl commented 10 years ago

Maybe this is really a windows issue, looks like I need to find a windows box somewhere :)

kieranbenton commented 10 years ago

I can help with that if needs be Michael? We're going to try to narrow the test down a bit further as well (if we get time in our evenings) - as we now know its not a ReactiveCouchbase issue, going to try getting it into a command line app that you don't need gatling for.

We've got suspicions that we're starting to see it in our real app (not just the test case I've provided) on OS X too so my colleague is going to check that out.

On 27 January 2014 14:39, Michael Nitschinger notifications@github.comwrote:

Maybe this is really a windows issue, looks like I need to find a windows box somewhere :)

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33372485 .

daschl commented 10 years ago

@kieranbenton jup I'll ping you if needed, thanks!

daschl commented 10 years ago

@kieranbenton it's a wild guess, but can you try this spy jar instead of the regular one? https://dl.dropboxusercontent.com/u/10007675/spymemcached-2.10.4.jar

I might fixed a race condition here, but I can't repro so please try yourself :)

kieranbenton commented 10 years ago

Hi Michael, Can you give me a bit of a pointer? Where should I be dropping that JAR to be picked up in my ivy cache or somewhere else in the play project (do you know Mathieu?). I want to make sure I'm definitely using your updated version!

Cheers, Kieran

On 27 January 2014 16:19, Michael Nitschinger notifications@github.comwrote:

@kieranbenton https://github.com/kieranbenton it's a wild guess, but can you try this spy jar instead of the regular one? https://dl.dropboxusercontent.com/u/10007675/spymemcached-2.10.4.jar

I might fixed a race condition here, but I can't repro so please try yourself :)

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33381870 .

mathieuancelin commented 10 years ago

Hi Kieran

Yes you should copy it inside your Ivy cache and re-run your tests

kieranbenton commented 10 years ago

Sorry guys - been a busy couple of days. Should get chance to try this later.

Is there anything I can look for in the logs that will let me know im DEFINITELY using the new JAR? I don't want to be feeding you false information.

On 28 January 2014 11:50, Mathieu ANCELIN notifications@github.com wrote:

Hi Kieran

Yes you should copy it inside your Ivy cache and re-run your tests

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33471672 .

daschl commented 10 years ago

@kieranbenton hmm not really, no. But if you change it in the ivy cache it should work I hope :)

kieranbenton commented 10 years ago

Damn, I'm confident that I'm using your JAR now - but no dice, I can still reliably reproduce the problem using both ab and gatling.

On 29 January 2014 16:32, Michael Nitschinger notifications@github.comwrote:

@kieranbenton https://github.com/kieranbenton hmm not really, no. But if you change it in the ivy cache it should work I hope :)

— Reply to this email directly or view it on GitHubhttps://github.com/ReactiveCouchbase/ReactiveCouchbase-core/issues/5#issuecomment-33601465 .

mathieuancelin commented 10 years ago

Hi Kieran,

I've just updated the current 0.2-SNAPSHOT to include a 'future double checking' feature. It should fix the issue (wihtout fixing it at the java driver level of course).

Can you test it ?

you just have to update your sbt deps and add couchbase.driver.doublecheck=true to your configuration.

kieranbenton commented 10 years ago

Sorry guys, I've had a tough couple of weeks - but I'm back now to help finish off tracking this down!

@mathieuancelin with the latest snapshot and the config option set - I am getting "successful" runs using my tests where individual requests almost immediately timeout instead of hanging at the end of the run. I get the following in the logs (this is against my 'main' app since the cutdown version no longer has reactivecouchbase as a dependency):

2014-02-10 15:55:14,565 ERROR (ForkJoinPool-3-worker-7) [c.t.h.a.ExceptionHandlingAction$] Unhandled exception: request = 'GET /nudge/tm'
java.util.concurrent.TimeoutException: null
    at org.reactivecouchbase.client.CouchbaseFutures$$anonfun$timeout$1.apply$mcV$sp(CouchbaseFutures.scala:22) ~[reactivecouchbase-core_2.10-0.2-SNAPSHOT.jar:0.2-SNAPSHOT]
    at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) ~[akka-actor_2.10-2.2.0.jar:2.2.0]
    at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library.jar:na]

Is that what you're expecting?

kieranbenton commented 10 years ago

@daschl are there any next steps you'd like me to try?