bblfsh / bblfshd

A self-hosted server for source code parsing
https://doc.bblf.sh
GNU General Public License v3.0

Growing number of driver processes #130

Closed: bzz closed this issue 6 years ago

bzz commented 6 years ago

While getting UASTs and filtering for identifiers in the Python files of a single project using Engine, after ~30 minutes I can see 350+ driver processes inside the bblfshd container.

Logs in detail:

`ps`:

```
root      2169  0.0  0.0  18188  3188 pts/0   Ss   08:43   0:00 bash
root      2177  0.0  0.0      0     0 ?       Z    08:43   0:00 [runc:[1:CHILD]]
root      2203  0.0  0.0      0     0 ?       Z    08:43   0:00 [runc:[1:CHILD]]
root      2228  0.0  0.0      0     0 ?       Z    08:43   0:00 [runc:[1:CHILD]]
root      2249  0.0  0.0      0     0 ?       Z    08:43   0:00 [runc:[1:CHILD]]
root      2269  0.0  0.0      0     0 ?       Z    08:44   0:00 [runc:[1:CHILD]]
root      2473  0.0  0.0      0     0 ?       Z    08:44   0:00 [runc:[1:CHILD]]
root      2561  0.0  0.0      0     0 ?       Z    08:44   0:00 [runc:[1:CHILD]]
root      2562 28.4  0.7  36036 28552 ?       Ssl  08:44   0:01 /opt/driver/bin/driver --log-level info --log-format text -
root      2572 30.7  0.6  81092 27848 ?       S    08:44   0:02 /usr/bin/python3.6 /usr/bin/python_driver
```

Container log:

```
time="2017-11-16T08:48:35Z" level=info msg="python-driver version: dev-1908ca8 (build: 2017-11-14T11:31:28Z)" id=01bz207xxmc18dppgxwgywr5zs language=python
time="2017-11-16T08:48:35Z" level=info msg="server listening in /tmp/rpc.sock (unix)" id=01bz207xxmc18dppgxwgywr5zs language=python
time="2017-11-16T08:48:36Z" level=info msg="new driver instance started bblfsh/python-driver:latest (01bz207xxmc18dppgxwgywr5zs)"
time="2017-11-16T08:49:07Z" level=info msg="python-driver version: dev-1908ca8 (build: 2017-11-14T11:31:28Z)" id=01bz208xey432evff0pga9dnxr language=python
time="2017-11-16T08:49:07Z" level=info msg="server listening in /tmp/rpc.sock (unix)" id=01bz208xey432evff0pga9dnxr language=python
time="2017-11-16T12:24:51Z" level=error msg="error re-scaling pool: container is not destroyed" language=python
```

Apache Spark thread dump:

```
org.bblfsh.client.BblfshClient.filter(BblfshClient.scala:33)
tech.sourced.engine.udf.QueryXPathUDF$$anonfun$queryXPath$2.apply(QueryXPathUDF.scala:45)
tech.sourced.engine.udf.QueryXPathUDF$$anonfun$queryXPath$2.apply(QueryXPathUDF.scala:44)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
tech.sourced.engine.udf.QueryXPathUDF$.queryXPath(QueryXPathUDF.scala:44)
```


Steps to reproduce, using 30 concurrent clients:

// get Borges from https://github.com/src-d/borges/releases/tag/v0.8.3
echo -e "https://github.com/src-d/borges.git\nhttps://github.com/erizocosmico/borges.git\nhttps://github.com/jelmer/dulwich.git" > repos.txt
borges pack --loglevel=debug --workers=2 --to=./repos -f repos.txt

// get Apache Spark https://github.com/src-d/engine#quick-start
$SPARK_HOME/bin/spark-shell --driver-memory=4g --packages "tech.sourced:engine:0.1.7"

then run `:paste`, paste the code below, and hit Ctrl+D:

import tech.sourced.engine._

val engine = Engine(spark, "repos")
val repos = engine.getRepositories
val refs = repos.getHEAD.withColumnRenamed("hash","commit_hash")

val langs = refs.getFiles.classifyLanguages
val pyTokens = langs
  .where('lang === "Python")
  .extractUASTs.queryUAST("//*[@roleIdentifier]", "uast", "result")
  .extractTokens("result", "tokens")

val tokensToWrite = pyTokens
  .join(refs, "commit_hash")
  .select('repository_id, 'name, 'commit_hash, 'file_hash, 'path, 'lang, 'tokens)

spark.conf.set("spark.sql.shuffle.partitions", "30") // instead of the default 200
tokensToWrite.show

Then, after exec'ing into the bblfshd container, one can see the number of driver processes growing:

apt-get update && apt-get install -y procps
ps aux | wc -l
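
The growth is also easy to watch from the host. A minimal sketch, assuming the container was started with `--name bblfshd` and already has procps installed via the step above (both assumptions, adjust to your setup):

```bash
# Hypothetical watcher: print a timestamped process count from inside the
# container every 30 seconds, so the growth over time is easy to spot.
while true; do
  echo "$(date -u +%H:%M:%S) $(docker exec bblfshd ps aux | wc -l) processes"
  sleep 30
done
```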
bzz commented 6 years ago

Update: the same happens with just 3 clients instead of 30; it only takes longer, ~30 minutes, to reproduce.

A related issue exists in Engine (src-d/engine#196); this one is just about the zombie processes in the bblfshd container.

bzz commented 6 years ago

Noticed two new errors in the logs that were not posted above:

time="2017-11-16T12:20:44Z" level=error msg="request processed content 3487 bytes, status Fatal" elapsed=43.828518ms language=python
time="2017-11-16T12:24:51Z" level=error msg="error re-scaling pool: container is not destroyed" language=python

Going to post logs with debug enabled.

bzz commented 6 years ago

Here are debug logs with 93 processes inside the container: 93-process-bblfshd.log

juanjux commented 6 years ago

Yes, I can reproduce it, thanks for the steps and the (as always in your case) really awesome bug report, @bzz.

The processes are runc zombie processes that don't use resources, so this should not be a performance problem, but it's certainly not pretty to have all those zombies around until the bblfshd container restarts (the bblfshd PID is the parent of the zombie herd, as you can see with `ps -l`, which shows the PPID). It's odd, because libcontainer merged a PR some time ago that reaped the zombie processes and fixed a similar issue we had. I'll try to update the dependency to the latest version in bblfshd, and if that doesn't fix the problem I'll investigate whether we're doing something wrong in our process management.
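
For reference, one way to see the zombie herd and its parent from inside the container (generic procps usage, nothing bblfshd-specific):

```bash
# List defunct (state Z) processes together with their parent PID; the PPID
# column should point at the bblfshd process, which parents the zombie herd.
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
```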

juanjux commented 6 years ago

Looks like libcontainer from ~master avoids this problem (I've let it run for 20 minutes and there isn't a single defunct process). I'll upload an exported Docker image of this version of bblfshd for you to test, and if you confirm that it works we can close this after the PR. More details on Slack.
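
If the image is shared as a tarball produced with `docker save` (an assumption; the exact delivery may differ), testing it would look roughly like this, with a hypothetical file name and tag:

```bash
# Load the exported image into the local Docker daemon and run it with the
# usual bblfshd flags (privileged mode, gRPC on port 9432).
docker load -i bblfshd-libcontainer-master.tar
docker run -d --name bblfshd-test --privileged -p 9432:9432 bblfshd:libcontainer-master
```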

juanjux commented 6 years ago

After more tests, I still see some defunct driver processes after leaving this test running for a while, but there are about 3-5 after 40 minutes, whereas previously there were hundreds; so while it's not totally fixed, it's a huge improvement.

Considering that the change was just updating the libcontainer dependency, the problem surely lies there.
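
Purely as an illustration of the kind of change involved, and assuming the project pins its Go dependencies with `dep` (an assumption; bblfshd's actual vendoring setup may differ), the update would be along these lines:

```bash
# Hypothetical dependency bump: pull the latest libcontainer (part of the
# runc repository) into the vendor tree, then rebuild and re-test bblfshd.
dep ensure -update github.com/opencontainers/runc
```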

bzz commented 6 years ago

@juanjux It will take a few days for me to get back to this and reproduce it, so we can either re-open it or keep it as is for a while and see.

Sorry, closed by mistake :/

juanjux commented 6 years ago

OK, so if the maintainer @abeaumont agrees, we can merge #138 and close this; feel free to reopen if you find the problem again.

abeaumont commented 6 years ago

Done