Closed seth-shaw-unlv closed 5 years ago
Seems like maybe Drupal is too slow compared to convert and we need to limit the number of messages it tries to process at one time. Perhaps setting a limit on the number of activemq consumers is the way to go.
@seth-shaw-unlv @whikloj I need to verify we're not leaking files first, which is difficult. I've been pretty careful about it and I'd like to say it's not that, but it's the most probable reason for hitting the limit.
Regardless, we'll have to throttle in some way. Due to PHP peculiarities, we're forced to use a temp stream in CmdExecuteService for output, which opens a file once the output exceeds 4kb. So all the derivative generation microservices are likely to hit the open file limit under load. So if everything's on the same box, then... :bomb::bomb::bomb:
As @whikloj suggests, we'll want to make sure messages are not being processed in parallel with ActiveMQ. If we still hit the limit after dropping the number of concurrent consumers to 1, then we'll have to use Camel's throttler on the derivative generation routes. Basically, just throw a .throttle(100) at the beginning of the route.
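For reference, since our routes are defined in blueprint XML rather than the Java DSL, the equivalent of .throttle(100) would look something like this (a sketch; the route id, queue name, and rate are placeholders, not our actual config):

```xml
<route id="IslandoraConnectorHoudini">
  <from uri="activemq:queue:islandora-connector-houdini"/>
  <!-- allow at most 100 in-flight exchanges per second -->
  <throttle timePeriodMillis="1000">
    <constant>100</constant>
  </throttle>
  <!-- ... rest of the derivative generation route ... -->
</route>
```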
To slow down my derivative-toolkit and make it more careful, I set up a pool of ActiveMQ consumers, which limits how many connections we open at a time. I used this block, then left ${concurrent.consumers} as a configurable option so you can increase or decrease the number of consumers.
I'll also note that I discovered that by default each ActiveMQ consumer pre-fetches about 1,000 messages. I had a problem with OpenStack clients losing connection to push work back, so I limited the pre-fetch to 1 so if it failed I lost 1 message instead of 1,000.
Blueprint config: https://github.com/whikloj/islandora-1x-derivative-toolkit/blob/master/islandora-1x-derivative-worker/src/main/resources/OSGI-INF/blueprint/blueprint.xml#L17
Add the parameter to the route: https://github.com/whikloj/islandora-1x-derivative-toolkit/blob/master/islandora-1x-derivative-worker/src/main/resources/OSGI-INF/blueprint/blueprint.xml#L17
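For anyone skimming later, the two knobs described above boil down to options on the consumer endpoint URI. This is a sketch, assuming Camel's activemq component; the property name and queue name are placeholders, not the toolkit's actual values:

```xml
<!-- a blueprint default property, overridable at deploy time
     (goes inside <cm:default-properties> in the property-placeholder) -->
<cm:property name="concurrent.consumers" value="1"/>

<!-- consume with a bounded pool, and a prefetch of 1 so a dropped
     consumer loses at most one message instead of ~1,000 -->
<from uri="activemq:queue:derivatives?concurrentConsumers={{concurrent.consumers}}&amp;destination.consumer.prefetchSize=1"/>
```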
@whikloj Yeah, I've been bitten by ActiveMQ's prefetch before, too. Thanks for bringing that up, and for the examples. It's pretty straightforward from those.
So, I did some more looking around at where others have run into this type of error. The near-universal solution was to increase the system's maximum number of open file descriptors. There are several blog posts that talk about this, although the Nuxeo one is probably the simplest and not cluttered by ads.
In short, edit the limits file with sudo vi /etc/security/limits.conf and add
* soft nofile 4096
* hard nofile 8192
to the bottom. The stars can be replaced with specific users whose limits you want to change (and karaf seems to be the biggest offender), but nobody explained why you wouldn't want to apply it to everyone.

I made the stated changes to our CentOS 7 box and kicked off the media migrations again. I was able to complete the media migration without any Camel errors. The ActiveMQ islandora-connector-houdini queue shows that houdini is still chugging happily along, progressively generating service images, and thumbnail messages are being added to the queue without problems.
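One gotcha worth noting: limits.conf is applied at login, so you need a fresh session (or a service restart) before the new numbers take effect. A quick way to verify, assuming a Linux box with /proc:

```shell
# Soft and hard open-file limits for the current session
ulimit -Sn
ulimit -Hn

# Limits actually applied to an already-running process; /proc/self is
# shown here, but substitute karaf's PID (e.g. from `pgrep -f karaf`)
grep 'Max open files' /proc/self/limits
```

If the running service still shows the old limit, it was started before the change and needs a restart.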
Basically, what it boils down to is having too many services trying to run on the same machine, each with its own open files and socket connections. It just gets too crowded with the default system settings. So, either allow users more open file descriptors or subdivide your setup across multiple boxes.
I think we can probably close this issue with the workaround. The prefetch and pooling are probably a separate ticket. Thoughts on that @dannylamb and @whikloj ?
@seth-shaw-unlv That's great news. Pro-actively we could bake that into claw-playbook or maybe the karaf role directly since it's the biggest offender? I imagine even if you sub-divide your setup across multiple boxes, you'll still wanna max out those numbers. Otherwise, I'd consider this a documentation issue, I guess. Although I'm not sure where the best place to put it would be.
Prefetch and pooling will definitely come into play later if we ever reach the new limits, so a separate ticket is a good idea.
After running into this again, I have some additional notes:
Using the * in limits.conf will not apply to root. To adjust root, give it its own lines:
root soft nofile 900000
root hard nofile 900000
Also, karaf has been using more than 500k open "files" at a time. I didn't realize this at first because I was running lsof as an unprivileged user which showed karaf had < 200 open files, but running the command as root revealed karaf using ~512k open files. I'm hoping my new 900k limit will avoid any more problems.
Nope. Still hitting the wall. It looks like we have a problem with connections not closing. In a recent test I started seeing the errors again so I checked lsof. Karaf had 518,906 file descriptors open. 460,320 of them (88.7%) were TCP connections with the status of CLOSE_WAIT. That means that karaf (or the code it is executing) is holding onto connections a lot longer than it should. Which connections (houdini, Fedora, Drupal...) is unknown.
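At half a million descriptors, lsof gets painfully slow; a faster way to keep an eye on the count, and to spot a CLOSE_WAIT pile-up, is to read /proc and the socket tables directly. A sketch (the karaf PID lookup is an assumption about how the process is named; the shell's own PID is used below so the commands run anywhere):

```shell
# Count open fds for a process straight from /proc (no lsof needed).
# In practice: pid=$(pgrep -f karaf-wrapper | head -1)
pid=$$
echo "open fds for $pid: $(ls "/proc/$pid/fd" | wc -l)"

# Tally TCP socket states system-wide; a huge CLOSE-WAIT count means
# the remote end hung up but we never closed our side
ss -tan 2>/dev/null | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
```

Note that, as mentioned above, an unprivileged user can only see its own processes' fds, so run this as root when inspecting karaf.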
@seth-shaw-unlv It's hard to tell with Karaf because Karaf connects to everything. My suspicion is the CmdExecuteService in Crayfish Commons. That'd make Apache hold on to the files too long, but I'm not sure how to confirm. lsof wants a PID and I'm not entirely sure how to get one for Apache.
vagrant@claw:~$ ps ax | grep apache
1092 ? Sl 0:00 /opt/apache-karaf-4.0.8/bin/karaf-wrapper /opt/karaf/etc/karaf-wrapper.conf wrapper.syslog.ident=karaf wrapper.pidfile=/opt/apache-karaf-4.0.8/data/karaf.pid wrapper.daemonize=TRUE wrapper.lockfile=/var/lock/subsys/karaf
1164 ? Sl 0:28 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dkaraf.home=/opt/apache-karaf-4.0.8 -Dkaraf.base=/opt/apache-karaf-4.0.8 -Dkaraf.data=/opt/apache-karaf-4.0.8/data -Dkaraf.etc=/opt/karaf/etc -Dcom.sun.management.jmxremote -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -Djava.endorsed.dirs=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/endorsed:/usr/lib/jvm/java-8-openjdk-amd64/lib/endorsed:/opt/apache-karaf-4.0.8/lib/endorsed -Djava.ext.dirs=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext:/usr/lib/jvm/java-8-openjdk-amd64/lib/ext:/opt/apache-karaf-4.0.8/lib/ext -Xmx512m -Djava.library.path=/opt/apache-karaf-4.0.8/lib/wrapper/ -classpath /opt/apache-karaf-4.0.8/lib/boot/org.apache.karaf.main-4.0.8.jar:/opt/apache-karaf-4.0.8/lib/boot/org.apache.karaf.jaas.boot-4.0.8.jar:/opt/apache-karaf-4.0.8/lib/boot/org.apache.karaf.diagnostic.boot-4.0.8.jar:/opt/apache-karaf-4.0.8/lib/boot/org.osgi.core-6.0.0.jar:/opt/apache-karaf-4.0.8/lib/wrapper/karaf-wrapper.jar:/opt/apache-karaf-4.0.8/lib/wrapper/karaf-wrapper-main.jar -Dwrapper.key=KnCm_3MUMtEPLNqU -Dwrapper.port=32000 -Dwrapper.jvm.port.min=31000 -Dwrapper.jvm.port.max=31999 -Dwrapper.pid=1092 -Dwrapper.version=3.2.3 -Dwrapper.native_library=wrapper -Dwrapper.service=TRUE -Dwrapper.cpu.timeout=10 -Dwrapper.jvmid=1 org.apache.karaf.wrapper.internal.service.Main
1476 ? Sl 1:08 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.util.logging.config.file=/var/lib/tomcat8/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -Xms512m -Xmx1024m -XX:NewSize=256m -XX:MaxNewSize=256m -XX:PermSize=256m -XX:MaxPermSize=256m -Dfcrepo.home=/opt/fcrepo -Dfcrepo.modeshape.configuration=file:///opt/fcrepo/configs/repository.json -Dfcrepo.activemq.configuration=file:///opt/fcrepo/configs/activemq.xml -Dfcrepo.spring.configuration=file:///opt/fcrepo/configs/fcrepo-config.xml -Dcom.bigdata.rdf.sail.webapp.ConfigParams.propertyFile=/opt/blazegraph/conf/RWStore.properties -Dcantaloupe.config=/opt/cantaloupe/cantaloupe.properties -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true -Djava.endorsed.dirs=/usr/share/tomcat8/endorsed -classpath /usr/share/tomcat8/bin/bootstrap.jar:/usr/share/tomcat8/bin/tomcat-juli.jar -Dcatalina.base=/var/lib/tomcat8 -Dcatalina.home=/usr/share/tomcat8 -Djava.io.tmpdir=/tmp/tomcat8-tomcat8-tmp org.apache.catalina.startup.Bootstrap start
1510 ? Ss 0:00 /usr/sbin/apache2 -k start
1611 ? S 0:00 /usr/sbin/apache2 -k start
1612 ? S 0:00 /usr/sbin/apache2 -k start
1615 ? S 0:00 /usr/sbin/apache2 -k start
1616 ? S 0:00 /usr/sbin/apache2 -k start
1617 ? S 0:00 /usr/sbin/apache2 -k start
2350 pts/0 S+ 0:00 grep --color=auto apache
That seems like a lot of "start"s for /usr/sbin/apache2.
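For what it's worth, the multiple -k start lines are just Apache's worker children, which is normal. To feed lsof a PID (or all of them), pgrep is the easy route; a sketch, assuming the process is named apache2 (on CentOS it would be httpd):

```shell
# Find apache PIDs and check each one's open descriptor count via /proc
for pid in $(pgrep apache2); do
  echo "$pid: $(ls "/proc/$pid/fd" 2>/dev/null | wc -l) open fds"
done

# Or hand the whole set to lsof at once:
#   sudo lsof -p "$(pgrep apache2 | paste -sd, -)"
```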
Trying to create derivatives for a new collection of 1734 images is failing. Camel is reporting a SocketException with too many open files:
The houdini log doesn't show any problems, other than abruptly stopping a while ago (presumably because no more calls were being issued).
It looks like we aren't closing a connection OR we need to rate-limit requests somehow.
Unfortunately, I am heading out of the office now and won't be back until 2018-11-26. I'll try to debug some more then.