Islandora / documentation

Contains islandora's documentation and main issue queue.

Houdini "java.net.SocketException: Too many open files" #978

Closed: seth-shaw-unlv closed this issue 5 years ago

seth-shaw-unlv commented 5 years ago

Trying to create derivatives for a new collection of 1734 images is failing. Camel is reporting a SocketException with too many open files:

2018-11-15 15:40:22,854 | DEBUG | nnector-houdini] | DefaultErrorHandler              | 84 - org.apache.camel.camel-core - 2.19.2 | Failed delivery for (MessageId: ID-dams-37173-1542151183356-7-22070 on ExchangeId: ID-dams-37173-1542151183356-7-22060). On delivery attempt: 10 caught: java.net.SocketException: Too many open files
2018-11-15 15:40:22,855 | ERROR | nnector-houdini] | DefaultErrorHandler              | 84 - org.apache.camel.camel-core - 2.19.2 | Failed delivery for (MessageId: ID-dams-37173-1542151183356-7-22070 on ExchangeId: ID-dams-37173-1542151183356-7-22060). Exhausted after delivery attempt: 11 caught: java.net.SocketException: Too many open files. Processed by failure processor: FatalFallbackErrorHandler[Channel[Log(ca.islandora.alpaca.connector.houdini.HoudiniConnector)[Error connecting generating derivative with Houdini: ${exception.message}

${exception.stacktrace}]]]

Message History
---------------------------------------------------------------------------------------------------------------------------------------
RouteId              ProcessorId          Processor                                                                        Elapsed (ms)
[IslandoraConnector] [IslandoraConnector] [activemq://queue:islandora-connector-houdini                                  ] [     16975]
[IslandoraConnector] [unmarshal6        ] [unmarshal[org.apache.camel.model.dataformat.JsonDataFormat@28a80bd5]          ] [         1]
[IslandoraConnector] [setProperty15     ] [setProperty[event]                                                            ] [         0]
[IslandoraConnector] [removeHeaders10   ] [removeHeaders[*]                                                              ] [         0]
[IslandoraConnector] [setHeader35       ] [setHeader[CamelHttpMethod]                                                    ] [         0]
[IslandoraConnector] [setHeader36       ] [setHeader[Accept]                                                             ] [         0]
[IslandoraConnector] [setHeader37       ] [setHeader[X-Islandora-Args]                                                   ] [         0]
[IslandoraConnector] [setHeader38       ] [setHeader[Apix-Ldp-Resource]                                                  ] [         0]
[IslandoraConnector] [setBody12         ] [setBody[simple{Simple: ${null}}]                                              ] [         0]
[IslandoraConnector] [to42              ] [{{houdini.convert.url}}                                                       ] [      6946]
[IslandoraConnector] [removeHeaders11   ] [removeHeaders[*]                                                              ] [         0]
[IslandoraConnector] [setHeader39       ] [setHeader[Content-Location]                                                   ] [         0]
[IslandoraConnector] [setHeader40       ] [setHeader[CamelHttpMethod]                                                    ] [         1]
[IslandoraConnector] [toD7              ] [                                                                              ] [     10027]
[IslandoraConnector] [log13             ] [log                                                                           ] [         1]

Stacktrace
---------------------------------------------------------------------------------------------------------------------------------------
java.net.SocketException: Too many open files
    at java.net.Socket.createImpl(Socket.java:460)[:1.8.0_191]
    at java.net.Socket.getImpl(Socket.java:520)[:1.8.0_191]
    at java.net.Socket.setSoTimeout(Socket.java:1141)[:1.8.0_191]
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:120)[158:org.apache.httpcomponents.httpclient:4.5.3]
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)[158:org.apache.httpcomponents.httpclient:4.5.3]
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)[158:org.apache.httpcomponents.httpclient:4.5.3]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)[158:org.apache.httpcomponents.httpclient:4.5.3]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)[158:org.apache.httpcomponents.httpclient:4.5.3]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)[158:org.apache.httpcomponents.httpclient:4.5.3]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)[158:org.apache.httpcomponents.httpclient:4.5.3]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)[158:org.apache.httpcomponents.httpclient:4.5.3]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)[158:org.apache.httpcomponents.httpclient:4.5.3]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)[158:org.apache.httpcomponents.httpclient:4.5.3]
    at org.apache.camel.component.http4.HttpProducer.executeMethod(HttpProducer.java:334)[86:org.apache.camel.camel-http4:2.19.2]
    at org.apache.camel.component.http4.HttpProducer.process(HttpProducer.java:193)[86:org.apache.camel.camel-http4:2.19.2]
    at org.apache.camel.util.AsyncProcessorConverterHelper$ProcessorToAsyncProcessorBridge.process(AsyncProcessorConverterHelper.java:61)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.processor.SendDynamicProcessor$1.doInAsyncProducer(SendDynamicProcessor.java:124)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.impl.ProducerCache.doInAsyncProducer(ProducerCache.java:436)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.processor.SendDynamicProcessor.process(SendDynamicProcessor.java:119)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:541)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:198)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.processor.Pipeline.process(Pipeline.java:120)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.processor.Pipeline.process(Pipeline.java:83)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:198)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.processor.DelegateAsyncProcessor.process(DelegateAsyncProcessor.java:97)[84:org.apache.camel.camel-core:2.19.2]
    at org.apache.camel.component.jms.EndpointMessageListener.onMessage(EndpointMessageListener.java:112)[90:org.apache.camel.camel-jms:2.19.2]
    at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:721)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
    at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:681)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
    at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:651)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
    at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:313)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
    at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:251)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
    at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1164)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
    at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1156)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
    at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1053)[117:org.apache.servicemix.bundles.spring-jms:4.2.8.RELEASE_1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748)[:1.8.0_191]
2018-11-15 15:40:22,855 | DEBUG | nnector-houdini] | Pipeline                         | 84 - org.apache.camel.camel-core - 2.19.2 | Message exchange has failed: so breaking out of pipeline for exchange: Exchange[ID-dams-37173-1542151183356-7-22060] Exception: java.net.SocketException: Too many open files
2018-11-15 15:40:22,855 | WARN  | nnector-houdini] | EndpointMessageListener          | 84 - org.apache.camel.camel-core - 2.19.2 | Execution of JMS message listener failed. Caused by: [org.apache.camel.RuntimeCamelException - java.net.SocketException: Too many open files]
org.apache.camel.RuntimeCamelException: java.net.SocketException: Too many open files
    at org.apache.camel.util.ObjectHelper.wrapRuntimeCamelException(ObjectHelper.java:1774)[84:org.apache.camel.camel-core:2.19.2]
... trimmed

The houdini log doesn't show any problems, other than that it abruptly stopped a while ago (presumably because no more calls were being issued).

It looks like either we aren't closing connections somewhere or we need to rate-limit requests somehow.

Unfortunately, I am heading out of the office now and won't be back until 2018-11-26. I'll try to debug some more then.

whikloj commented 5 years ago

Seems like Drupal may just be too slow compared to convert, so we need to limit the number of messages it tries to process at one time. Perhaps setting a limit on the number of ActiveMQ consumers is the way to go.

dannylamb commented 5 years ago

@seth-shaw-unlv @whikloj I need to verify we're not leaking files first, which is difficult. I've been pretty careful about it and I'd like to say it's not that, but it's the most probable reason for hitting the limit.

Regardless, we'll have to throttle in some way. Due to PHP peculiarities, we're forced to use a temp stream in CmdExecuteService for output, which opens a file if the output is > 4kb. That means all of the derivative generation microservices are likely to hit the open file limit under load. So if everything's on the same box, then... :bomb::bomb::bomb:

As @whikloj suggests, we'll want to make sure messages are not being processed in parallel by ActiveMQ. If we still hit the limit after dropping the number of concurrent consumers to 1, then we'll have to use Camel's throttler on the derivative generation routes. Basically, just throw a .throttle(100) at the beginning of the route.
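
For illustration, a minimal Java DSL sketch of that combination. The endpoint name, consumer count, and throttle rate here are illustrative, not the actual Alpaca route definition.

import org.apache.camel.builder.RouteBuilder;

// Sketch only: endpoint, consumer count, and throttle rate are illustrative,
// not the actual Alpaca route.
public class ThrottledHoudiniRoute extends RouteBuilder {
    @Override
    public void configure() {
        // A single JMS consumer keeps messages from being processed in parallel...
        from("activemq:queue:islandora-connector-houdini?concurrentConsumers=1")
            // ...and the throttler caps throughput at 100 exchanges per time period
            // (default 1000 ms) if a single consumer still isn't enough.
            .throttle(100)
            .to("{{houdini.convert.url}}");
    }
}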

whikloj commented 5 years ago

To slow down my derivative-toolkit and make it more careful, I set up a pool of ActiveMQ consumers, which limits how many we open at a time.

I used this block (the blueprint config linked in my next comment), leaving ${concurrent.consumers} as a configurable option so you can increase or decrease the number of consumers.
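
Roughly the same idea expressed in Camel's Java DSL would look something like the sketch below; the endpoint and property name are placeholders, not the toolkit's actual configuration.

import org.apache.camel.builder.RouteBuilder;

// Rough Java DSL equivalent of the pooled-consumer idea; the endpoint and the
// property name are placeholders, not the toolkit's actual configuration.
public class PooledConsumerRoute extends RouteBuilder {
    @Override
    public void configure() {
        // {{concurrent.consumers}} is resolved from configuration, so the size of the
        // consumer pool can be tuned per deployment without touching the route.
        from("activemq:queue:islandora-connector-houdini?concurrentConsumers={{concurrent.consumers}}")
            .to("{{houdini.convert.url}}");
    }
}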

whikloj commented 5 years ago

I'll also note that I discovered that, by default, each ActiveMQ consumer pre-fetches about 1,000 messages. I had a problem with OpenStack clients losing the connection needed to push work back, so I limited the pre-fetch to 1 so that if one failed I lost 1 message instead of 1,000.

Blueprint config: https://github.com/whikloj/islandora-1x-derivative-toolkit/blob/master/islandora-1x-derivative-worker/src/main/resources/OSGI-INF/blueprint/blueprint.xml#L17
Add the parameter to the route: https://github.com/whikloj/islandora-1x-derivative-toolkit/blob/master/islandora-1x-derivative-worker/src/main/resources/OSGI-INF/blueprint/blueprint.xml#L17
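
For anyone who wants to set this in code rather than in the blueprint, here's a rough sketch using ActiveMQ's prefetch policy on the connection factory. The broker URL is just a placeholder; per-destination options, as in the blueprint above, work too.

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.ActiveMQPrefetchPolicy;

// Sketch: cap prefetch at one message per consumer via the connection factory.
// The broker URL is illustrative; the same limit can also be set per destination.
public class PrefetchOneFactory {
    public static ActiveMQConnectionFactory create() {
        ActiveMQPrefetchPolicy prefetch = new ActiveMQPrefetchPolicy();
        prefetch.setQueuePrefetch(1); // the default queue prefetch is 1000
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        factory.setPrefetchPolicy(prefetch);
        return factory;
    }
}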

dannylamb commented 5 years ago

@whikloj Yeah, I've been bitten by ActiveMQ's prefetch before, too. Thanks for bringing that up and for the examples. It's pretty straightforward from there.

seth-shaw-unlv commented 5 years ago

So, I did some more looking around at how others have run into this type of error. The near-universal solution was to increase the system's maximum number of open file descriptors. There are several blog posts that talk about this; the Nuxeo one is probably the simplest and isn't cluttered by ads.

In short,

  1. sudo vi /etc/security/limits.conf
  2. Add * soft nofile 4096 and * hard nofile 8192 to the bottom (see the excerpt below). The stars can be replaced with specific users whose limits you want to change (karaf seems to be the biggest offender), but nobody explained why you wouldn't want to apply it to everyone.
  3. Restart the system so that all users will have the new limit applied.
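
In other words, the block appended to the bottom of /etc/security/limits.conf is just:

* soft nofile 4096
* hard nofile 8192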

I made the stated changes to our CentOS 7 box and kicked off the media migrations again. I was able to complete the media migration without any Camel errors. The ActiveMQ islandora-connector-houdini queue shows that houdini is still chugging happily along, progressively generating service images, and thumbnail messages are being added to the queue without problems.

seth-shaw-unlv commented 5 years ago

Basically, what it boils down to is having too many services trying to run on the same machine, each with their own open files and socket connections. It just gets too crowded with the default system settings. So either allow users more open files or subdivide your setup across multiple boxes.

I think we can probably close this issue with the workaround. The prefetch and pooling are probably a separate ticket. Thoughts on that @dannylamb and @whikloj ?

dannylamb commented 5 years ago

@seth-shaw-unlv That's great news. Proactively, we could bake that into claw-playbook, or maybe into the karaf role directly since it's the biggest offender? I imagine even if you subdivide your setup across multiple boxes, you'll still wanna max out those numbers. Otherwise, I'd consider this a documentation issue, I guess, although I'm not sure where the best place to put it would be.

Prefetch and pooling will definitely come into play later if we ever reach the new limits, so a separate ticket is a good idea.

seth-shaw-unlv commented 5 years ago

After running into this again, I have some additional notes:

Using the * in limits.conf will not apply to root. To adjust root, give it its own lines:

root soft nofile 900000
root hard nofile 900000

Also, karaf has been using more than 500k open "files" at a time. I didn't realize this at first because I was running lsof as an unprivileged user which showed karaf had < 200 open files, but running the command as root revealed karaf using ~512k open files. I'm hoping my new 900k limit will avoid any more problems.

seth-shaw-unlv commented 5 years ago

Nope. Still hitting the wall. It looks like we have a problem with connections not closing. In a recent test I started seeing the errors again, so I checked lsof. Karaf had 518,906 file descriptors open; 460,320 of them (88.7%) were TCP connections in the CLOSE_WAIT state. That means that karaf (or the code it is executing) is holding onto connections a lot longer than it should. Which connections these are (houdini, Fedora, Drupal...) is unknown.
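
For context, sockets stuck in CLOSE_WAIT generally mean our side never closed the connection after the remote end hung up. With the HttpClient 4.x that appears in the stack trace above, the non-leaking pattern looks roughly like the sketch below. This is only an illustration of the pattern, not a claim about where the actual leak is.

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Sketch of the non-leaking HttpClient 4.x pattern: reuse one client, and for every
// request consume the entity and close the response so the pooled connection is
// released instead of piling up in CLOSE_WAIT.
public class SafeHttpCall {
    private static final CloseableHttpClient CLIENT = HttpClients.createDefault();

    public static String fetch(String url) throws Exception {
        try (CloseableHttpResponse response = CLIENT.execute(new HttpGet(url))) {
            return EntityUtils.toString(response.getEntity()); // consumes the entity
        }
    }
}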

dannylamb commented 5 years ago

@seth-shaw-unlv It's hard to tell from Karaf alone, because Karaf connects to everything. My suspicion is the CmdExecuteService in Crayfish Commons. That'd make Apache hold on to the files too long, but I'm not sure how to confirm it.

lsof wants a PID and I'm not entirely sure how to get one for Apache.

vagrant@claw:~$ ps ax | grep apache
 1092 ?        Sl     0:00 /opt/apache-karaf-4.0.8/bin/karaf-wrapper /opt/karaf/etc/karaf-wrapper.conf wrapper.syslog.ident=karaf wrapper.pidfile=/opt/apache-karaf-4.0.8/data/karaf.pid wrapper.daemonize=TRUE wrapper.lockfile=/var/lock/subsys/karaf
 1164 ?        Sl     0:28 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dkaraf.home=/opt/apache-karaf-4.0.8 -Dkaraf.base=/opt/apache-karaf-4.0.8 -Dkaraf.data=/opt/apache-karaf-4.0.8/data -Dkaraf.etc=/opt/karaf/etc -Dcom.sun.management.jmxremote -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -Djava.endorsed.dirs=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/endorsed:/usr/lib/jvm/java-8-openjdk-amd64/lib/endorsed:/opt/apache-karaf-4.0.8/lib/endorsed -Djava.ext.dirs=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext:/usr/lib/jvm/java-8-openjdk-amd64/lib/ext:/opt/apache-karaf-4.0.8/lib/ext -Xmx512m -Djava.library.path=/opt/apache-karaf-4.0.8/lib/wrapper/ -classpath /opt/apache-karaf-4.0.8/lib/boot/org.apache.karaf.main-4.0.8.jar:/opt/apache-karaf-4.0.8/lib/boot/org.apache.karaf.jaas.boot-4.0.8.jar:/opt/apache-karaf-4.0.8/lib/boot/org.apache.karaf.diagnostic.boot-4.0.8.jar:/opt/apache-karaf-4.0.8/lib/boot/org.osgi.core-6.0.0.jar:/opt/apache-karaf-4.0.8/lib/wrapper/karaf-wrapper.jar:/opt/apache-karaf-4.0.8/lib/wrapper/karaf-wrapper-main.jar -Dwrapper.key=KnCm_3MUMtEPLNqU -Dwrapper.port=32000 -Dwrapper.jvm.port.min=31000 -Dwrapper.jvm.port.max=31999 -Dwrapper.pid=1092 -Dwrapper.version=3.2.3 -Dwrapper.native_library=wrapper -Dwrapper.service=TRUE -Dwrapper.cpu.timeout=10 -Dwrapper.jvmid=1 org.apache.karaf.wrapper.internal.service.Main
 1476 ?        Sl     1:08 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.util.logging.config.file=/var/lib/tomcat8/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -Xms512m -Xmx1024m -XX:NewSize=256m -XX:MaxNewSize=256m -XX:PermSize=256m -XX:MaxPermSize=256m -Dfcrepo.home=/opt/fcrepo -Dfcrepo.modeshape.configuration=file:///opt/fcrepo/configs/repository.json -Dfcrepo.activemq.configuration=file:///opt/fcrepo/configs/activemq.xml -Dfcrepo.spring.configuration=file:///opt/fcrepo/configs/fcrepo-config.xml -Dcom.bigdata.rdf.sail.webapp.ConfigParams.propertyFile=/opt/blazegraph/conf/RWStore.properties -Dcantaloupe.config=/opt/cantaloupe/cantaloupe.properties -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true -Djava.endorsed.dirs=/usr/share/tomcat8/endorsed -classpath /usr/share/tomcat8/bin/bootstrap.jar:/usr/share/tomcat8/bin/tomcat-juli.jar -Dcatalina.base=/var/lib/tomcat8 -Dcatalina.home=/usr/share/tomcat8 -Djava.io.tmpdir=/tmp/tomcat8-tomcat8-tmp org.apache.catalina.startup.Bootstrap start
 1510 ?        Ss     0:00 /usr/sbin/apache2 -k start
 1611 ?        S      0:00 /usr/sbin/apache2 -k start
 1612 ?        S      0:00 /usr/sbin/apache2 -k start
 1615 ?        S      0:00 /usr/sbin/apache2 -k start
 1616 ?        S      0:00 /usr/sbin/apache2 -k start
 1617 ?        S      0:00 /usr/sbin/apache2 -k start
 2350 pts/0    S+     0:00 grep --color=auto apache

That seems like a lot of starts for /usr/sbin/apache2.

dannylamb commented 5 years ago

Resolved via https://github.com/Islandora-CLAW/Alpaca/commit/67eac07e1c4064e942c45d4d60f5138ed0fa4aa9