freme-project / FREMECommon

FREME Common Codes
Apache License 2.0

Too many open files #22

Open jnehring opened 8 years ago

jnehring commented 8 years ago

FREME live becomes unstable after running for a while with this error ("Zu viele offene Dateien" is German for "too many open files"):

ERROR   2016-04-26 08:42:24,909 [http-nio-8004-Acceptor-0] org.apache.tomcat.util.net.NioEndpoint  - Socket accept failed
java.io.IOException: Zu viele offene Dateien
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:686)
        at java.lang.Thread.run(Thread.java:745)

It happened two weeks ago and now it has happened again. The problem can be worked around by restarting freme-live.

I backed up the log files (around 500 MB of error logs) in ~/logs-open-files on freme-live. I guess the problem is an open file handle or stream that is not closed properly. Once this has happened about 20,000 times, freme-live fails. It is probably caused by the WRIPL requests, because no one else sends so many requests to freme-live in two weeks.

jnehring commented 8 years ago

The bug happened again. We should fix it soon.

jnehring commented 8 years ago

I can test the issue by starting FREME on Ubuntu and using this command to count the number of open file handles:

lsof -a -p 3919 | wc -l

where 3919 is the process id of freme.
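On Linux the same count can also be read straight from /proc, which avoids lsof's overhead when polling repeatedly (a sketch; $$ is just a stand-in for the broker's PID):

```shell
# Count a process's open file descriptors via /proc (Linux only).
# $$ (this shell's own PID) is a placeholder; use the FREME PID instead.
pid=$$
ls /proc/"$pid"/fd | wc -l
```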

The number of file handles increases with each API call that uses e-Internationalisation. I found and fixed two streams that were not closed properly, reducing the number of streams newly opened per API call from 3 to 1.

Next step: Find the last stream that is not closed properly.
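The usual way to make sure such streams are always closed is Java 7's try-with-resources, which closes the resource even when an exception is thrown mid-read. A minimal self-contained sketch (the file name is a placeholder):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class StreamCloseDemo {
    // Reads the first line of a file; the reader (and the underlying file
    // handle) is closed automatically when the try block exits, whether
    // normally or via an exception.
    static String firstLine(File f) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8))) {
            return in.readLine();
        } // in.close() runs here in all cases, releasing the file handle
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".txt");
        try (Writer w = new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8)) {
            w.write("hello\n");
        }
        System.out.println(firstLine(f)); // prints "hello"
        f.delete();
    }
}
```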

ArneBinder commented 8 years ago

This should be fixed. @jnehring, could you check this on freme-dev? This does not solve https://github.com/freme-project/basic-services/issues/32.

jnehring commented 8 years ago

I tested it. The problem is fixed. Great!

jnehring commented 8 years ago

The bug happened again today.

jnehring commented 8 years ago

The bug happened again and I could solve it again by restarting the server. I could not find the reason for the problem. I cannot even reproduce it locally. Next step to find the bug: I want to run the broker on my local VM and process a larger dataset (e.g. 1000 documents) with it using the same pipeline that WRIPL uses.

jnehring commented 8 years ago

I analysed the content going to FREME. When putting the following directives in /etc/modsecurity/modsecurity.conf, all content sent to FREME is logged in /var/log/apache2/modsec_audit.log:

SecRuleEngine On
SecAuditEngine On
SecAuditLog /var/log/apache2/modsec_audit.log
SecRequestBodyAccess on
SecAuditLogParts ABIJDFHZ

Through this logging I found out that some special characters lead to crashes in e-Link. Processing a pipeline with the following content leaves 10 open file handles. It was executed on a FREME test installation on freme-live.

[{"method":"POST","endpoint":"http:\/\/api.freme-project.eu:5000\/e-entity\/freme-ner\/documents","parameters":{"language":"en","dataset":"dbpedia"},"headers":{"content-type":"text\/plain","accept":"text\/turtle"},"body":"close Now Pinned to My Dublin Map.Register Now to save and share your pins. Explore 0 Login \/ Register Search OFF 0 close See & Do Insider Tips My Dublin Map What's On Login \/ Register About Us Contact Us SEARCH See & Do Free Events What's On Casement Evening Lecture Series: In Conversation: Chris Clarke with Alan Phelan Free Events Inspired by the notion behind Rising leader Roger Casements legendary words, Our Kind, a short film created by DCU graduate and Irish artist Alan Phelan, takes creative liberty with the familiar narrative and political history of the Rising, imagining a world in which Casement had instead escaped to exile in Norway. Join curator and art critic Chris Clarke for this fascinating conversation with Phelan, which will seek to uncover how the script was approached and developed, and the historical context from which he drew. Bring an open mind and prepare to discover what lessons we can learn from an alternate take on history. 
Admission Free Event Dates Pin To Map address Charlemont House, Parnell Square North Phone +353(0)12225550 Web hughlane.ie Email info.hughlane@dublincity.ie LOCATION SEE WHAT'S NEARBY Special Offers Shopping Tours Visitor Attractions Activities Accommodation Eat, Drink, Nightlife What's On Close Filters Apply Filters EVENT DATES Thu, 1 September 2016 18:30 - 19:30 Events Seasonal Bram Stoker Festival 28 Oct 2016 - 31 Oct 2016 Free Events Simon Fujiwara: The Humanizer 20 May 2016 - 28 Aug 2016 Seasonal Historic Dn Laoghaire 14 Jul 2016 - 01 Sep 2016 Free Events Angela Morgan 01 Sep 2016 - 29 Sep 2016 Free Events Breaking Rainbows: Orla Barry 29 Sep 2016 - 05 Nov 2016 Insider Tips READ MORE More Than A Stage: Olympia Theatre READ MORE Hooked on Howth READ MORE Insider Tips from Dublin's Top Concierges READ MORE 20 Reasons to #LoveDublin READ MORE More Than A Stage: Olympia Theatre What's On My Dublin Map Insider Tips See & Do 2016 Visit Dublin About Us Terms & Conditions Privacy & Cookies Dublin Pass Sightseeing Card Dublin Convention Bureau Contact Us Newsletter Site Map"},{"method":"POST","endpoint":"http:\/\/api.freme-project.eu:5000\/e-link\/documents\/","parameters":{"templateid":"1"},"headers":{"content-type":"text\/turtle","accept":"application\/json+ld"}}]

Next steps:

ghsnd commented 8 years ago

Maybe you can check whether these are connections (sockets), and what the connections are, with netstat or ss?
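If netstat or ss are not at hand, the fd targets under /proc also distinguish sockets from pipes (Linux only; $$ is a placeholder for the FREME PID):

```shell
# Break down a process's open fds by type: each entry in /proc/<pid>/fd is
# a symlink whose target looks like "socket:[1234]", "pipe:[5678]" or a path.
pid=$$
for t in socket pipe; do
  printf '%s fds: %s\n' "$t" "$(ls -l /proc/"$pid"/fd | grep -c "$t:")"
done
```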

jnehring commented 8 years ago

The problem does not occur any more on freme-live. We have an Apache webserver on freme-live that proxies the requests to our API. This proxy was not configured properly: the module mod_xml2enc was not installed, so special characters got mangled. These mangled characters caused the open streams. I installed mod_xml2enc and now the number of open streams is constant.

I leave this issue open because there is still a problem in FREME which we should find. But hopefully the broker does not crash all the time now.

jnehring commented 8 years ago

The bug still occurs.

x-fran commented 8 years ago

Which OS is your server running? Who is the owner/group of the files of the FREME NER application? Please run "ulimit -a" as root on the command line and send me the output.

ghsnd commented 8 years ago

And can you dump the heap of the Java process(es) when there are many open files, like so:

jmap -dump:format=b,file=<filename>.bin <pid>

where <pid> is the id of the java process.

Then, opening this file (or files) in visualvm or jvisualvm (part of the JDK) might allow you to see instances of Streams, Files, Sockets, or whatever causes the problem, and where they are initialised.

So can you do this, compress the dump(s) (xz works well here) and send it/them to me or put them temporarily on Google Drive?

jnehring commented 8 years ago

@x-fran I sent you the data via email.

x-fran commented 8 years ago

@jnehring answered via email. Please let me know if any issues or improvements.

ghsnd commented 8 years ago

It seems that there are 98 com.hp.hpl.jena.sparql.engine.http.HttpQuery instances. These contain SPARQL queries posed to a service on port 8890/sparql. Each instance initialises an org.apache.http.impl.client.DecompressingHttpClient. Each such instance initialises an org.apache.http.impl.conn.PoolingClientConnectionManager. Each pool can hold a number of connections (10, I think). In any case, this results in more than 800 connections that are done (EOF) but not closed.

If this causes the problem (not sure!), then my guess is that somewhere in e-Link a Jena model, query or iterator is not closed properly.
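A typical source of such leaks in Jena code is a query execution that is never closed: each execution holds its HTTP client (and thus sockets) until close() is called. A self-contained sketch of the leak and the fix; the FakeQueryExecution class below only mimics Jena's com.hp.hpl.jena.query.QueryExecution and is not the real API:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class QueryExecClose {
    // Stand-in for Jena's QueryExecution; the real class must also be
    // closed after use to release its underlying HTTP connections.
    static class FakeQueryExecution implements AutoCloseable {
        static final AtomicInteger openConnections = new AtomicInteger();
        FakeQueryExecution() { openConnections.incrementAndGet(); }
        String execConstruct() { return "model"; } // pretend CONSTRUCT result
        @Override public void close() { openConnections.decrementAndGet(); }
    }

    // Leaky variant: the execution is never closed, so the underlying
    // connection stays open -> eventually "too many open files".
    static String leaky() {
        FakeQueryExecution qe = new FakeQueryExecution();
        return qe.execConstruct(); // never closed!
    }

    // Fixed variant: try-with-resources guarantees close() on all paths,
    // including when execConstruct() throws.
    static String fixed() {
        try (FakeQueryExecution qe = new FakeQueryExecution()) {
            return qe.execConstruct();
        }
    }

    public static void main(String[] args) {
        leaky();
        fixed();
        // one connection leaked by leaky(), none by fixed()
        System.out.println("leaked connections: " + FakeQueryExecution.openConnections.get());
    }
}
```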

ghsnd commented 8 years ago

It's mostly about CONSTRUCT queries.

ghsnd commented 8 years ago

I pushed some changes to eu.freme.eservices.elink.api.DataEnricher, where some resources were not (always) closed. The most important one is on line 92.

@jnehring You have to trigger a Jenkins build, because I seem to have credentials problems (see the console output)

ghsnd commented 8 years ago

Remark: I didn't test! Hope I didn't close too much or too early :)

jnehring commented 8 years ago

I manually triggered the build and it rebuilt the project. I tested e-Link; it still works. I will do the release and then we can see if it fixes the bug.

jnehring commented 8 years ago

I have installed the new version of e-link that @ghsnd created on freme-live during https://github.com/freme-project/technical-discussion/issues/122

Let's hope that this solves the problem.

jnehring commented 8 years ago

The bug is still there, but it is not as problematic as before. Since Friday FREME has been used a lot and now there are 2467 open files. I cannot easily determine the exact number of requests, but under a comparable load the old version would have produced more open files over the weekend and crashed.

ghsnd commented 8 years ago

It could be that there are still other resources not being closed somewhere else in the code. If you can make a heap dump again, I will take a look. Another explanation is that the bug was not fixed, but that it is not triggered as much as before.

Or... all partners double-check some part of the code base for resources that are not closed ;)

jnehring commented 8 years ago

I restarted FREME live so it does not crash, so all open files are gone. I will do another heap dump, but I need to wait a couple of days until there are some open files again.

jnehring commented 8 years ago

I found one more stream that is not closed properly in the SPARQL converters controller and added a finally statement to it. It is a stream that writes a SPARQL SELECT query's result set to CSV, XML or JSON format.
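The pattern of that fix, sketched with a plain OutputStream standing in for the SPARQL result writer (the CSV serialisation below is just illustrative, not the actual converter code):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class CsvWriteDemo {
    // Writes rows as CSV; the finally block guarantees the stream is closed
    // even if writing throws, so no file handle can leak.
    static void writeCsv(OutputStream rawOut, String[][] rows) throws IOException {
        Writer out = new OutputStreamWriter(rawOut, StandardCharsets.UTF_8);
        try {
            for (String[] row : rows) {
                out.write(String.join(",", row));
                out.write('\n');
            }
            out.flush();
        } finally {
            out.close(); // also closes rawOut
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeCsv(buf, new String[][] {{"s", "p", "o"}, {"a", "b", "c"}});
        System.out.print(buf.toString("UTF-8")); // prints the two CSV lines
    }
}
```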

ghsnd commented 8 years ago

Closed some resources here.

jnehring commented 8 years ago

Thanks @ghsnd. I wanted to create the heap dump today, but the too-many-open-files bug happened again, and while it is happening I cannot create the heap dump. Will try again...

jnehring commented 8 years ago

I created a new memory dump, in a state where the system had 3506 open files and was shortly before crashing. However, the latest stream fixes that @ghsnd and I made here are still not installed on freme-live. So next week, after the Semantics conference, I want to release and install the latest state on freme-live, because maybe the bug has already been fixed.

jnehring commented 8 years ago

I added a script to the crontab of freme-live that restarts freme-live every midnight, as a temporary solution to the problem.
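For reference, such a crontab entry could look like this (a sketch; the script path and log path are assumptions, not the actual setup):

```
# restart FREME every night at midnight
0 0 * * * /home/freme/bin/restart-freme.sh >> /var/log/freme-restart.log 2>&1
```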

jnehring commented 8 years ago

I installed a new version of the broker and FREME NER and switched off the cronjob that restarts FREME. Let's keep our fingers crossed that this solves the issue.

jnehring commented 8 years ago

About 20 hours after the release the number of open files was 2300, so this problem is still not fixed. I switched the script that restarts FREME every midnight back on.

ghsnd commented 8 years ago

Today I found this, since it's the number of 'pipes' that increases... To be checked soon.

ghsnd commented 8 years ago

A dump of the threads reveals a lot of this kind of threads (and their number grows over time):

"I/O dispatcher 480" #639 prio=5 os_prio=0 tid=0x00007fa9b40f7800 nid=0x634b runnable [0x00007fa92ced1000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x000000076f614530> (a sun.nio.ch.Util$2)
        - locked <0x000000076f614520> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000076f6143e8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:256)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:105)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:584)
        at java.lang.Thread.run(Thread.java:745)

This doesn't say much, but AbstractMultiworkerIOReactor is used in CloseableHttpAsyncClientBase, which is used in the Spring MVC framework and in the Spring HTTP client. To be investigated further...

jnehring commented 8 years ago

I read an article about JDBC database connections in high-concurrency scenarios and therefore added the following configuration to freme-live. Maybe the open connections are database connections. Now there is a maximum number of open database connections, and no abandoned database connection will stay alive for more than 10 seconds.

spring.datasource.max-active=200
spring.datasource.remove-abandoned=true
spring.datasource.remove-abandoned-timeout=10
spring.datasource.log-abandoned=true