cjmamo / kafka-web-console

A web console for Apache Kafka (retired)
Apache License 2.0
762 stars 246 forks

Kafka web console freezing/stopping or dying too frequently. #57

Open alecinvan opened 9 years ago

alecinvan commented 9 years ago

Kafka web console is freezing/stopping or dying too frequently. I don't think it's a problem at the OS level; it seems to be a problem at the application level. I've already raised the open file handle limit to 98000 for all users and set time_wait to 30s instead of the default 5 minutes.
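A limit raised for login users is not necessarily the one a daemonized process ends up with, so it may be worth confirming what the running console process actually sees. Below is a minimal sketch, assuming a Linux box with `/proc` available; the helper names `parse_max_open_files` and `count_open_fds` are hypothetical and not part of kafka-web-console.

```python
from pathlib import Path


def parse_max_open_files(limits_text):
    """Return the (soft, hard) 'Max open files' values from the text of
    /proc/<pid>/limits, as strings (a value may be 'unlimited').
    Returns None if the row is not present."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns: soft limit, hard limit, units
            fields = line[len(label):].split()
            return fields[0], fields[1]
    return None


def count_open_fds(pid):
    """Count the descriptors currently open in a process by listing
    /proc/<pid>/fd (Linux only; needs permission to read that directory)."""
    return sum(1 for _ in Path(f"/proc/{pid}/fd").iterdir())
```

With the console's PID in hand, `parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())` shows the effective limit, and calling `count_open_fds(pid)` repeatedly shows whether the descriptor count keeps climbing toward it.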

From what I can see from the logs, it starts with play:

[error] play - Cannot invoke the action, eventually got an error: java.lang.RuntimeException: Exception while executing statement : IO Exception: "java.io.IOException: Too many open files"; "/etc/kafka-web-console/play"; SQL statement: delete from offsetPoints where (offsetPoints.offsetHistoryId = ?) [90031-172] errorCode: 90031, sqlState: 90031

Caused by: java.lang.RuntimeException: Exception while executing statement : IO Exception: "java.io.IOException: Too many open files"; "/etc/kafka-web-console/play"; SQL statement: delete from offsetPoints where (offsetPoints.offsetHistoryId = ?) [90031-172] errorCode: 90031, sqlState: 90031 delete from offsetPoints

Then this seems to cause socket connection errors:

Caused by: java.io.IOException: Too many open files
    at java.io.UnixFileSystem.createFileExclusively(Native Method) ~[na:1.7.0_75]
    at java.io.File.createNewFile(File.java:1006) ~[na:1.7.0_75]
    at org.h2.store.fs.FilePathDisk.createTempFile(FilePathDisk.java:367) ~[h2.jar:1.3.172]
    at org.h2.store.fs.FileUtils.createTempFile(FileUtils.java:329) ~[h2.jar:1.3.172]
    at org.h2.engine.Database.createTempFile(Database.java:1529) ~[h2.jar:1.3.172]
    at org.h2.result.RowList.writeAllRows(RowList.java:90) ~[h2.jar:1.3.172]
[debug] application - Getting partition leaders for topic topic-exist-test
[debug] application - Getting partition leaders for topic topic-rep-3-test
[debug] application - Getting partition leaders for topic PofApiTest
[debug] application - Getting partition leaders for topic PofApiTest-2
[debug] application - Getting partition leaders for topic fileread
[debug] application - Getting partition leaders for topic pageview
[debug] application - Getting partition log sizes for topic topic-exist-test from partition leaders 10.100.71.42:9092, 10.100.71.42:9092, 10.100.71.42:9092, 10.100.71.42:9092, 10.100.71.42:9092, 10.100.71.42:9092, 10.100.71.42:9092, 10.100.71.42:9092
[warn] application - Could not connect to partition leader 10.100.71.42:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader 10.100.71.42:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader 10.100.71.42:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader 10.100.71.42:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader 10.100.71.42:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader 10.100.71.42:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader 10.100.71.42:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader 10.100.71.42:9092. Error message: Failed to open a socket.
[debug] application - Getting partition offsets for topic topic-exist-test

-jar:9092, exemplary-birds:9092, voluminous-mass:9092
[warn] application - Could not connect to partition leader voluminous-mass:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader exemplary-birds:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader harmful-jar:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader voluminous-mass:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader exemplary-birds:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader harmful-jar:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader voluminous-mass:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader exemplary-birds:9092. Error message: Failed to open a socket.
[debug] application - Getting partition offsets for topic PofApiTest
[debug] application - Getting partition log sizes for topic topic-rep-3-test from partition leaders exemplary-birds:9092, voluminous-mass:9092, harmful-jar:9092, exemplary-birds:9092, voluminous-mass:9092, harmful-jar:9092, exemplary-birds:9092, voluminous-mass:9092
[debug] application - Getting partition log sizes for topic fileread from partition leaders voluminous-mass:9092, harmful-jar:9092, exemplary-birds:9092, voluminous-mass:9092, harmful-jar:9092, exemplary-birds:9092, voluminous-mass:9092, harmful-jar:9092
[warn] application - Could not connect to partition leader exemplary-birds:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader harmful-jar:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader voluminous-mass:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader exemplary-birds:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader harmful-jar:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader voluminous-mass:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader exemplary-birds:9092. Error message: Failed to open a socket.
[warn] application - Could not connect to partition leader harmful-jar:9092. Error message: Failed to open a socket.
[debug] application - Getting partition offsets for topic PofApiTest-2
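To see which partition leaders account for the failures in a log like this, the warnings can be tallied per host. This is a rough sketch that assumes the message wording stays exactly as in the excerpt above; the function name `failed_leaders` is hypothetical. It also strips ANSI colour sequences if present, whether raw or rendered as a literal `ESC[..m`.

```python
import re
from collections import Counter

# ANSI colour codes, either as a real escape byte or rendered as the text "ESC"
ANSI_RE = re.compile(r"(?:\x1b|ESC)\[[0-9;]*m")

# Captures the host:port from the warning text seen in the log excerpt
WARN_RE = re.compile(
    r"Could not connect to partition leader (\S+?)\.\s+Error message"
)


def failed_leaders(log_text):
    """Count 'Could not connect to partition leader' warnings per leader."""
    clean = ANSI_RE.sub("", log_text)
    return Counter(WARN_RE.findall(clean))
```

Calling `failed_leaders(open(path).read()).most_common()` on the console log (with `path` being wherever that log lives) lists the leaders by failure count. If every leader fails about equally, that points to the console running out of descriptors rather than to a single broker.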

Then this leads to time_wait on the monitoring box to the production server:

1 tcp6 0 0 10.100.68.48:35050 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35051 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35055 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35057 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35064 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35065 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35066 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35073 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35074 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35075 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35085 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35088 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35100 10.100.98.100:9092 TIME_WAIT
1 tcp6 0 0 10.100.68.48:35103 10.100.98.100:9092 TIME_WAIT
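A listing like this is easier to track over time when collapsed into a count per remote endpoint and state. Here is a small sketch, assuming the column order shown above (protocol, two queue counters, local address, remote address, state); the function name `tally_states` is hypothetical.

```python
from collections import Counter


def tally_states(netstat_text):
    """Count (remote endpoint, state) pairs in netstat-style output.

    Works on whitespace-separated tokens, so it accepts both one
    connection per line and output that was flattened onto one line.
    A leading count column, as in the listing above, is ignored.
    """
    tokens = netstat_text.split()
    counts = Counter()
    for i, tok in enumerate(tokens):
        # A protocol token (tcp, tcp6) marks the start of a connection row
        if tok.startswith("tcp") and i + 5 < len(tokens):
            remote, state = tokens[i + 4], tokens[i + 5]
            counts[(remote, state)] += 1
    return counts
```

Sampling this every few seconds would show whether the number of TIME_WAIT sockets to port 9092 rises just before each freeze and drains before the restart.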

But that only lasts for about 30s to 1 minute. Then supervisord seems to restart the web console once these time_waits go away, or once the sockets and files are properly closed or flushed by either play/webconsole or Kafka.