migalkin closed this issue 4 years ago
That happens when the server becomes overloaded. In order to avoid this, I recommend starting the server with multiple workers, e.g.
ldf-server config.json 3000 4
to start the server on port 3000 with 4 worker threads. TPF has been designed with caching in mind, so a caching server is a must for a good comparison.
For your information, here is the cache config from fragments.dbpedia.org for NGINX.
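For readers without access to that config: a minimal NGINX reverse-proxy cache in front of a TPF server typically looks like the sketch below. This is illustrative only, not the actual fragments.dbpedia.org configuration; the paths, zone name, and timings are assumptions.

```nginx
# Illustrative cache config: proxy_cache_path/proxy_cache/proxy_cache_valid
# are standard NGINX directives; all values here are examples.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=tpf:10m
                 max_size=1g inactive=60m;

server {
    listen 80;
    location / {
        proxy_pass        http://127.0.0.1:3000;  # the LDF server
        proxy_cache       tpf;
        proxy_cache_valid 200 60m;  # cache successful fragment responses
    }
}
```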
I'm getting the same error
events.js:183
throw er; // Unhandled 'error' event
^
Error: socket hang up
at createHangUpError (_http_client.js:331:15)
at Socket.socketCloseListener (_http_client.js:363:23)
at emitOne (events.js:121:20)
at Socket.emit (events.js:211:7)
at TCP._handle.close [as _onclose] (net.js:554:12)
Server is running on port 3000 with 4 threads on a machine with 4 CPU and 16 GB memory. Caching with Apache is enabled (and working).
This always happens after a long client runtime of about two hours. I keep inspecting memory and CPU usage but cannot see any overloading (I have only one machine, with server, client and other processes running in parallel). Looking at the logging output, however, it seems that over time requests are served more and more slowly.
Do you see any evidence of a high number of open connections?
I guess so:
> netstat -s
[...]
Tcp:
217278 active connections openings
189717 passive connection openings
649 failed connection attempts
1114 connection resets received
28 connections established
273149138 segments received
274075086 segments send out
396620 segments retransmited
0 bad segments received.
5671 resets sent
[...]
The number of active/passive connection openings keeps increasing while the client is working.
^ that's it. This right there is the main issue. Now what I want to figure out is whether these connections are between Apache and the LDF server, or between the client and Apache. Any insight there?
How can I check this?
Is your client running on a different machine? If so, keep an eye on the connections of that machine.
Otherwise, I wonder whether Apache can give you stats about this. A full netstat view should also show the from and to of the connections.
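One way to separate the two connection pools is to group by the foreign port: Apache-to-LDF connections end in :3000, client-to-Apache connections in :80. A sketch (the sample lines stand in for real `netstat -tn` output; pipe live data in with `netstat -tn` instead):

```shell
# Tally connections by destination port from netstat-style output.
# Column 5 is the foreign address; the sample below is illustrative.
sample='tcp 0 0 10.44.3.57:42800 10.44.3.57:3000 CLOSE_WAIT
tcp 0 0 10.44.3.57:80 10.44.3.57:55014 TIME_WAIT
tcp 0 0 10.44.3.57:55014 10.44.3.57:80 ESTABLISHED'
echo "$sample" | awk '
  $5 ~ /:3000$/ { ldf++ }   # Apache -> LDF server
  $5 ~ /:80$/   { web++ }   # client -> Apache
  END { printf "to :3000 = %d, to :80 = %d\n", ldf+0, web+0 }'
# prints: to :3000 = 1, to :80 = 1
```

Running the same filter periodically shows which pool is the one that grows over time.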
The client is running on the same machine. Is this helpful?
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 0 0 10.44.3.57:22 10.96.15.110:50120 ESTABLISHED 10438/sshd: prod [p keepalive (6194,05/0/0)
tcp 0 0 10.44.3.57:22 10.96.15.110:57748 ESTABLISHED 13048/sshd: prod [p keepalive (1000,32/0/0)
tcp 1 0 10.44.3.57:42800 10.44.3.57:3000 CLOSE_WAIT 28575/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55014 TIME_WAIT - timewait (17,10/0/0)
tcp 1 0 10.44.3.57:42806 10.44.3.57:3000 CLOSE_WAIT 28572/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:42824 10.44.3.57:3000 ESTABLISHED 28567/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55012 TIME_WAIT - timewait (12,81/0/0)
tcp 0 36 10.44.3.57:22 10.96.15.110:60343 ESTABLISHED 14345/sshd: prod [p on (0,31/0/0)
tcp 0 0 10.44.3.57:22 10.96.15.110:51483 ESTABLISHED 10939/sshd: prod [p keepalive (2179,97/0/0)
tcp 0 0 10.44.3.57:42822 10.44.3.57:3000 ESTABLISHED 28581/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55045 ESTABLISHED 28567/apache2 keepalive (7193,47/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55033 TIME_WAIT - timewait (39,21/0/0)
tcp 0 0 10.44.3.57:55039 10.44.3.57:80 ESTABLISHED 29662/node keepalive (0,40/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55042 ESTABLISHED 29690/apache2 keepalive (7193,47/0/0)
tcp 0 0 10.44.3.57:22 10.96.15.110:60344 ESTABLISHED 14348/sshd: prod [p keepalive (246,66/0/0)
tcp 1 0 10.44.3.57:42811 10.44.3.57:3000 CLOSE_WAIT 28566/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55023 TIME_WAIT - timewait (24,08/0/0)
tcp 0 0 10.44.3.57:22 10.96.15.110:51479 ESTABLISHED 10937/sshd: prod [p keepalive (2573,18/0/0)
tcp 0 0 10.44.3.57:22 10.96.15.110:60014 ESTABLISHED 14042/sshd: prod [p keepalive (6570,88/0/0)
tcp 1 0 10.44.3.57:42812 10.44.3.57:3000 CLOSE_WAIT 29693/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55036 TIME_WAIT - timewait (44,60/0/0)
tcp 0 0 10.44.3.57:42818 10.44.3.57:3000 ESTABLISHED 28199/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:42814 10.44.3.57:3000 ESTABLISHED 29683/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:55045 10.44.3.57:80 ESTABLISHED 29662/node keepalive (0,52/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55024 TIME_WAIT - timewait (24,25/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55020 TIME_WAIT - timewait (19,50/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55031 TIME_WAIT - timewait (38,73/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55035 TIME_WAIT - timewait (46,38/0/0)
tcp 0 0 10.44.3.57:22 10.96.15.110:57777 ESTABLISHED 13179/sshd: prod [p keepalive (1262,46/0/0)
tcp 0 0 10.44.3.57:55047 10.44.3.57:80 ESTABLISHED 29662/node keepalive (0,53/0/0)
tcp 1 0 10.44.3.57:42801 10.44.3.57:3000 CLOSE_WAIT 29694/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55041 ESTABLISHED 28199/apache2 keepalive (7193,47/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55047 ESTABLISHED 28581/apache2 keepalive (7193,47/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55039 ESTABLISHED 29683/apache2 keepalive (7193,47/0/0)
tcp 0 0 10.44.3.57:22 10.96.15.110:57750 ESTABLISHED 13050/sshd: prod [p keepalive (836,48/0/0)
tcp 0 0 10.44.3.57:22 10.96.15.110:60015 ESTABLISHED 14045/sshd: prod [p keepalive (6194,05/0/0)
tcp 1 0 10.44.3.57:42795 10.44.3.57:3000 CLOSE_WAIT 28408/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:55041 10.44.3.57:80 ESTABLISHED 29662/node keepalive (0,53/0/0)
tcp 1 0 10.44.3.57:42803 10.44.3.57:3000 CLOSE_WAIT 29684/apache2 off (0.00/0/0)
tcp 1 0 10.44.3.57:42808 10.44.3.57:3000 CLOSE_WAIT 28414/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:55042 10.44.3.57:80 ESTABLISHED 29662/node keepalive (0,40/0/0)
tcp 0 0 10.44.3.57:42823 10.44.3.57:3000 ESTABLISHED 29690/apache2 off (0.00/0/0)
tcp 0 0 10.44.3.57:22 10.96.15.110:57778 ESTABLISHED 13181/sshd: prod [p keepalive (803,71/0/0)
tcp 0 0 10.44.3.57:80 10.44.3.57:55028 TIME_WAIT - timewait (34,12/0/0)
tcp 0 0 10.44.3.57:22 10.96.15.110:50116 ESTABLISHED 10435/sshd: prod [p keepalive (6701,95/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42791 TIME_WAIT - timewait (40,27/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42806 FIN_WAIT2 - timewait (43,35/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42808 FIN_WAIT2 - timewait (43,70/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42801 FIN_WAIT2 - timewait (27,60/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42790 TIME_WAIT - timewait (37,34/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42814 ESTABLISHED 19130/node off (0.00/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42822 ESTABLISHED 19130/node off (0.00/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42812 FIN_WAIT2 - timewait (48,03/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42773 TIME_WAIT - timewait (24,77/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42811 FIN_WAIT2 - timewait (48,71/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42784 TIME_WAIT - timewait (24,50/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42824 ESTABLISHED 19136/node off (0.00/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42800 FIN_WAIT2 - timewait (28,83/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42796 TIME_WAIT - timewait (45,80/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42774 TIME_WAIT - timewait (17,44/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42770 TIME_WAIT - timewait (20,61/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42817 TIME_WAIT - timewait (57,22/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42795 FIN_WAIT2 - timewait (23,20/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42820 TIME_WAIT - timewait (57,88/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42818 ESTABLISHED 19136/node off (0.00/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42803 FIN_WAIT2 - timewait (38,42/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42799 TIME_WAIT - timewait (12,54/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42823 ESTABLISHED 19134/node off (0.00/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42792 TIME_WAIT - timewait (15,49/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42771 TIME_WAIT - timewait (14,75/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42789 TIME_WAIT - timewait (43,67/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42762 TIME_WAIT - timewait (12,85/0/0)
tcp6 0 0 10.44.3.57:3000 10.44.3.57:42804 TIME_WAIT - timewait (47,92/0/0)
Thanks! I see lots of connections between Apache and the LDF server. Apache should recycle connections; I wonder why that's not happening, and whether the problem is on the Apache side or the LDF server side.
Then maybe it is related to https://serverfault.com/questions/538988/apache-2-4-not-closing-connections (I'm running Apache 2.4).
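If backend connection reuse turns out to be the culprit, mod_proxy can be told not to keep connections to the LDF server open. The sketch below uses documented Apache 2.4 ProxyPass parameters; the path and values are illustrative assumptions, not this setup's actual config:

```apache
# Illustrative vhost fragment: disablereuse=On forces a fresh backend
# connection per request, so half-closed sockets to :3000 don't pile up;
# ttl=60 would instead expire pooled connections after 60 seconds.
ProxyPass        /fragments http://127.0.0.1:3000/ disablereuse=On
ProxyPassReverse /fragments http://127.0.0.1:3000/
```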
I wonder whether the client is to blame, i.e., whether the same thing would also occur if the identical series of requests were made through curl. We need to run some tests on this.
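Such a test could replay a fixed series of fragment requests with curl while watching the connection counts. A sketch, where the endpoint URL and page parameter are placeholders for your own fragment URLs:

```shell
# Replay a fixed series of requests outside the LDF client.
# ENDPOINT is a placeholder; point it at your own fragments interface.
ENDPOINT="http://localhost:80/dataset"
for i in $(seq 1 5); do
  # -w prints the status code and total time, even if the transfer fails
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" "$ENDPOINT?page=$i"
done
```

Comparing the netstat picture during this loop against one taken while the LDF client runs would show whether the client's connection handling is at fault.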
Looks like caching was my problem. After I disabled caching, the error does not occur anymore. While filling the cache with hundreds of thousands of requests, htcacheclean wasn't able to keep memory/disk usage at a reasonable level (maybe because of a too high value for cache expiration):
[Tue Mar 06 10:26:20.790165 2018] [cache_disk:warn] [pid 28879] (28)No space left on device: [client 10.44.3.57:40287] AH00721: could not create vary file /srv/zdb/cache/aptmpQy5pny
This could mean that I was running out of inodes, which I suppose has something to do with CacheDirLength and CacheDirLevels.
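"No space left on device" from mod_cache_disk while bytes remain free is indeed a classic symptom of inode exhaustion, and CacheDirLevels/CacheDirLength bound how far the cache fans out into directories. A quick sanity check (the cache path comes from the log line above; the 2/2 values are just an example, not this server's actual settings):

```shell
# Check free inodes on the cache filesystem (IUse% at 100% = none left):
#   df -i /srv/zdb/cache
#
# Each directory-name character encodes one of 64 values, so the leaf
# directory fan-out of mod_cache_disk is 64^(CacheDirLevels * CacheDirLength).
# Example with CacheDirLevels 2 and CacheDirLength 2:
echo $((64 * 64 * 64 * 64))   # 16777216 possible leaf directories
```

Running htcacheclean in daemon mode with a size limit (its -d and -l options) instead of periodically may also keep the cache bounded between runs.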
Interesting. Thanks for sharing.
This project has now been deprecated in favor of Comunica, where this should not be a problem anymore. If it is, feel free to open a new issue there.
Still doing experiments on FedBench on top of LDF with HDT. The datasets are taken from the LDF website.
Now we observe sporadic crashes of the LDF client with a strange 'socket hang up' error. For example, on the LinkedMDB endpoint:
Then, the unified all-in-one Fedbench:
What might be the source of the error?