denjones opened this issue 8 years ago
When a task fails or is retried, you can go to the active tasks and task detail pages to view the reason.
Most failed/retried tasks are "Connection timed out".
I removed the 'Connection': 'keep-alive' header; however, the CLOSE_WAIT count is still getting high. I checked the details of the retried tasks and found that most of them are still sending the 'Connection': 'keep-alive' header. So how can I reload the crawling config for the pending tasks?
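For reference, the header was set in the project's crawl_config, roughly like this (a sketch from memory; the proxy address and URLs are just placeholders):

```python
# Sketch of my project config from memory; values (proxy address, URLs) are
# illustrative, not the real ones.
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:3128',  # local squid
        'headers': {
            'User-Agent': 'Mozilla/5.0',
            # 'Connection': 'keep-alive',  # removed here, but tasks already
            #                              # queued in taskdb keep the old header
        },
    }

    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.index_page)
```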
Connection timeout with the local squid? Is squid running normally? If you have a list of the old URLs, submit them again.
Squid is running normally. I think the connections time out because it is running out of connection resources and the time spent waiting for a free connection exceeds 120s.
Then should I go through taskdb and remove all tasks with the 'Connection': 'keep-alive' header?
And one more question: how is the 'Connection': 'keep-alive' header supposed to behave in pyspider's fetchers? In a browser, this header can save connection resources by reusing the connection to the same host, but pyspider does not seem to reuse the connection and just keeps it open?
I cannot confirm that the keep-alive header causes the issue. And the fetcher should not run out of connection resources: there is a poolsize parameter in the fetcher, and when the pool is exhausted it will not accept new tasks.
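Just to illustrate what I mean by poolsize — a rough sketch; the constructor argument names are from memory and may differ between versions:

```python
# Sketch only: argument names are assumptions from memory, check your
# pyspider version before relying on them.
from pyspider.fetcher.tornado_fetcher import Fetcher

# The fetcher keeps at most `poolsize` requests in flight; once the pool is
# full it stops pulling new tasks from its input queue instead of opening
# more connections.
fetcher = Fetcher(inqueue=None, outqueue=None, poolsize=100)
```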
You can stop the projects that use keep-alive and see if that solves the problem.
Well, it seems the keep-alive header is part of the reason. I stopped all the old projects with the keep-alive header and left only new projects running without it, and the CLOSE_WAIT count stays around 200. But the new projects have a high success rate and far fewer failed tasks than the old projects, so I don't know whether the failed/retried tasks caused the CLOSE_WAIT connections or just the keep-alive header did.
I found an old libcurl thread; it seems libcurl uses a connection cache to maintain the connections when keep-alive is enabled. http://comments.gmane.org/gmane.comp.web.curl.library/25399
My guess is that when the keep-alive header is not set via the appropriate API, libcurl may not be able to maintain the connections correctly: when it tries to reuse them, their states are not what it expects.
As pyspider uses tornado's HTTP client rather than libcurl directly, and is not willing to control the connection process itself, removing "dangerous" headers such as Connection and Content-Length would be a workaround.
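Something along these lines, i.e. dropping connection-control headers from the task before the request is built (a sketch of the idea, not the actual pyspider code; the task structure in the usage comment is an assumption):

```python
# Sketch of the workaround idea, not the actual pyspider implementation:
# drop hop-by-hop / framing headers that the HTTP client should manage itself.
DANGEROUS_HEADERS = ('connection', 'content-length', 'transfer-encoding')


def sanitize_headers(headers):
    """Return a copy of the task headers without connection-control headers."""
    return {k: v for k, v in headers.items()
            if k.lower() not in DANGEROUS_HEADERS}


# usage (task structure is an assumption for illustration):
# fetch_headers = sanitize_headers(task.get('fetch', {}).get('headers', {}))
```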
Sadly, the CLOSE_WAIT count still gets high. I am giving up and will just restart all fetcher processes hourly.
Pyspider accumulates tens of thousands of CLOSE_WAIT connections on my machine, which leads to a lot of failed tasks due to timeouts.
Since I am not so familiar with TCP connections, I tried modifying squid settings and sysctl to raise the maximum connection count, but that didn't help. Finally, I looked up what CLOSE_WAIT means, which turned out to be that pyspider received a FIN from the squid proxy but keeps the connection open on its side. I restarted all pyspider processes and the CLOSE_WAIT connections were all gone.
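For anyone who wants to check the same thing, this is roughly how I counted the CLOSE_WAIT sockets (requires the psutil package; run with enough privileges to see other users' connections):

```python
# Count system-wide TCP sockets stuck in CLOSE_WAIT (requires psutil).
import psutil

close_wait = [c for c in psutil.net_connections(kind='tcp')
              if c.status == psutil.CONN_CLOSE_WAIT]
print('CLOSE_WAIT sockets: %d' % len(close_wait))
```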