binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

Meets a lot of CLOSE_WAIT #456

Open denjones opened 8 years ago

denjones commented 8 years ago

Pyspider accumulates tens of thousands of CLOSE_WAIT connections on my machine, which leads to a lot of timed-out, failed tasks.


Since I am not so familiar with TCP connections, I tried modifying the squid settings and sysctl to raise the maximum connection count, but that didn't help. Finally, I searched for this and found:

CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection.

The problem is your program running on the local machine is not closing the socket. It is not a TCP tuning issue. A connection can (and quite correctly) stay in CLOSE_WAIT forever while the program holds the connection open.

Which means that pyspider received a FIN from the squid proxy but kept the connection open. I restarted all pyspider processes and the CLOSE_WAIT connections were all gone.
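A quick way to confirm this diagnosis is to count socket states on the machine. The sketch below parses the output of the Linux `ss -tan` command (whose first column is the TCP state); the helper name and the sample output are illustrative, not part of pyspider:

```python
from collections import Counter

def count_tcp_states(ss_output):
    """Count TCP socket states from `ss -tan` output.

    Assumes the Linux `ss` format: a header line, then one socket per
    line with the state in the first column.
    """
    counts = Counter()
    for line in ss_output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts

if __name__ == "__main__":
    # On a live machine you would feed it real output, e.g.:
    #   subprocess.check_output(["ss", "-tan"], text=True)
    sample = (
        "State      Recv-Q Send-Q Local Address:Port Peer Address:Port\n"
        "ESTAB      0      0      10.0.0.1:51234     10.0.0.2:3128\n"
        "CLOSE-WAIT 0      0      10.0.0.1:51235     10.0.0.2:3128\n"
        "CLOSE-WAIT 0      0      10.0.0.1:51236     10.0.0.2:3128\n"
    )
    print(count_tcp_states(sample))
```

If CLOSE-WAIT dominates the counts and the peer address is the squid proxy, the problem is on the pyspider side, as the quote above explains.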

binux commented 8 years ago
  1. What pyspider version are you using?
  2. Did you modify the fetcher?
  3. What is the fetcher poolsize?
  4. Are these fetches failed/pending on the pyspider dashboard?
  5. Did you pass a keep-alive header to it?
denjones commented 8 years ago
  1. 0.3.7
  2. no
  3. default
  4. when the CLOSE_WAIT count gets high, almost all tasks become retrying/failed tasks in the dashboard: nearly all red and yellow, with only a little blue and green. They turn green after I restart all fetcher processes. After some hours the CLOSE_WAIT count gets high again, so I have to restart the fetchers again and again.
  5. Oooooh, I did set 'Connection': 'keep-alive' in some headers. I will remove it and see if it helps.
binux commented 8 years ago

When a task fails or is retried, you can go to the active tasks and task detail pages to see the reason.

denjones commented 8 years ago

Most failed/retried tasks are "Connection timed out".

I removed the 'Connection': 'keep-alive' header; however, the CLOSE_WAIT count is still getting high. I checked the details of the retried tasks and found that most of them are still sending the 'Connection': 'keep-alive' header. So how can I reload the crawl config of the pending tasks?

binux commented 8 years ago

Connection timeout with the local squid? Is squid running normally? If you have a list of the old URLs, submit them again.

denjones commented 8 years ago

Squid is running normally. I think the connections time out because pyspider is running out of connection resources, and the time spent waiting for a connection exceeds 120s.

Then should I go through the taskdb and remove every task with a 'Connection': 'keep-alive' header?
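Something like the predicate below could identify those tasks. This is a hypothetical helper, not pyspider code, and it assumes that a task dict stores its fetch options under `task['fetch']['headers']`; verify the actual schema in your taskdb before relying on it:

```python
def has_keep_alive(task):
    """Return True if a task dict carries a Connection: keep-alive header.

    Assumes fetch options live under task['fetch']['headers']
    (hypothetical helper; check the real taskdb schema first).
    Header name and value are matched case-insensitively.
    """
    headers = task.get("fetch", {}).get("headers") or {}
    return any(
        key.lower() == "connection" and str(value).lower() == "keep-alive"
        for key, value in headers.items()
    )
```

It could be used to filter tasks read from the taskdb and decide which ones to delete or resubmit without the header.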

And one more question: what is the 'Connection': 'keep-alive' header supposed to do in pyspider fetchers? In a browser, this header can save connection resources by reusing connections to the same host, but pyspider seems not to reuse them and just keeps them open?

binux commented 8 years ago

I cannot confirm that the keep-alive header causes the issue. And the fetcher should not run out of connection resources: there is a poolsize parameter in the fetcher, and when the pool is exhausted it will not accept new tasks.
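For reference, the pool size is set when starting the fetcher component; the sketch below assumes the pyspider 0.3.x command line, where `--poolsize` defaults to 100 (flag name and default are from memory, so check `pyspider fetcher --help` on your install):

```shell
# Start the fetcher with an explicit connection pool size.
# --poolsize caps concurrent fetches; a full pool makes the fetcher
# stop accepting new tasks rather than open more connections.
pyspider fetcher --poolsize 100
```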

You can stop the projects that use keep-alive and see if that solves the problem.

denjones commented 8 years ago

Well, it seems that the keep-alive header is part of the reason. I stopped all the old projects with the keep-alive header and left only new projects running without it, and the CLOSE_WAIT count stays around 200. But the new projects have a high success rate, so there aren't as many failed tasks as in the old projects. I don't know whether the failed/retried tasks caused the CLOSE_WAIT connections or just the keep-alive header did.

binux commented 8 years ago

Found an old libcurl thread; it seems libcurl uses a connection cache to maintain connections when keep-alive is enabled. http://comments.gmane.org/gmane.comp.web.curl.library/25399

My guess is that when the keep-alive header is not set through the appropriate API, libcurl may not be able to maintain the connections correctly; when it tries to reuse them, their states are not what it expects.

As pyspider uses tornado's HTTP client rather than libcurl directly, and is not willing to control the connection process, removing "dangerous" headers such as Connection and Content-Length would be a workaround.
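That workaround could be sketched as a small filter applied to request headers before they reach the HTTP client. This is an illustrative helper, not pyspider code; the header set below covers the two headers named above plus Transfer-Encoding, another hop-by-hop header that clients normally manage themselves:

```python
# Connection-management headers that should be left to the HTTP client
# rather than set per-request (illustrative, not part of pyspider).
DANGEROUS_HEADERS = {"connection", "content-length", "transfer-encoding"}

def strip_dangerous_headers(headers):
    """Return a copy of `headers` without connection-management headers,
    matched case-insensitively."""
    return {
        key: value
        for key, value in headers.items()
        if key.lower() not in DANGEROUS_HEADERS
    }
```

Applied to a crawl's headers dict, this would let the underlying client decide for itself whether and how to keep connections alive.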

denjones commented 8 years ago

Sadly, the CLOSE_WAIT count is still getting high. I am giving up and will just restart all fetcher processes hourly.
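If that hourly restart is the chosen stopgap, a crontab entry can automate it. The sketch below assumes the fetchers run under supervisord with the program name `pyspider-fetcher`; both the supervisor setup and the program name are hypothetical and depend on how the processes are actually managed:

```shell
# Crontab entry: restart the fetcher processes at the top of every hour.
# Assumes a supervisord-managed process named "pyspider-fetcher" (hypothetical).
0 * * * * supervisorctl restart pyspider-fetcher
```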