MaxPensel / OWS-scrapy-wrapper

Ready-to-use scrapy wrapper targeting .json specification files as produced by OpenWebScraper
GNU General Public License v3.0

Crawler temporarily gets stuck #7

Open · ipsilio opened this issue 4 years ago

ipsilio commented 4 years ago

Hi Max,

I have just noticed (once again) that the crawler sometimes gets stuck towards the end of a crawl and keeps repeating the "Crawled x pages (x pages/min)" status message. A copy of the log is attached below. In principle this is not critical, since the crawl continues afterwards, but in this case, for example, it took 10 minutes.

open_web_scraper_1  | [crawlspider] 2020-04-14 13:41:18,027 - INFO - [parse_html] - Matched 1 paragraphs in http://www.uni-corvinus.hu/index.php?id=24978
open_web_scraper_1  | [crawlspider] 2020-04-14 13:41:18,629 - INFO - [parse_html] - Matched 1 paragraphs in http://www.uni-corvinus.hu/index.php?id=69140
open_web_scraper_1  | 2020-04-14 13:41:18,658 - INFO - Ignoring response <500 http://www.bcehok.hu/>: HTTP status code is not handled or not allowed - MainProcess httperror.py:55
open_web_scraper_1  | [crawlspider] 2020-04-14 13:41:19,198 - INFO - [parse_html] - Matched 1 paragraphs in http://www.uni-corvinus.hu/index.php?id=35072
open_web_scraper_1  | [crawlspider] 2020-04-14 13:41:19,268 - INFO - [parse_html] - Matched 1 paragraphs in http://www.uni-corvinus.hu/index.php?id=35071
open_web_scraper_1  | [crawlspider] 2020-04-14 13:41:20,677 - INFO - [parse_html] - Matched 1 paragraphs in http://www.uni-corvinus.hu/index.php?id=35069
open_web_scraper_1  | 2020-04-14 13:41:22,871 - INFO - Ignoring response <500 http://www.uni-corvinus.hu/index.php?id=35070>: HTTP status code is not handled or not allowed - MainProcess httperror.py:55
open_web_scraper_1  | 2020-04-14 13:41:30,471 - INFO - Crawled 211 pages (at 90 pages/min), scraped 178 items (at 80 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:42:30,470 - INFO - Crawled 211 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:43:30,470 - INFO - Crawled 212 pages (at 1 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:44:30,470 - INFO - Crawled 212 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:45:30,470 - INFO - Crawled 212 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:45:43,883 - INFO - Ignoring response <504 https://palyazatkezelo.uni-corvinus.hu/>: HTTP status code is not handled or not allowed - MainProcess httperror.py:55
open_web_scraper_1  | 2020-04-14 13:46:08,694 - ERROR - Error downloading <GET http://vir-portal.uni-corvinus.hu/robots.txt>: TCP connection timed out: 110: Connection timed out. - MainProcess robotstxt.py:95
open_web_scraper_1  | Traceback (most recent call last):
open_web_scraper_1  |   File "/usr/local/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
open_web_scraper_1  |     defer.returnValue((yield download_func(request=request, spider=spider)))
open_web_scraper_1  | twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out.
open_web_scraper_1  | 2020-04-14 13:46:30,470 - INFO - Crawled 213 pages (at 1 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:47:30,472 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
rabbitmq_1          | 2020-04-14 13:39:29.422 [info] <0.2594.0> connection <0.2594.0> (172.22.0.3:37128 -> 172.22.0.2:5672): user 'guest' authenticated and granted access to vhost '/'
open_web_scraper_1  | 2020-04-14 13:48:30,470 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:49:30,471 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:50:30,471 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:51:30,470 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:52:30,471 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:52:42,011 - ERROR - Error downloading <GET http://vir-portal.uni-corvinus.hu/default.aspx> - MainProcess scraper.py:209
open_web_scraper_1  | Traceback (most recent call last):
open_web_scraper_1  |   File "/usr/local/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
open_web_scraper_1  |     defer.returnValue((yield download_func(request=request, spider=spider)))
open_web_scraper_1  | twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out.
open_web_scraper_1  | 2020-04-14 13:52:42,115 - INFO - Closing spider (finished) - MainProcess engine.py:296
open_web_scraper_1  | [crawlspider] 2020-04-14 13:52:42,116 - INFO -  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ CLOSING SPIDER www.uni-corvinus.hu_index.php ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
open_web_scraper_1  | 2020-04-14 13:52:42,126 - INFO - Dumping Scrapy stats:
open_web_scraper_1  | {'downloader/exception_count': 12,
open_web_scraper_1  |  'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 5,
open_web_scraper_1  |  'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 1,
open_web_scraper_1  |  'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 6,
open_web_scraper_1  |  'downloader/req - MainProcess statscollectors.py:47
open_web_scraper_1  | 2020-04-14 13:52:42,129 - INFO - Spider closed (finished) - MainProcess engine.py:327
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:42,160 - INFO - <shared.CrawlSpecification object at 0x7f23b1b9d790>
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:42,162 - INFO - Finalizing paragraph crawl ...
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:42,173 - INFO - data: {'raw': False, 'crawl': 'HU', 'url': 'www.uni-corvinus.hu_index.php', 'filename': 'www.uni-corvinus.hu_index.php'}
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:42,536 - INFO - Result size: 1787610
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:42,540 - INFO - One message required!
open_web_scraper_1  | 2020-04-14 13:52:43,040 - INFO - Trying to establish connection to rabbitmq at rabbitmq... - MainProcess establish_rmq_connection.py:48
open_web_scraper_1  | 2020-04-14 13:52:43,051 - INFO - Pika version 1.0.1 connecting to ('172.22.0.2', 5672) - MainProcess connection_workflow.py:179
open_web_scraper_1  | 2020-04-14 13:52:43,057 - INFO - Socket connected: <socket.socket fd=21, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.22.0.3', 37442), raddr=('172.22.0.2', 5672)> - MainProcess io_services_utils.py:345
open_web_scraper_1  | 2020-04-14 13:52:43,062 - INFO - Streaming transport linked up: (<pika.adapters.utils.io_services_utils._AsyncPlaintextTransport object at 0x7f23a99b54d0>, _StreamingProtocolShim: <SelectConnection PROTOCOL transport=<pika.adapters.utils.io_services_utils._AsyncPlaintextTransport object at 0x7f23a99b54d0> params=<ConnectionParamete - MainProcess connection_workflow.py:428
rabbitmq_1          | 2020-04-14 13:47:51.329 [error] <0.1999.0> closing AMQP connection <0.1999.0> (172.22.0.4:48590 -> 172.22.0.2:5672):
rabbitmq_1          | missed heartbeats from client, timeout: 300s
rabbitmq_1          | 2020-04-14 13:52:43.067 [info] <0.3401.0> accepting AMQP connection <0.3401.0> (172.22.0.3:37442 -> 172.22.0.2:5672)
open_web_scraper_1  | 2020-04-14 13:52:43,083 - INFO - AMQPConnector - reporting success: <SelectConnection OPEN transport=<pika.adapters.utils.io_services_utils._AsyncPlaintextTransport object at 0x7f23a99b54d0> params=<ConnectionParameters host=rabbitmq port=5672 virtual_host=/ ssl=False>> - MainProcess connection_workflow.py:293
open_web_scraper_1  | 2020-04-14 13:52:43,086 - INFO - AMQPConnectionWorkflow - reporting success: <SelectConnection OPEN transport=<pika.adapters.utils.io_services_utils._AsyncPlaintextTransport object at 0x7f23a99b54d0> params=<ConnectionParameters host=rabbitmq port=5672 virtual_host=/ ssl=False>> - MainProcess connection_workflow.py:725
open_web_scraper_1  | 2020-04-14 13:52:43,087 - INFO - Connection workflow succeeded: <SelectConnection OPEN transport=<pika.adapters.utils.io_services_utils._AsyncPlaintextTransport object at 0x7f23a99b54d0> params=<ConnectionParameters host=rabbitmq port=5672 virtual_host=/ ssl=False>> - MainProcess blocking_connection.py:454
open_web_scraper_1  | 2020-04-14 13:52:43,089 - INFO - Connection to rabbitmq established at rabbitmq with hearbeart interval of 300. - MainProcess establish_rmq_connection.py:54
open_web_scraper_1  | 2020-04-14 13:52:43,093 - INFO - Created channel=1 - MainProcess blocking_connection.py:1248
open_web_scraper_1  | 2020-04-14 13:52:43,201 - INFO - Message sent to queue results with routing key results and body {"raw": false, "crawl": "HU", "url": "www.uni-corvinus.hu_index.php", "filename": "www.uni-corvinus.hu_index.php", "data": "url;content;par_language;page_language;origin;depth\nhttp://www.uni-corvinus.hu/index.php?id=35058;\"\n\n\r\n\r\ - MainProcess producer.py:66
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,208 - INFO - Clearing directory result_data.
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,223 - INFO - Clearing directory result_logs.
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,258 - WARNING - Failed to delete result_logs/HU. Reason: [Errno 26] Text file busy: 'scrapy.log'
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,265 - WARNING - Failed to delete result_logs/scrapy.log. Reason: [Errno 26] Text file busy: 'result_logs/scrapy.log'
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,291 - WARNING - Failed to delete result_logs/scrapy_wrapper.log. Reason: [Errno 26] Text file busy: 'result_logs/scrapy_wrapper.log'
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,317 - WARNING - Failed to delete result_logs/www.uni-corvinus.hu_index.php.log. Reason: [Errno 26] Text file busy: 'result_logs/www.uni-corvinus.hu_index.php.log'
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,345 - INFO - Done finalizing paragraph crawl.
open_web_scraper_1  | 2020-04-14 13:52:43,359 - INFO - Scrapy worker finished: True - MainProcess task_consumer.py:31
open_web_scraper_1  | 2020-04-14 13:52:43,359 - INFO - Unsubscribing task queue. - MainProcess task_consumer.py:33
open_web_scraper_1  | 2020-04-14 13:52:43,392 - INFO - Finished processing task - MainProcess task_consumer.py:37

Best regards, Julian

MaxPensel commented 4 years ago

Hi Julian, does the issue occur consistently, i.e. can you reproduce it reliably, e.g. when crawling the same site? Can you also rule out problems with your internet connection? It is not yet clear to me whether the problem lies in OWS-scrapy-wrapper or in scrapy itself, and whether we could even hook into scrapy deeply enough to fix the error (if there is one at all).
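
Judging by the log, the stall brackets exactly the two requests to vir-portal.uni-corvinus.hu that end in "TCP connection timed out: 110", and the final stats dump counts 6 TCPTimedOutError exceptions, which with scrapy's default of two retries per request would fit those two URLs. If that reading is correct, the crawler is simply waiting out scrapy's default timeouts on an unresponsive host. A minimal sketch of the standard scrapy settings that should shorten that tail (illustrative values, not tested against OWS-scrapy-wrapper; they would go into the wrapper's scrapy settings or a spider's custom_settings):

# Illustrative values only, not tested against OWS-scrapy-wrapper.
DOWNLOAD_TIMEOUT = 30       # default is 180 s; give up on unresponsive hosts sooner
RETRY_ENABLED = True        # the default
RETRY_TIMES = 1             # default is 2; fewer retries shorten the worst-case tail
CLOSESPIDER_TIMEOUT = 7200  # optional hard cap on total crawl runtime in seconds (default 0 = off)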

Can you tell whether any content gets "skipped" as a result, i.e. whether pages end up missing from the crawl data, or is it more or less just a pause?
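
One way to check (a generic scrapy sketch, not OWS-scrapy-wrapper code; spider name, URL and method names are placeholders) would be to attach an errback to each request, so that every request that ultimately fails is logged and can be compared against the crawl results:

import scrapy
from twisted.internet.error import DNSLookupError, TCPTimedOutError

class AuditSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate the errback mechanism.
    name = "audit"
    start_urls = ["http://www.uni-corvinus.hu/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        pass  # normal extraction would happen here

    def on_error(self, failure):
        # failure.request is the request that could not be completed
        if failure.check(DNSLookupError, TCPTimedOutError):
            self.logger.warning("Dropped (network error): %s", failure.request.url)
        else:
            self.logger.warning("Dropped (%s): %s",
                                failure.type.__name__, failure.request.url)

Note that the "Ignoring response <500/504 ...>" lines are a separate matter: scrapy's HttpError middleware drops non-2xx responses by design, so those pages will be missing from the results regardless of the hang.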

Regards, Max