MaxPensel / OWS-scrapy-wrapper

Ready-to-use scrapy wrapper targeting .json specification files as produced by OpenWebScraper
GNU General Public License v3.0

Crawler temporarily gets stuck #7

Open · ipsilio opened this issue 4 years ago

ipsilio commented 4 years ago

Hi Max,

I have just noticed (once again) that the crawler sometimes gets stuck towards the end of a crawl and keeps repeating the "Crawled x pages (x pages/min)" status message. A copy of the log is attached below. In principle this is not critical, since the crawl continues afterwards, but in this case, for example, it took 10 minutes.

open_web_scraper_1  | [crawlspider] 2020-04-14 13:41:18,027 - INFO - [parse_html] - Matched 1 paragraphs in http://www.uni-corvinus.hu/index.php?id=24978
open_web_scraper_1  | [crawlspider] 2020-04-14 13:41:18,629 - INFO - [parse_html] - Matched 1 paragraphs in http://www.uni-corvinus.hu/index.php?id=69140
open_web_scraper_1  | 2020-04-14 13:41:18,658 - INFO - Ignoring response <500 http://www.bcehok.hu/>: HTTP status code is not handled or not allowed - MainProcess httperror.py:55
open_web_scraper_1  | [crawlspider] 2020-04-14 13:41:19,198 - INFO - [parse_html] - Matched 1 paragraphs in http://www.uni-corvinus.hu/index.php?id=35072
open_web_scraper_1  | [crawlspider] 2020-04-14 13:41:19,268 - INFO - [parse_html] - Matched 1 paragraphs in http://www.uni-corvinus.hu/index.php?id=35071
open_web_scraper_1  | [crawlspider] 2020-04-14 13:41:20,677 - INFO - [parse_html] - Matched 1 paragraphs in http://www.uni-corvinus.hu/index.php?id=35069
open_web_scraper_1  | 2020-04-14 13:41:22,871 - INFO - Ignoring response <500 http://www.uni-corvinus.hu/index.php?id=35070>: HTTP status code is not handled or not allowed - MainProcess httperror.py:55
open_web_scraper_1  | 2020-04-14 13:41:30,471 - INFO - Crawled 211 pages (at 90 pages/min), scraped 178 items (at 80 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:42:30,470 - INFO - Crawled 211 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:43:30,470 - INFO - Crawled 212 pages (at 1 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:44:30,470 - INFO - Crawled 212 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:45:30,470 - INFO - Crawled 212 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:45:43,883 - INFO - Ignoring response <504 https://palyazatkezelo.uni-corvinus.hu/>: HTTP status code is not handled or not allowed - MainProcess httperror.py:55
open_web_scraper_1  | 2020-04-14 13:46:08,694 - ERROR - Error downloading <GET http://vir-portal.uni-corvinus.hu/robots.txt>: TCP connection timed out: 110: Connection timed out. - MainProcess robotstxt.py:95
open_web_scraper_1  | Traceback (most recent call last):
open_web_scraper_1  |   File "/usr/local/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
open_web_scraper_1  |     defer.returnValue((yield download_func(request=request, spider=spider)))
open_web_scraper_1  | twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out.
open_web_scraper_1  | 2020-04-14 13:46:30,470 - INFO - Crawled 213 pages (at 1 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:47:30,472 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
rabbitmq_1          | 2020-04-14 13:39:29.422 [info] <0.2594.0> connection <0.2594.0> (172.22.0.3:37128 -> 172.22.0.2:5672): user 'guest' authenticated and granted access to vhost '/'
open_web_scraper_1  | 2020-04-14 13:48:30,470 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:49:30,471 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:50:30,471 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:51:30,470 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:52:30,471 - INFO - Crawled 213 pages (at 0 pages/min), scraped 178 items (at 0 items/min) - MainProcess logstats.py:48
open_web_scraper_1  | 2020-04-14 13:52:42,011 - ERROR - Error downloading <GET http://vir-portal.uni-corvinus.hu/default.aspx> - MainProcess scraper.py:209
open_web_scraper_1  | Traceback (most recent call last):
open_web_scraper_1  |   File "/usr/local/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
open_web_scraper_1  |     defer.returnValue((yield download_func(request=request, spider=spider)))
open_web_scraper_1  | twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out.
open_web_scraper_1  | 2020-04-14 13:52:42,115 - INFO - Closing spider (finished) - MainProcess engine.py:296
open_web_scraper_1  | [crawlspider] 2020-04-14 13:52:42,116 - INFO -  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ CLOSING SPIDER www.uni-corvinus.hu_index.php ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
open_web_scraper_1  | 2020-04-14 13:52:42,126 - INFO - Dumping Scrapy stats:
open_web_scraper_1  | {'downloader/exception_count': 12,
open_web_scraper_1  |  'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 5,
open_web_scraper_1  |  'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 1,
open_web_scraper_1  |  'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 6,
open_web_scraper_1  |  'downloader/req - MainProcess statscollectors.py:47
open_web_scraper_1  | 2020-04-14 13:52:42,129 - INFO - Spider closed (finished) - MainProcess engine.py:327
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:42,160 - INFO - <shared.CrawlSpecification object at 0x7f23b1b9d790>
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:42,162 - INFO - Finalizing paragraph crawl ...
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:42,173 - INFO - data: {'raw': False, 'crawl': 'HU', 'url': 'www.uni-corvinus.hu_index.php', 'filename': 'www.uni-corvinus.hu_index.php'}
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:42,536 - INFO - Result size: 1787610
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:42,540 - INFO - One message required!
open_web_scraper_1  | 2020-04-14 13:52:43,040 - INFO - Trying to establish connection to rabbitmq at rabbitmq... - MainProcess establish_rmq_connection.py:48
open_web_scraper_1  | 2020-04-14 13:52:43,051 - INFO - Pika version 1.0.1 connecting to ('172.22.0.2', 5672) - MainProcess connection_workflow.py:179
open_web_scraper_1  | 2020-04-14 13:52:43,057 - INFO - Socket connected: <socket.socket fd=21, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.22.0.3', 37442), raddr=('172.22.0.2', 5672)> - MainProcess io_services_utils.py:345
open_web_scraper_1  | 2020-04-14 13:52:43,062 - INFO - Streaming transport linked up: (<pika.adapters.utils.io_services_utils._AsyncPlaintextTransport object at 0x7f23a99b54d0>, _StreamingProtocolShim: <SelectConnection PROTOCOL transport=<pika.adapters.utils.io_services_utils._AsyncPlaintextTransport object at 0x7f23a99b54d0> params=<ConnectionParamete - MainProcess connection_workflow.py:428
rabbitmq_1          | 2020-04-14 13:47:51.329 [error] <0.1999.0> closing AMQP connection <0.1999.0> (172.22.0.4:48590 -> 172.22.0.2:5672):
rabbitmq_1          | missed heartbeats from client, timeout: 300s
rabbitmq_1          | 2020-04-14 13:52:43.067 [info] <0.3401.0> accepting AMQP connection <0.3401.0> (172.22.0.3:37442 -> 172.22.0.2:5672)
open_web_scraper_1  | 2020-04-14 13:52:43,083 - INFO - AMQPConnector - reporting success: <SelectConnection OPEN transport=<pika.adapters.utils.io_services_utils._AsyncPlaintextTransport object at 0x7f23a99b54d0> params=<ConnectionParameters host=rabbitmq port=5672 virtual_host=/ ssl=False>> - MainProcess connection_workflow.py:293
open_web_scraper_1  | 2020-04-14 13:52:43,086 - INFO - AMQPConnectionWorkflow - reporting success: <SelectConnection OPEN transport=<pika.adapters.utils.io_services_utils._AsyncPlaintextTransport object at 0x7f23a99b54d0> params=<ConnectionParameters host=rabbitmq port=5672 virtual_host=/ ssl=False>> - MainProcess connection_workflow.py:725
open_web_scraper_1  | 2020-04-14 13:52:43,087 - INFO - Connection workflow succeeded: <SelectConnection OPEN transport=<pika.adapters.utils.io_services_utils._AsyncPlaintextTransport object at 0x7f23a99b54d0> params=<ConnectionParameters host=rabbitmq port=5672 virtual_host=/ ssl=False>> - MainProcess blocking_connection.py:454
open_web_scraper_1  | 2020-04-14 13:52:43,089 - INFO - Connection to rabbitmq established at rabbitmq with hearbeart interval of 300. - MainProcess establish_rmq_connection.py:54
open_web_scraper_1  | 2020-04-14 13:52:43,093 - INFO - Created channel=1 - MainProcess blocking_connection.py:1248
open_web_scraper_1  | 2020-04-14 13:52:43,201 - INFO - Message sent to queue results with routing key results and body {"raw": false, "crawl": "HU", "url": "www.uni-corvinus.hu_index.php", "filename": "www.uni-corvinus.hu_index.php", "data": "url;content;par_language;page_language;origin;depth\nhttp://www.uni-corvinus.hu/index.php?id=35058;\"\n\n\r\n\r\ - MainProcess producer.py:66
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,208 - INFO - Clearing directory result_data.
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,223 - INFO - Clearing directory result_logs.
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,258 - WARNING - Failed to delete result_logs/HU. Reason: [Errno 26] Text file busy: 'scrapy.log'
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,265 - WARNING - Failed to delete result_logs/scrapy.log. Reason: [Errno 26] Text file busy: 'result_logs/scrapy.log'
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,291 - WARNING - Failed to delete result_logs/scrapy_wrapper.log. Reason: [Errno 26] Text file busy: 'result_logs/scrapy_wrapper.log'
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,317 - WARNING - Failed to delete result_logs/www.uni-corvinus.hu_index.php.log. Reason: [Errno 26] Text file busy: 'result_logs/www.uni-corvinus.hu_index.php.log'
open_web_scraper_1  | [RemoteCrawlFinalizer] 2020-04-14 13:52:43,345 - INFO - Done finalizing paragraph crawl.
open_web_scraper_1  | 2020-04-14 13:52:43,359 - INFO - Scrapy worker finished: True - MainProcess task_consumer.py:31
open_web_scraper_1  | 2020-04-14 13:52:43,359 - INFO - Unsubscribing task queue. - MainProcess task_consumer.py:33
open_web_scraper_1  | 2020-04-14 13:52:43,392 - INFO - Finished processing task - MainProcess task_consumer.py:37

Best regards, Julian

MaxPensel commented 4 years ago

Hi Julian, does the issue occur consistently, i.e. can you reproduce it reliably, e.g. when crawling the same site? Can you also rule out problems with your internet connection? It is not yet clear to me whether the problem lies in OWS-scrapy-wrapper or in scrapy itself, and whether we could even hook into scrapy deeply enough to fix the error (if there is one at all).
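
Judging by the log, the stall brackets exactly the two requests to vir-portal.uni-corvinus.hu that end in "TCP connection timed out: 110", and the final stats dump counts 6 TCPTimedOutError exceptions, which with scrapy's default of two retries per request would fit those two URLs. If that reading is correct, the crawler is simply waiting out scrapy's default timeouts on an unresponsive host. A minimal sketch of the standard scrapy settings that should shorten that tail (illustrative values, not tested against OWS-scrapy-wrapper; they would go into the wrapper's scrapy settings or a spider's custom_settings):

# Illustrative values only, not tested against OWS-scrapy-wrapper.
DOWNLOAD_TIMEOUT = 30       # default is 180 s; give up on unresponsive hosts sooner
RETRY_ENABLED = True        # the default
RETRY_TIMES = 1             # default is 2; fewer retries shorten the worst-case tail
CLOSESPIDER_TIMEOUT = 7200  # optional hard cap on total crawl runtime in seconds (default 0 = off)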

Can you tell whether any content gets "skipped" as a result, i.e. whether pages end up missing from the crawl data, or is it more or less just a pause?
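
One way to check (a generic scrapy sketch, not OWS-scrapy-wrapper code; spider name, URL and method names are placeholders) would be to attach an errback to each request, so that every request that ultimately fails is logged and can be compared against the crawl results:

import scrapy
from twisted.internet.error import DNSLookupError, TCPTimedOutError

class AuditSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate the errback mechanism.
    name = "audit"
    start_urls = ["http://www.uni-corvinus.hu/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        pass  # normal extraction would happen here

    def on_error(self, failure):
        # failure.request is the request that could not be completed
        if failure.check(DNSLookupError, TCPTimedOutError):
            self.logger.warning("Dropped (network error): %s", failure.request.url)
        else:
            self.logger.warning("Dropped (%s): %s",
                                failure.type.__name__, failure.request.url)

Note that the "Ignoring response <500/504 ...>" lines are a separate matter: scrapy's HttpError middleware drops non-2xx responses by design, so those pages will be missing from the results regardless of the hang.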

Regards, Max