alecxe / scrapy-fake-useragent

Random User-Agent middleware based on fake-useragent
MIT License

Middleware completely preventing scraper from starting/crawling #29

Closed caffeinatedMike closed 4 years ago

caffeinatedMike commented 4 years ago

I'm using the following settings in order to use only the Faker provider, since it will always generate a user-agent string. However, with these settings applied my spider will not crawl at all; I keep seeing 180-second timeouts.

Code

class RedactedSpider(scrapy.Spider):
    name = 'redacted'
    custom_settings = {
        # 'DOWNLOAD_DELAY': 0.075,
        'DOWNLOADER_MIDDLEWARES': {
            # Disable Scrapy's built-in user-agent and retry middlewares so the
            # scrapy-fake-useragent replacements below can take over
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
            'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
        },
        # Providers are tried in order; FixedUserAgentProvider falls back to USER_AGENT
        'FAKEUSERAGENT_PROVIDERS': [
            'scrapy_fake_useragent.providers.FakerProvider',
            'scrapy_fake_useragent.providers.FixedUserAgentProvider'
        ],
        'USER_AGENT': (
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
            'AppleWebKit/537.36 (KHTML, like Gecko) '
            'Chrome/78.0.3904.108 Safari/537.36'
        ),
        'FAKER_RANDOM_UA_TYPE': 'chrome',
        ........
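
The provider-chain behaviour these settings rely on can be illustrated with a small stdlib-only sketch. The classes below are hypothetical stand-ins, not the library's actual provider implementations; the point is the fallback semantics: each configured provider is tried in order, and the first one that yields a non-empty user-agent string wins.

```python
import random

class FakerProvider:
    """Stand-in for a provider that generates a UA locally (always succeeds)."""
    UAS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    ]

    def get_random_ua(self):
        return random.choice(self.UAS)

class BrokenProvider:
    """Stand-in for a provider whose online data source is unavailable."""
    def get_random_ua(self):
        return None

class FixedUserAgentProvider:
    """Stand-in for the last-resort fallback that returns a fixed UA string."""
    def __init__(self, fixed_ua):
        self.fixed_ua = fixed_ua

    def get_random_ua(self):
        return self.fixed_ua

def pick_user_agent(providers):
    # Walk the chain in order; the first non-empty user agent wins
    for provider in providers:
        ua = provider.get_random_ua()
        if ua:
            return ua
    return None

# If the first provider fails, the chain falls through to the fixed UA
chain = [BrokenProvider(), FixedUserAgentProvider('Mozilla/5.0 (fixed)')]
print(pick_user_agent(chain))  # → Mozilla/5.0 (fixed)
```

This is why pairing FakerProvider (or FixedUserAgentProvider) with an online-backed provider is attractive: the chain degrades gracefully instead of failing outright when a provider cannot produce a user agent.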

Logs

2020-07-27 18:13:46 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: [redacted])
2020-07-27 18:13:46 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.8 (tags/v3.7.8:4b47a5b6ba, Jun 28 2020, 08:53:46) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.19041-SP0
2020-07-27 18:13:46 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-27 18:13:46 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': '[redacted]',
 'FEED_EXPORT_FIELDS': [TRUNCATED],
 'LOG_FILE': 'runtime.log',
 'NEWSPIDER_MODULE': '[redacted].spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['[redacted].spiders']}
2020-07-27 18:13:46 [scrapy.extensions.telnet] INFO: Telnet Password: [redacted]
2020-07-27 18:13:46 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-07-27 18:13:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware',
 'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-27 18:13:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-27 18:13:46 [scrapy.middleware] INFO: Enabled item pipelines:
['[redacted].pipelines.LocalFilesPipeline',
 '[redacted].pipelines.NutrientMappingPipeline',
 '[redacted].pipelines.PartitionedCsvPipeline']
2020-07-27 18:13:46 [scrapy.core.engine] INFO: Spider opened
2020-07-27 18:13:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-27 18:13:46 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-27 18:14:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-27 18:15:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-27 18:16:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-27 18:16:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.[redacted].com/robots.txt> (failed 1 times): User timeout caused connection failure: Getting https://www.[redacted].com/robots.txt took longer than 180.0 seconds..
alecxe commented 4 years ago

Hey @caffeinatedMike. I have not yet uploaded the latest updates to PyPI. Does this happen when you install the package directly from GitHub master? Thanks.

caffeinatedMike commented 4 years ago

@alecxe I'll have to try it out in the morning, since I've already shut down my workstation for the night. Are there any major updates the PyPI version is missing that could cause this? Is this something that has been reported and fixed previously?

Sorry if I missed any previous commits or issues that raised this problem; I just stumbled upon this package today, so I haven't had a moment to dive deep into its history. I went with it over the other well-known Scrapy user-agent rotator because it offers a fallback option for when the online services aren't available.

alecxe commented 4 years ago

Just uploaded 1.3.0 to PyPI and, yes, it includes the changes to support multiple user-agent providers via FAKEUSERAGENT_PROVIDERS. Please see if you experience the same problems after upgrading to 1.3.0.
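
The two install paths discussed in this thread would look something like this (assuming a standard pip setup):

```shell
# Upgrade to the released version on PyPI (1.3.0 adds FAKEUSERAGENT_PROVIDERS support)
pip install --upgrade scrapy-fake-useragent

# Or install the latest unreleased code directly from the GitHub master branch
pip install --upgrade git+https://github.com/alecxe/scrapy-fake-useragent.git
```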