fabienvauchelles / scrapoxy

500 Internal Server Error with scrapy/splash/scrapoxy #222

Open devitdc opened 5 months ago

devitdc commented 5 months ago

Current Behavior

Hi, I use Scrapy (2.8.0), Scrapoxy (with the Docker image fabienvauchelles/scrapoxy:latest) and Splash (3.5) to scrape data, but I get a 500 Internal Server Error as soon as Splash is involved. To illustrate the error I use the website https://quotes.toscrape.com/login

Scrapy is running on macOS on host 192.168.0.12. Scrapoxy is running as a Docker container on Debian 11.9 on host 192.168.0.103. Splash is running as a Docker container on Debian 11.9 on host 192.168.0.102.

Scrapy settings.py configuration:

# Scrapoxy setup
CONCURRENT_REQUESTS_PER_DOMAIN = 1
RETRY_TIMES = 0

SCRAPOXY_MASTER = "http://192.168.0.103:8888"
SCRAPOXY_API = "http://192.168.0.103:8890/api"
SCRAPOXY_USERNAME = "username"
SCRAPOXY_PASSWORD = "password"

SCRAPOXY_BLACKLIST_HTTP_STATUS_CODES = [400, 429, 503]
SCRAPOXY_SLEEP_MIN = 60
SCRAPOXY_SLEEP_MAX = 180
# End Scrapoxy setup

# Splash setup
SPLASH_URL = 'http://192.168.0.102:8050'
# End Splash setup

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

ROBOTSTXT_OBEY = False

SPIDER_MIDDLEWARES = {
    "scrapoxy.StickySpiderMiddleware": 101,
}

DOWNLOADER_MIDDLEWARES = {
    # scrapoxy middleware
    'scrapoxy.ProxyDownloaderMiddleware': 100,
    'scrapoxy.BlacklistDownloaderMiddleware': 101,
    ###################
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 300,
    ###################
    # splash middleware
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    ###################
}

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
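
The master endpoint and the project credentials above can also be exercised outside of Scrapy with a plain proxied request. A minimal sketch, assuming the requests library and the SCRAPOXY_MASTER / SCRAPOXY_USERNAME / SCRAPOXY_PASSWORD values from the settings:

# Sanity-check sketch: fetch the target directly through the Scrapoxy master,
# using the SCRAPOXY_* values from settings.py above.
import requests

proxies = {"http": "http://username:password@192.168.0.103:8888"}

resp = requests.get(
    "http://quotes.toscrape.com/login",
    proxies=proxies,
    timeout=30,
)
print(resp.status_code)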

Scrapy spider:

import scrapy
from scrapy_splash import SplashRequest

class SplashloginquotesSpider(scrapy.Spider):
    name = "splashLoginQuotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_url = "http://quotes.toscrape.com/login"
    lua_code = '''
    function main(splash, args)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(args.url))
        assert(splash:wait(2))
        assert(splash:set_viewport_full())

        form = splash:select('form[action="/login"]')
        token = splash:select('input[name="csrf_token"]').value
        values = {
            csrf_token = token,
            username = 'demo',
            password = 'demo'
        }
        assert(form:fill(values))
        assert(form:submit())
        assert(splash:wait(2))

        return {
            html = splash:html(),
            png = splash:png(),
            har = splash:har(),
            cookies = splash:get_cookies(),
        }
    end
    '''

    def start_requests(self):
        yield SplashRequest(
            url=self.start_url,
            callback=self.parse,
            endpoint="execute",
            args={
                'width': 1000,
                'lua_source': self.lua_code,
                'url': self.start_url,
            },
        )

    def parse(self, response):
        quotes = response.xpath("//div[@class='quote']") 

        for quote in quotes:
            quote_text = quote.xpath(".//span[@class='text']/text()").get()
            yield {
                'quote': quote_text
            }

Expected Behavior

Everything works with Scrapy and Scrapoxy alone, and everything works with Scrapy and Splash alone.

But the aim is to be able to use Scrapy, Scrapoxy and Splash together in the same Scrapy project, roughly as sketched below.
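
To make the intended combination concrete, here is a rough sketch (not a working solution, just the shape I am aiming for): Splash renders the page while its own outgoing requests are routed through the Scrapoxy master via splash:on_request / request:set_proxy, using the host, port and credentials from settings.py above.

import scrapy
from scrapy_splash import SplashRequest

# Sketch only: 192.168.0.103:8888 and username/password are the SCRAPOXY_* values
# from settings.py; request:set_proxy is the Splash scripting call used to send
# Splash's outgoing traffic through the Scrapoxy master.
LUA_WITH_PROXY = '''
function main(splash, args)
    splash:on_request(function(request)
        request:set_proxy{
            host = "192.168.0.103",
            port = 8888,
            username = "username",
            password = "password",
            type = "http"
        }
    end)

    assert(splash:go(args.url))
    assert(splash:wait(2))
    return { html = splash:html() }
end
'''

class ProxiedSplashSketchSpider(scrapy.Spider):
    name = "proxiedSplashSketch"

    def start_requests(self):
        url = "http://quotes.toscrape.com/login"
        yield SplashRequest(
            url=url,
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": LUA_WITH_PROXY, "url": url},
        )

    def parse(self, response):
        self.logger.info("Rendered %d characters", len(response.text))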

Steps to Reproduce

I use OVH Public Cloud with 6 proxies.

Failure Logs

Scrapoxy log:
ERROR [MasterService] request_error: socket hang up from proxy 133cbcd6-f593-4853-8469-14525945484c:5283b824-e59c-4bb0-b701-c4b291dad8ae (POST http://192.168.0.102:8050/execute)

Scrapy log:
2024-02-14 23:12:33 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://quotes.toscrape.com/login via http://192.168.0.102:8050/execute> (failed 1 times): 500 Internal Server Error
2024-02-14 23:12:33 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://quotes.toscrape.com/login via http://192.168.0.102:8050/execute> (referer: None)
2024-02-14 23:12:33 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://192.168.0.102:8050/execute>: HTTP status code is not handled or not allowed

Splash log:
2024-02-14 19:23:46.858428 [-] "192.168.0.12" - - [14/Feb/2024:19:23:45 +0000] "POST /execute HTTP/1.1" 400 311 "http://192.168.0.102:8050/info?wait=0.5&images=1&expand=1&timeout=90.0&url=http%3A%2F%2Fquotes.toscrape.com%2Flogin&lua_source=function+main%28splash%2C+args%29%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%281%29%29%0D%0A++assert%28splash%3Aset_viewport_full%28%29%29%0D%0A++%0D%0A++splash%3Aon_request%28function%28request%29%0D%0A++++++request%3Aset_proxy%7B%0D%0A++++++++host+%3D+%22192.168.0.102%22%2C%0D%0A++++++++port+%3D+8888%2C%0D%0A++++++++username+%3D+%27mik4rrlmtlfz8o4q0wk6y%27%2C%0D%0A++++++++password+%3D+%27xa49grzmg4qewtbknxxs17%27%2C%0D%0A++++++++type+%3D+%27http%27%0D%0A++++++%7D%0D%0A++end%29%0D%0A++%0D%0A++--+On+r%C3%A9cup%C3%A8re+le+formulaire+--%0D%0A++form+%3D+splash%3Aselect%28%27form%5Baction%3D%22%2Flogin%22%5D%27%29%0D%0A++--+On+r%C3%A9cup%C3%A8re+la+valeur+du+token+csrf+--%0D%0A++token+%3D+splash%3Aselect%28%27input%5Bname%3D%22csrf_token%22%5D%27%29.value%0D%0A++--+On+d%C3%A9finit+les+%C3%A9l%C3%A9ments+%C3%A0+soumettre+au+formulaire+--%0D%0A++values+%3D+%7B%0D%0A++++csrf_token+%3D+token%2C%0D%0A++++username+%3D+%27demo%27%2C%0D%0A++++password+%3D+%27demo%27%0D%0A++%7D%0D%0A++--+On+remplit+le+formulaire+avec+les+donn%C3%A9es+--%0D%0A++assert%28form%3Afill%28values%29%29%0D%0A++--+On+envoie+le+formulaire+au+serveur+pour+se+connecter+--%0D%0A++assert%28form%3Asubmit%28%29%29%0D%0A%0D%0A++assert%28splash%3Await%282%29%29%0D%0A++%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
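
For readability, here is the URL-decoded lua_source from that Splash log entry (it was submitted through the Splash web UI at /info rather than by the spider; the French comments are translated). Note that the proxy it configures points at 192.168.0.102:8888, while SCRAPOXY_MASTER in settings.py is http://192.168.0.103:8888.

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(1))
  assert(splash:set_viewport_full())

  splash:on_request(function(request)
      request:set_proxy{
        host = "192.168.0.102",
        port = 8888,
        username = 'mik4rrlmtlfz8o4q0wk6y',
        password = 'xa49grzmg4qewtbknxxs17',
        type = 'http'
      }
  end)

  -- Grab the form --
  form = splash:select('form[action="/login"]')
  -- Get the csrf token value --
  token = splash:select('input[name="csrf_token"]').value
  -- Define the values to submit with the form --
  values = {
    csrf_token = token,
    username = 'demo',
    password = 'demo'
  }
  -- Fill the form with the data --
  assert(form:fill(values))
  -- Submit the form to the server to log in --
  assert(form:submit())

  assert(splash:wait(2))

  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har()
  }
end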

Scrapoxy Version: docker version
Custom Version: (not specified)
Deployment: (not specified)
Operating System: (not specified)
Storage: (not specified)
Additional Information: No response

fabienvauchelles commented 5 months ago

Ok thanks. I will try to reproduce.