iamumairayub opened this issue 4 years ago
I plus-oned this and then solved it for myself a little later.
For me this is not in the context of testing, so I have no need for contracts (at least as far as I understand it).
My solve was the following: keep start_requests() (as you have done), but let parse() handle the first response and use parse_result() only as the callback for the extracted links. I notice that you use parse_result() instead of parse(). Once I did this it started working. My solution snippet:
# Imports the snippet needs (assumed at module level):
from scrapy.linkextractors import LinkExtractor
from scrapy_selenium import SeleniumRequest

def start_requests(self):
    if not self.start_urls and hasattr(self, 'start_url'):
        raise AttributeError(
            "Crawling could not start: 'start_urls' not found "
            "or empty (but found 'start_url' attribute instead, "
            "did you miss an 's'?)")
    # Drive the start URLs through Selenium instead of the default downloader
    for url in self.start_urls:
        yield SeleniumRequest(url=url, dont_filter=True)

def parse(self, response):
    # Follow every link on the page, again through Selenium
    le = LinkExtractor()
    for link in le.extract_links(response):
        yield SeleniumRequest(
            url=link.url,
            callback=self.parse_result
        )

def parse_result(self, response):
    page = PageItem()
    page['url'] = response.url
    yield page
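PageItem isn't shown above; a minimal sketch of such an item (assuming only the url field is needed) could be:

import scrapy

class PageItem(scrapy.Item):
    # single field holding the crawled page's URL
    url = scrapy.Field()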
Hey @undernewmanagement
I tried your snippet, but the links from LinkExtractor are not processed correctly (the response body is not text).
rules = (
    Rule(LinkExtractor(restrict_xpaths=['//*[@id="breadcrumbs"]']), follow=True),
)

def start_requests(self):
    for url in self.start_urls:
        yield SeleniumRequest(url=url, dont_filter=True)

def parse_start_url(self, response):
    return self.parse_result(response)

def parse(self, response):
    le = LinkExtractor()
    for link in le.extract_links(response):
        yield SeleniumRequest(url=link.url, callback=self.parse_result)

def parse_result(self, response):
    page = PageItem()
    page['url'] = response.url
    yield page
I had to use parse_start_url to route the start URLs to the parse_result callback.
Do you know what the problem could be? I'm new to Scrapy and Python.
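A quick check like this (a sketch; HtmlResponse is Scrapy's class for text responses) shows whether the callback is actually receiving something LinkExtractor can parse:

from scrapy.http import HtmlResponse

def parse(self, response):
    # LinkExtractor only works on text responses; log what arrived
    self.logger.info('response class: %s', type(response).__name__)
    if not isinstance(response, HtmlResponse):
        return  # non-text body; extract_links() would fail here
    for link in LinkExtractor().extract_links(response):
        yield SeleniumRequest(url=link.url, callback=self.parse_result)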
Thanks!
Hey @educatron, thanks for the question, but let's not hijack the thread here. I think you should take that question directly to the Scrapy community: https://scrapy.org/community/
Ok. Thanks!
@clemfromspace I just decided to use your package in my Scrapy project, but it is yielding normal scrapy.Request instead of SeleniumRequest.
I have seen this issue, but it is not helpful at all.
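For what it's worth, one common cause of SeleniumRequest falling through to a plain scrapy.Request is that the SeleniumMiddleware is not enabled. The scrapy-selenium README shows settings along these lines in settings.py (the driver name and executable here are placeholders for your own setup):

from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # run the browser headless

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}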