apify / actor-templates

This project is the :house: home of Apify Actor templates that help users get started quickly.
https://apify.com/

Scrapy: Returning a `Request` object in the `Spider.parse` method does not work #218

Closed vdusek closed 10 months ago

vdusek commented 11 months ago

Problem description

Scrapy users have the option of returning a `scrapy.http.Request` object from the `scrapy.Spider.parse` method. The returned `Request` is then scheduled and processed later. A callback function can also be passed to the `Request` constructor to specify an alternative method for processing its response.

However, when the Actor is executed on the Apify platform, a returned `Request` is not processed, because it never goes through the Request Queue. We would have to use `RequestQueue.add_request` to enqueue the `Request` for later processing.
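For illustration, this is roughly what the integration has to do under the hood. A minimal sketch using the Apify Python SDK's `Actor.open_request_queue()` and `RequestQueue.add_request()`; the `userData` label is a hypothetical example, not part of the template:

```python
# Sketch: enqueueing a URL into the Apify Request Queue so it is processed
# later on the platform, instead of yielding a scrapy.Request directly.
from apify import Actor

async def main() -> None:
    async with Actor:
        request_queue = await Actor.open_request_queue()
        # Plain data only: a Scrapy `callback` (a bound method) cannot be
        # serialized into the queue, which is part of the problem above.
        await request_queue.add_request({
            'url': 'https://apify.com',
            'userData': {'label': 'second_parse'},  # hypothetical label
        })
```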

How to replicate it

```python
# src/items.py

from scrapy import Field, Item


class TitleItem(Item):
    """Represents a title item scraped from a web page."""

    url = Field()
    title = Field()
    parsed_by = Field()
```

```python
# src/spiders/title.py

from typing import Generator, Union
from urllib.parse import urljoin

from scrapy import Request, Spider
from scrapy.responsetypes import Response

from apify import Actor

from ..items import TitleItem

class TitleSpider(Spider):

    name = 'title'
    start_urls = ['https://apify.com']

    def parse(self, response: Response) -> Generator[Union[TitleItem, Request], None, None]:
        """
        Parse the web page response.

        Args:
            response: The web page response.

        Yields:
            Yields scraped TitleItem and Requests for links.
        """
        Actor.log.info(f'TitleSpider is parsing {response}...')

        # Extract and yield the TitleItem
        url = response.url
        title = response.css('title::text').extract_first()
        yield TitleItem(url=url, title=title, parsed_by='parse')

        # Extract all links from the page, create Requests out of them, and yield them
        for link_href in response.css('a::attr("href")'):
            link_url = urljoin(response.url, link_href.get())
            if link_url.startswith(('http://', 'https://')):
                yield Request(link_url, callback=self.second_parse)

    def second_parse(self, response: Response) -> Generator[Union[TitleItem, Request], None, None]:
        Actor.log.info(f"TitleSpider's second_parse is parsing {response}...")
        # Extract and yield the TitleItem
        url = response.url
        title = response.css('title::text').extract_first()
        yield TitleItem(url=url, title=f'{title}', parsed_by='second_parse')
```


vdusek commented 10 months ago

Let's solve the problem with the `Spider.start_requests()` method (https://github.com/scrapy/scrapy/blob/2.11/scrapy/spiders/__init__.py#L68:L76) as part of this issue, since it's closely related. So I'm going to increase the estimate.

vdusek commented 10 months ago

There is also a problem with returning a `Request` object from `Spider.parse` with a callback, or, to be more precise, with any other parameters: they are lost during the transformation from a Scrapy `Request` into an Apify `Request`. On top of that, some `Request` fields (e.g. `callback`) cannot even be stored in the Request Queue. Let's try to solve this problem as part of this issue as well.
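For context, one possible workaround for the non-serializable `callback` is to persist only the callback's name and resolve it on the spider when the request is read back. This is a hedged sketch with hypothetical helper names, not necessarily what the template implements:

```python
# Sketch: round-tripping a Scrapy Request through a plain-dict queue by
# storing the callback's name instead of the bound method itself.
from typing import Optional

from scrapy import Request, Spider


def to_queue_dict(request: Request) -> dict:
    """Convert a Scrapy Request into a JSON-serializable dict (hypothetical helper)."""
    callback_name: Optional[str] = request.callback.__name__ if request.callback else None
    return {'url': request.url, 'userData': {'callback_name': callback_name}}


def from_queue_dict(data: dict, spider: Spider) -> Request:
    """Rebuild a Scrapy Request, resolving the callback on the spider (hypothetical helper)."""
    callback_name = data['userData'].get('callback_name')
    callback = getattr(spider, callback_name) if callback_name else None
    return Request(url=data['url'], callback=callback)
```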

ThibaultBouveur commented 10 months ago

@vdusek Thanks a lot for working on this bug. I tested my code again and I don't think it works, despite the task being ticked as completed. Is there anything I am missing?

vdusek commented 10 months ago

@ThibaultBouveur I tried it again with the code from the issue description (the "How to replicate it" section) and it works for me. Are you using the new version of the python-scrapy template? If so, could you provide me with code so that I can replicate it? Thanks :slightly_smiling_face:.

ThibaultBouveur commented 10 months ago

@vdusek I think I messed something up; I will retry with the new template. The `npx apify-cli create` command will use the new template, right? Thanks a lot.

vdusek commented 10 months ago

@ThibaultBouveur Yeah, just install the Apify CLI (check the docs), run `apify create`, provide the name of your Actor, and select Python and Scrapy. Then copy your Scrapy files into the template and it should be ready to go :slightly_smiling_face:.

ThibaultBouveur commented 10 months ago

Ok, so I tried something with a third-level parse (see the code below), and with your fix the second callback works, but not the third one. Is using callbacks this way not the right approach?

```python
from typing import Generator
from urllib.parse import urljoin
import re
import scrapy
from scrapy.responsetypes import Response
import math
from apify import Actor

class TitleSpider(scrapy.Spider):
    name = 'title_spider'
    allowed_domains = ['primark.com']
    start_urls = [
        "https://www.primark.com/fr-fr/c/femme/vetements/pulls-et-gilets",
    ]

    def parse(self, response: Response) -> Generator[dict, None, None]:
        Actor.log.info(f'TitleSpider is parsing {response}...')
        numberArticles = response.css('div.MuiBox-root > p.MuiTypography-body2::text').get()
        if numberArticles:
            numeric_part = re.search(r'\d+', numberArticles)
            if numeric_part:
                cleanNumberArticles = int(numeric_part.group())
                numberPages = math.ceil(cleanNumberArticles / 24) + 1
                page = response.url
                allPages = [page]
                for i in range(2, numberPages):
                    allPages.append(page + '?page=' + str(i))
                for moreUrl in allPages:
                    yield scrapy.Request(dont_filter=True, url=moreUrl, callback=self._more_page)

    def _more_page(self, response: Response) -> Generator[dict, None, None]:
        Actor.log.info(f"TitleSpider's second_parse is parsing {response}...")
        articleContainer = response.css('div.MuiGrid-root.MuiGrid-container')
        individualContainer = articleContainer.css('div.MuiGrid-item')
        articleLinkContainer = individualContainer.css('a.MuiTypography-colorPrimary')
        articleLink = articleLinkContainer.css('a::attr(href)').getall()
        articlePages = []
        for i in articleLink:
            if i.startswith('/fr-fr/p/'):
                articlePages.append('https://www.primark.com' + i)
        articlePages = list(dict.fromkeys(articlePages))
        for urlarticle in articlePages:
            yield scrapy.Request(url=urlarticle, callback=self._parse_article)

    def _parse_article(self, response: Response) -> Generator[dict, None, None]:
        Actor.log.info(f"TitleSpider's third parse is parsing {response}...")
        productname = response.css('h1.MuiTypography-root.MuiTypography-body1::text').get()
        description = response.css('h5.MuiTypography-root.MuiTypography-body1::text').get()
        prix = response.css('p.MuiTypography-root.MuiTypography-body1::text').get()
        color = response.css('span.MuiTypography-root.MuiTypography-body2::text').get()
        breadcrumpContainer = response.css('li.MuiBreadcrumbs-li > a:first-child::text').getall()
        gender = breadcrumpContainer[0]
        firstCategorie = breadcrumpContainer[1]
        categorie = breadcrumpContainer[2]
        link = response.url
        imageContainer = response.css('div.jss1088 > img').getall()

        yield {
            'productname': productname,
            'description': description,
            'prix': prix,
            'color': color,
            'gender': gender,
            'firstCategorie': firstCategorie,
            'categorie': categorie,
            'link': link,
            'imageContainer': imageContainer,
        }
```

vdusek commented 10 months ago

@ThibaultBouveur Thank you for pointing that out. I've replicated it, and you are correct: the requests generated by the second-level parse function are not being processed. After a short investigation and a review of the logs, it appears that when the second-level parse function yields a new request, the `Scheduler.enqueue_request` method is not being called. I've created a new issue, https://github.com/apify/actor-templates/issues/224, to address this, and I'm going to work on it.
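For readers unfamiliar with the mechanism: the integration replaces Scrapy's scheduler, and `enqueue_request` is the hook through which every new request must pass. Below is a hedged sketch of that interface, with an in-memory list standing in for the Apify Request Queue; it is not the template's actual scheduler:

```python
# Sketch: the Scrapy scheduler interface involved in this bug. In the real
# integration, enqueue_request() pushes into the Apify Request Queue; if it
# is never called for deeper requests, they silently never get crawled.
from typing import Optional

from scrapy import Request
from scrapy.core.scheduler import BaseScheduler


class InMemoryScheduler(BaseScheduler):
    def __init__(self) -> None:
        self._queue: list = []  # stand-in for the Apify Request Queue

    def has_pending_requests(self) -> bool:
        return bool(self._queue)

    def enqueue_request(self, request: Request) -> bool:
        # This is the method that was not being invoked for requests
        # yielded by the second-level parse function.
        self._queue.append(request)
        return True

    def next_request(self) -> Optional[Request]:
        return self._queue.pop(0) if self._queue else None
```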

vdusek commented 9 months ago

@ThibaultBouveur Hi, just wanted to let you know that the issue https://github.com/apify/actor-templates/issues/224 has been resolved. The root cause was an inadvertent override of Scrapy's `DEPTH_LIMIT` option by the Actor input `max_depth` option in the template. This means the issue wasn't related to the second (and deeper) Spider parse functions, but rather to the crawling depth limit. We solved it by removing the `max_depth` Actor input option from the template completely.

Feel free to give it a try again; I believe your project should work now. Just remember to set Scrapy's `DEPTH_LIMIT` option, or keep the default, which is unlimited.
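For reference, a minimal sketch of how you could cap the crawl depth yourself with the standard Scrapy setting, shown here via `custom_settings` on an illustrative spider:

```python
# Sketch: limiting crawl depth with Scrapy's standard DEPTH_LIMIT setting.
import scrapy


class DepthLimitedSpider(scrapy.Spider):
    name = 'depth_limited'
    start_urls = ['https://apify.com']

    # 0 (Scrapy's default) means unlimited depth; a positive integer caps
    # how many levels of requests are followed from the start URLs.
    custom_settings = {'DEPTH_LIMIT': 3}

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
```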

Additionally, we've recently introduced a wrapping script for Scrapy projects as part of our CLI. It automatically wraps your existing Scrapy project with Apify-related files, transforming it into an Apify Actor. See the doc page Integrating Scrapy projects for more information.