vdusek closed this issue 10 months ago
Let's solve the problem of the `Spider.start_requests()` method (https://github.com/scrapy/scrapy/blob/2.11/scrapy/spiders/__init__.py#L68:L76) as part of this issue, since it's quite connected. So I'm gonna increase the estimation.
There is also a problem with returning a `Request` object in `Spider.parse` with a `callback` - or, more precisely, with any other parameters: they are lost during the transformation from a Scrapy `Request` into an Apify `Request`. On top of that, some `Request` fields cannot even be stored in the Request Queue (e.g. `callback`). Let's try to solve this problem as part of this issue as well.
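A hedged sketch of one possible workaround, not the actual Apify integration: since a bound method cannot be serialized into a request queue, the callback's *name* can be persisted alongside the URL and resolved back to a method on the spider when the request is restored. All function and field names below are illustrative.

```python
def to_queue_dict(request) -> dict:
    """Convert a Scrapy-like request into a JSON-serializable dict."""
    callback = getattr(request, 'callback', None)
    return {
        'url': request.url,
        # Store the method's name; the bound method itself cannot be
        # stored in a request queue.
        'userData': {'callback_name': callback.__name__ if callback else None},
    }

def from_queue_dict(spider, data: dict) -> dict:
    """Restore the callback by looking its stored name up on the spider."""
    name = data['userData']['callback_name']
    return {
        'url': data['url'],
        'callback': getattr(spider, name) if name else None,
    }
```

This only works when the callback is a method of the spider itself, which is the common case in Scrapy projects.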
@vdusek Thanks a lot for working on this bug. I tested my code again and I don't think it works, despite the ticket being marked as completed. Is there anything I am missing?
@ThibaultBouveur I tried it again with the code from the issue description (the "How to replicate it" section) and it works for me. Are you using the new version of the python-scrapy template? If so, could you provide code with which I can replicate the problem? Thanks :slightly_smiling_face:.
@vdusek I think I messed up with the template; I will retry with the new one. The `npx apify-cli create` command will use the new template, right? Thanks a lot
@ThibaultBouveur Yeah, just install the Apify CLI (check the docs), run `apify create`, provide the name of your Actor, and select Python and Scrapy. Then copy your Scrapy files into the template and it should be ready to go :slightly_smiling_face:.
Ok, so I tried something with a third parse method (see the code below). With your fix, the second callback works, but not the third one. Is using callbacks this way not the right approach?
```python
from typing import Generator
from urllib.parse import urljoin
import re
import scrapy
from scrapy.http import Response
import math

from apify import Actor


class TitleSpider(scrapy.Spider):
    name = 'title_spider'
    allowed_domains = ['primark.com']
    start_urls = [
        'https://www.primark.com/fr-fr/c/femme/vetements/pulls-et-gilets',
    ]

    def parse(self, response: Response) -> Generator[dict, None, None]:
        Actor.log.info(f'TitleSpider is parsing {response}...')
        numberArticles = response.css('div.MuiBox-root > p.MuiTypography-body2::text').get()
        if numberArticles:
            numeric_part = re.search(r'\d+', numberArticles)
            if numeric_part:
                cleanNumberArticles = int(numeric_part.group())
                numberPages = math.ceil(cleanNumberArticles / 24) + 1
                page = response.url
                allPages = [page]
                for i in range(2, numberPages):
                    allPages.append(page + '?page=' + str(i))
                for moreUrl in allPages:
                    yield scrapy.Request(dont_filter=True, url=moreUrl, callback=self._more_page)

    def _more_page(self, response: Response) -> Generator[dict, None, None]:
        Actor.log.info(f"TitleSpider's second_parse is parsing {response}...")
        articleContainer = response.css('div.MuiGrid-root.MuiGrid-container')
        individualContainer = articleContainer.css('div.MuiGrid-item')
        articleLinkContainer = individualContainer.css('a.MuiTypography-colorPrimary')
        articleLink = articleLinkContainer.css('a::attr(href)').getall()
        articlePages = []
        for i in articleLink:
            if i.startswith('/fr-fr/p/'):
                articlePages.append('https://www.primark.com' + i)
        articlePages = list(dict.fromkeys(articlePages))
        for urlarticle in articlePages:
            yield scrapy.Request(url=urlarticle, callback=self._parse_article)

    def _parse_article(self, response: Response) -> Generator[dict, None, None]:
        Actor.log.info(f"TitleSpider's third parse is parsing {response}...")
        productname = response.css('h1.MuiTypography-root.MuiTypography-body1::text').get()
        description = response.css('h5.MuiTypography-root.MuiTypography-body1::text').get()
        prix = response.css('p.MuiTypography-root.MuiTypography-body1::text').get()
        color = response.css('span.MuiTypography-root.MuiTypography-body2::text').get()
        breadcrumpContainer = response.css('li.MuiBreadcrumbs-li > a:first-child::text').getall()
        gender = breadcrumpContainer[0]
        firstCategorie = breadcrumpContainer[1]
        categorie = breadcrumpContainer[2]
        link = response.url
        imageContainer = response.css('div.jss1088 > img').getall()
        yield {
            'productname': productname,
            'description': description,
            'prix': prix,
            'color': color,
            'gender': gender,
            'firstCategorie': firstCategorie,
            'categorie': categorie,
            'link': link,
            'imageContainer': imageContainer,
        }
```
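As an aside, the pagination arithmetic in `parse` above can be checked in isolation. Assuming 24 articles per listing page, this helper (the function name is mine, not from the spider) reproduces the URL list it builds:

```python
import math

def page_urls(base_url: str, article_count: int, per_page: int = 24) -> list:
    # Mirrors the spider's logic: ceil(article_count / per_page) pages in
    # total; the "+ 1" combined with range(2, n) generates page 2 up to the
    # last page, while the bare base_url serves as page 1.
    number_pages = math.ceil(article_count / per_page) + 1
    pages = [base_url]
    for i in range(2, number_pages):
        pages.append(f'{base_url}?page={i}')
    return pages
```

For example, 60 articles yield three page URLs: the base URL plus `?page=2` and `?page=3`.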
@ThibaultBouveur Thank you for pointing that out. I've replicated it, and you are correct: the requests generated by the second-level parse function are not being processed. After a short investigation and a review of the logs, it appears that when the second-level parse function yields a new request, the `Scheduler.enqueue_request` method is not called. I've created a new issue https://github.com/apify/actor-templates/issues/224 to address this, and I'm gonna work on it.
@ThibaultBouveur Hi, just want to let you know that the issue https://github.com/apify/actor-templates/issues/224 has been resolved. The root cause was the inadvertent override of Scrapy's `DEPTH_LIMIT` option by the Actor input `max_depth` option in the template. This means the issue wasn't related to the second (and deeper) Spider parse functions, but rather to the crawling limit. We solved it by removing the Actor input `max_depth` option from the template completely.
Feel free to give it a try again. I believe your project should work now. Just remember to set Scrapy's `DEPTH_LIMIT` option, or stay with the default, which is unlimited.
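If you do want to cap the crawl depth, Scrapy's own setting can be set in the project's `settings.py` (the value `3` here is just an example):

```python
# settings.py of the Scrapy project.
# 0 (the default) means unlimited depth.
DEPTH_LIMIT = 3
```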
Additionally, we've recently introduced a wrapping script for Scrapy projects as part of our CLI. It automatically wraps your existing Scrapy project with Apify-related files, transforming it into an Apify Actor. See the doc page Integrating Scrapy projects for more information.
Problem description
Scrapy users have the option of returning a `scrapy.http.Request` object from the `scrapy.Spider.parse` method. The returned `Request` will be parsed later. Also, a callback function can be provided to the `Request` constructor to specify an alternative function for its processing.
However, if such a `Request` is returned while executing on the Apify platform, it won't be processed, because it does not go through the Request Queue. We would have to use `RequestQueue.add_request` to prepare a `Request` for later processing.
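As a self-contained model (not the Apify SDK) of why the queue matters: a worker loop only ever sees requests that were explicitly added to the queue, so a request merely returned from a parse method, without an `add_request` call, is never picked up. The class and function names below are illustrative.

```python
from collections import deque

class RequestQueue:
    """In-memory stand-in for a platform request queue."""

    def __init__(self):
        self._queue = deque()

    def add_request(self, request: dict) -> None:
        # In the real integration this corresponds to
        # RequestQueue.add_request; here it just appends to a deque.
        self._queue.append(request)

    def fetch_next_request(self):
        return self._queue.popleft() if self._queue else None

def run(queue: RequestQueue, handler) -> list:
    """Drain the queue, letting the handler enqueue follow-up requests."""
    processed = []
    while (request := queue.fetch_next_request()) is not None:
        processed.append(request['url'])
        for follow_up in handler(request):
            queue.add_request(follow_up)
    return processed
```

A handler that returns follow-up requests without them being added to the queue would leave them unprocessed; the `run` loop above only re-enqueues them explicitly.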
How to replicate it
```python
from scrapy import Field, Item


class TitleItem(Item):
    """Represents a title item scraped from a web page."""
```
Execute the project using just Scrapy; it will work:
Execute the project using Apify; it will not work:
Tasks

- Returning a `Request` object in the `Spider.parse` method works (the `Request` will be scheduled).
- `start_urls` works.