Scrapy: Returning a Request object in the Spider.parse works

Description

Scrapy: Returning a Request object in the Spider.parse works.
Basically, the whole process of mapping between Apify and Scrapy Requests was improved - no information should be lost.

Issue

Closes #218

Testing

It was tested with the extended TitleSpider, both locally and on the platform.

from typing import Generator, Union
from urllib.parse import urljoin
from scrapy import Request, Spider
from scrapy.responsetypes import Response
from apify import Actor
from ..items import TitleItem

class TitleSpider(Spider):
    """
    Scrapes title pages and enqueues all links found on the page.
    """

    name = 'title_spider'
    start_urls = ['https://apify.com']

    def parse(self, response: Response) -> Generator[Union[TitleItem, Request], None, None]:
        """
        Parse the web page response.

        Args:
            response: The web page response.

        Yields:
            Yields scraped TitleItem and Requests for links.
        """
        Actor.log.info(f'TitleSpider is parsing {response}...')

        # Extract and yield the TitleItem
        url = response.url
        title = response.css('title::text').extract_first()
        yield TitleItem(url=url, title=title, parsed_by='parse')

        # Extract all links from the page, create Requests out of them, and yield them
        for link_href in response.css('a::attr("href")'):
            link_url = urljoin(response.url, link_href.get())
            if link_url.startswith(('http://', 'https://')):
                yield Request(link_url, callback=self.second_parse)

    def second_parse(self, response: Response) -> Generator[Union[TitleItem, Request], None, None]:
        Actor.log.info(f"TitleSpider's second_parse is parsing {response}...")
        # Extract and yield the TitleItem
        url = response.url
        title = response.css('title::text').extract_first()
        yield TitleItem(url=url, title=f'{title}', parsed_by='second_parse')

apify / actor-templates

Scrapy: Returning a Request object in the Spider.parse works #220

Description

Issue

Testing