apify / apify-sdk-python

The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like actor lifecycle management, local storage emulation, and actor event handling.
https://docs.apify.com/sdk/python
Apache License 2.0

Improve unique key generation logic #193

Closed vdusek closed 6 months ago

vdusek commented 6 months ago

Description

Request Queue

Scrapy integration

Issues

Testing

Unit tests

Manual testing / execution

The YieldPostSpider tests the case of multiple POST requests to the same URL.

# spiders/yield_post.py
import json
from typing import Generator, cast

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse

class YieldPostSpider(BaseSpider):
    name = 'yield-post'

    def start_requests(self) -> Generator[Request, None, None]:
        for number in range(3):
            yield Request(
                'https://httpbin.org/post',
                method='POST',
                body=json.dumps(dict(code=f'CODE{number:0>4}', rate=number)),
                headers={'Content-Type': 'application/json'},
            )

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        data = json.loads(cast(dict, response.json())['data'])
        yield data
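For context, a deduplication key that tells these requests apart has to take the method and payload into account, not just the URL. The sketch below is a hypothetical illustration of that idea (`compute_unique_key` is an invented helper, not the SDK's actual API):

```python
import hashlib


def compute_unique_key(url: str, method: str = 'GET', payload: bytes = b'') -> str:
    """Sketch: derive a unique key from the URL, HTTP method, and payload."""
    if method == 'GET' and not payload:
        # Plain GET requests can be deduplicated by URL alone.
        return url
    # Hash the payload so POST requests to the same URL with different
    # bodies get distinct keys.
    digest = hashlib.sha256(payload).hexdigest()[:8]
    return f'{method}({digest}):{url}'
```

With such a scheme, the three POST requests yielded by the spider above would each get a distinct key and none of them would be dropped as a duplicate.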

The DontFilterSpider tests the case of Scrapy requests with the dont_filter option.

# spiders/dont_filter.py
from typing import Generator

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse

class DontFilterSpider(BaseSpider):
    name = 'dont-filter'

    def start_requests(self) -> Generator[Request, None, None]:
        for _ in range(3):
            yield Request('https://httpbin.org/get', method='GET', dont_filter=True)

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        yield {'something': True}
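Requests marked with dont_filter must never collide in the request queue, even though they target an identical URL. One way to sketch that (again with a hypothetical `compute_unique_key` helper, not the SDK's actual implementation) is to append a random suffix to the key:

```python
from uuid import uuid4


def compute_unique_key(url: str, *, dont_filter: bool = False) -> str:
    """Sketch: requests marked dont_filter must always be enqueued."""
    if dont_filter:
        # A random suffix guarantees the key never matches an existing one,
        # so the request bypasses deduplication entirely.
        return f'{url}|{uuid4()}'
    return url
```

Under this scheme, each of the three identical GET requests yielded above receives its own key and all three are processed.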

And src/main.py to execute these spiders with Apify:

from __future__ import annotations

from scrapy.crawler import CrawlerProcess

from apify import Actor
from apify.scrapy.utils import apply_apify_settings

from .spiders.yield_post import YieldPostSpider as Spider
# from .spiders.dont_filter import DontFilterSpider as Spider

async def main() -> None:
    """Apify Actor main coroutine for executing the Scrapy spider."""
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Apply Apify settings; they override the Scrapy project settings
        settings = apply_apify_settings()

        # Execute the spider using Scrapy CrawlerProcess
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()

Execute them using Scrapy:

scrapy crawl 'dont-filter' -o dont_filter_output.json
scrapy crawl 'yield-post' -o yield_post_output.json

And using Apify (the Spider import in src/main.py has to be changed manually):

apify run --purge

Both runs produce the same output :tada:.
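Since Scrapy makes no ordering guarantee across runs, comparing the two output files item by item can be done order-insensitively. A small sketch of such a check (the file paths match the commands above; `same_items` is an invented helper):

```python
import json


def same_items(path_a: str, path_b: str) -> bool:
    """Compare two Scrapy JSON output files, ignoring item order."""
    with open(path_a) as file_a, open(path_b) as file_b:
        items_a, items_b = json.load(file_a), json.load(file_b)
    # Canonicalize each item so dict key order does not matter either.
    canonical = lambda item: json.dumps(item, sort_keys=True)
    return sorted(map(canonical, items_a)) == sorted(map(canonical, items_b))
```

Usage would then be `same_items('yield_post_output.json', 'apify_output.json')` for each spider's pair of runs.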

vdusek commented 6 months ago

@fnesveda Just FYI; not requesting a review from you, taking into account the current situation, so I added Jirka as a reviewer instead.