apify / apify-sdk-python

The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like actor lifecycle management, local storage emulation, and actor event handling.
https://docs.apify.com/sdk/python
Apache License 2.0

Improve unique key generation logic #193

Closed vdusek closed 6 months ago

vdusek commented 6 months ago

Description

Request Queue

Scrapy integration

Issues

Testing

Unit tests

Manual testing / execution

The YieldPostSpider tests the case of multiple POST requests to the same URL.

# spiders/yield_post.py
import json
from typing import Generator, cast

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse

class YieldPostSpider(BaseSpider):
    name = 'yield-post'

    def start_requests(self) -> Generator[Request, None, None]:
        for number in range(3):
            yield Request(
                'https://httpbin.org/post',
                method='POST',
                body=json.dumps(dict(code=f'CODE{number:0>4}', rate=number)),
                headers={'Content-Type': 'application/json'},
            )

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        data = json.loads(cast(dict, response.json())['data'])
        yield data
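For context, a deduplication key that tells these requests apart has to take the method and payload into account, not just the URL. The sketch below is a hypothetical illustration of that idea (`compute_unique_key` is an invented helper, not the SDK's actual API):

```python
import hashlib


def compute_unique_key(url: str, method: str = 'GET', payload: bytes = b'') -> str:
    """Sketch: derive a unique key from the URL, HTTP method, and payload."""
    if method == 'GET' and not payload:
        # Plain GET requests can be deduplicated by URL alone.
        return url
    # Hash the payload so POST requests to the same URL with different
    # bodies get distinct keys.
    digest = hashlib.sha256(payload).hexdigest()[:8]
    return f'{method}({digest}):{url}'
```

With such a scheme, the three POST requests yielded by the spider above would each get a distinct key and none of them would be dropped as a duplicate.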

The DontFilterSpider tests the case of Scrapy requests with the dont_filter option.

# spiders/dont_filter.py
from typing import Generator

from scrapy import Request, Spider as BaseSpider
from scrapy.http import TextResponse

class DontFilterSpider(BaseSpider):
    name = 'dont-filter'

    def start_requests(self) -> Generator[Request, None, None]:
        for _ in range(3):
            yield Request('https://httpbin.org/get', method='GET', dont_filter=True)

    def parse(self, response: TextResponse) -> Generator[dict, None, None]:
        yield {'something': True}
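Requests marked with dont_filter must never collide in the request queue, even though they target an identical URL. One way to sketch that (again with a hypothetical `compute_unique_key` helper, not the SDK's actual implementation) is to append a random suffix to the key:

```python
from uuid import uuid4


def compute_unique_key(url: str, *, dont_filter: bool = False) -> str:
    """Sketch: requests marked dont_filter must always be enqueued."""
    if dont_filter:
        # A random suffix guarantees the key never matches an existing one,
        # so the request bypasses deduplication entirely.
        return f'{url}|{uuid4()}'
    return url
```

Under this scheme, each of the three identical GET requests yielded above receives its own key and all three are processed.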

And src/main.py to execute these spiders with Apify:

from __future__ import annotations

from scrapy.crawler import CrawlerProcess

from apify import Actor
from apify.scrapy.utils import apply_apify_settings

from .spiders.yield_post import YieldPostSpider as Spider
# from .spiders.dont_filter import DontFilterSpider as Spider

async def main() -> None:
    """Apify Actor main coroutine for executing the Scrapy spider."""
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Apply Apify settings; they override the Scrapy project settings
        settings = apply_apify_settings()

        # Execute the spider using Scrapy CrawlerProcess
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()

Execute them using Scrapy:

scrapy crawl 'dont-filter' -o dont_filter_output.json
scrapy crawl 'yield-post' -o yield_post_output.json

And using Apify (the Spider import in src/main.py has to be changed manually):

apify run --purge

Both runs produce the same output :tada:.
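Since Scrapy makes no ordering guarantee across runs, comparing the two output files item by item can be done order-insensitively. A small sketch of such a check (the file paths match the commands above; `same_items` is an invented helper):

```python
import json


def same_items(path_a: str, path_b: str) -> bool:
    """Compare two Scrapy JSON output files, ignoring item order."""
    with open(path_a) as file_a, open(path_b) as file_b:
        items_a, items_b = json.load(file_a), json.load(file_b)
    # Canonicalize each item so dict key order does not matter either.
    canonical = lambda item: json.dumps(item, sort_keys=True)
    return sorted(map(canonical, items_a)) == sorted(map(canonical, items_b))
```

Usage would then be `same_items('yield_post_output.json', 'apify_output.json')` for each spider's pair of runs.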

vdusek commented 6 months ago

@fnesveda Just FYI; not requesting a review from you, taking into account the current situation, so I added Jirka as a reviewer instead.