apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0

Correct / recommended way of using `user_data` #563

Closed tlinhart closed 3 days ago

tlinhart commented 1 week ago

Since this PR was merged, I get type errors when working with `user_data`. Consider this sample:

import asyncio

from crawlee import Request
from crawlee._utils.urls import convert_to_absolute_url, is_url_absolute
from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()


@router.default_handler
async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        item = {"title": category.xpath("normalize-space()").get()}
        url = category.xpath("./@href").get()
        if url is not None:
            if not is_url_absolute(url):
                url = str(convert_to_absolute_url(context.request.url, url))
            request = Request.from_url(url, method="GET", label="detail")
            request.user_data["item"] = item  # <--- TYPE ERROR
            await context.add_requests([request])


@router.handler("detail")
async def detail_handler(context: ParselCrawlingContext) -> None:
    item = context.request.user_data["item"]
    item["results"] = context.selector.xpath("normalize-space(//form//strong[1])").get()  # <-- TYPE ERROR
    await context.push_data(item)


async def main() -> None:
    config = Configuration.get_global_configuration()
    config.persist_storage = False
    config.write_metadata = False
    crawler = ParselCrawler(request_handler=router)
    await crawler.run(["https://books.toscrape.com"])
    data = await crawler.get_data()
    print(data.items)


if __name__ == "__main__":
    asyncio.run(main())

In both VS Code (with Pylance) and on the command line (mypy), I get type errors on the marked lines. Mypy reports this:

./venv/bin/mypy main.py 
main.py:23: error: Incompatible types in assignment (expression has type "dict[str, str | None]", target has type "JsonValue")  [assignment]
main.py:30: error: Unsupported target for indexed assignment ("list[JsonValue] | dict[str, JsonValue] | str | bool | int | float | None")  [index]
main.py:30: error: No overload variant of "__setitem__" of "list" matches argument types "str", "str | None"  [call-overload]
main.py:30: note: Possible overload variants:
main.py:30: note:     def __setitem__(self, SupportsIndex, JsonValue, /) -> None
main.py:30: note:     def __setitem__(self, slice, Iterable[JsonValue], /) -> None
Found 3 errors in 1 file (checked 1 source file)

janbuchar commented 1 week ago

This is indeed strange: `item` is of type `dict[str, str | None]`, which should be a valid `JsonValue`, since `str | None` is a `JsonValue` and `dict[str, JsonValue]` should also be a `JsonValue`.

Please correct me if I'm missing something. If I'm not, this is a problem both in mypy and pyright.

tlinhart commented 1 week ago

This is what Pylance reports when hovering over the first of the two lines:

Argument of type "dict[str, str | None]" cannot be assigned to parameter "value" of type "JsonValue" in function "__setitem__"
  Type "dict[str, str | None]" is not assignable to type "JsonValue"
    "dict[str, str | None]" is not assignable to "List[JsonValue]"
    "dict[str, str | None]" is not assignable to "Dict[str, JsonValue]"
      Type parameter "_VT@dict" is invariant, but "str | None" is not the same as "JsonValue"
      Consider switching from "dict" to "Mapping" which is covariant in the value type
    "dict[str, str | None]" is not assignable to "str"
    "dict[str, str | None]" is not assignable to "bool"
    "dict[str, str | None]" is not assignable to "int"
Pylance reportArgumentType (https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportArgumentType)
(variable) user_data: dict[str, JsonValue]

janbuchar commented 1 week ago

Yeah, I believe that line 4 of that output is not correct. We should check whether this has already been reported to Pyright, Pylance, mypy or Pydantic.
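
For reference, the invariance hint in the Pylance output above can be reproduced without Crawlee or Pydantic. The following is only a minimal sketch: `JsonValue` is a simplified local stand-in for the real alias, and `maybe_title`, `takes_dict` and `takes_mapping` are hypothetical names made up for the illustration. It does not settle whether the checkers or the alias are at fault; it only shows the rule the message refers to.

```python
from typing import Dict, List, Mapping, Optional, Union

# Simplified local stand-in for the JSON value alias (an assumption,
# not the definition that Crawlee or Pydantic actually ship).
JsonValue = Union[List["JsonValue"], Dict[str, "JsonValue"], str, bool, int, float, None]


def maybe_title() -> Optional[str]:
    # Hypothetical helper that mimics `.get()` returning `str | None`.
    return "Travel"


def takes_dict(values: Dict[str, JsonValue]) -> int:
    # `dict` is invariant in its value type: only dict[str, JsonValue] fits.
    return len(values)


def takes_mapping(values: Mapping[str, JsonValue]) -> int:
    # `Mapping` is covariant in its value type: dict[str, str | None] fits too.
    return len(values)


# Without an annotation, a checker infers dict[str, str | None] here.
inferred = {"title": maybe_title()}

# A static checker is expected to accept this call.
mapping_count = takes_mapping(inferred)

# A static checker is expected to reject the call below (invariance),
# even though it would run without problems:
# takes_dict(inferred)
```

The design point is that a `dict` can be written to, so a checker cannot treat `dict[str, str | None]` as a `dict[str, JsonValue]`: code holding the wider type could legally insert an `int` into it.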

tlinhart commented 1 week ago

Could it be this one? https://github.com/pydantic/pydantic/issues/9445
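
Until the cause is confirmed, one possible way to keep both marked lines type-clean is to annotate the dictionary explicitly when it is created and to narrow the stored value before mutating it. This is a sketch under assumptions, not an official recommendation: `JsonValue` is again a simplified local stand-in, `build_item` and `add_result` are hypothetical helpers, and a plain dict stands in for `request.user_data`.

```python
from typing import Dict, List, Optional, Union

# Simplified local stand-in for the JSON value alias (an assumption).
JsonValue = Union[List["JsonValue"], Dict[str, "JsonValue"], str, bool, int, float, None]


def build_item(title: Optional[str]) -> Dict[str, JsonValue]:
    # The declared return type makes the checker validate the literal
    # against dict[str, JsonValue] instead of inferring dict[str, str | None].
    return {"title": title}


def add_result(stored: JsonValue, result: Optional[str]) -> Dict[str, JsonValue]:
    # Values read back are typed as the whole JsonValue union,
    # so narrow to a dict before using indexed assignment.
    if not isinstance(stored, dict):
        raise TypeError(f"expected a dict, got {type(stored).__name__}")
    stored["results"] = result
    return stored


# A plain dict standing in for `request.user_data`.
user_data: Dict[str, JsonValue] = {}
user_data["item"] = build_item("Travel")

item = add_result(user_data["item"], "11 results")
```

In the handlers from the sample, the same idea would mean writing `item: dict[str, JsonValue] = {...}` in `default_handler` and adding an `isinstance(item, dict)` check in `detail_handler` before assigning `item["results"]`.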