apify / crawlee-python

Crawlee — A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0

Implement/document a way to pass information between handlers #524

Closed honzajavorek closed 1 month ago

honzajavorek commented 1 month ago

I came across a situation where I scrape half of the item's data in the listing page handler and the other half in a handler taking care of the detail page. I think this must be quite a common case. I struggle to see how to pass the information down from one handler to another. See the concrete example below:

import re
import asyncio
from enum import StrEnum, auto

import click
from crawlee.beautifulsoup_crawler import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)
from crawlee.router import Router

LENGTH_RE = re.compile(r"(\d+)\s+min")

class Label(StrEnum):
    DETAIL = auto()

router = Router[BeautifulSoupCrawlingContext]()

@click.command()
def edison():
    asyncio.run(scrape())

async def scrape():
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data("edison.json", dataset_name="edison")

@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    await context.enqueue_links(selector=".program_table .name a", label=Label.DETAIL)

@router.handler(Label.DETAIL)
async def detail_handler(context: BeautifulSoupCrawlingContext):
    context.log.info(f"Scraping {context.request.url}")

    description = context.soup.select_one(".filmy_page .desc3").text
    length_min = LENGTH_RE.search(description).group(1)
    # TODO get starts_at, then calculate ends_at

    await context.push_data(
        {
            "url": context.request.url,
            "title": context.soup.select_one(".filmy_page h1").text.strip(),
            "csfd_url": context.soup.select_one(".filmy_page .hrefs a")["href"],
        },
        dataset_name="edison",
    )

I need to scrape starts_at in the default_handler, then add more details to the item on the detail page, and calculate the ends_at time according to the length of the film. Even if I changed enqueue_links to something more delicate, how do I pass data from one request to another?

vdusek commented 1 month ago

You probably want to use the user_data argument of enqueue_links:

@crawler.router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}')

    extracted_data_in_default_handler = context.soup.title.string

    await context.enqueue_links(
        user_data={'extracted_data_in_default_handler': extracted_data_in_default_handler},
    )
honzajavorek commented 1 month ago

That sounds about right. Haven't found much about it in the docs, at least using the built-in search. How do I access it in the other handler? Something like context.user_data?

B4nan commented 1 month ago

It's an attribute of the request, so you should be able to use context.request.user_data.
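
Putting the two answers together, a minimal sketch of the round trip (reusing the router and Label from the example above; the starts_at value is just illustrative):

@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    await context.enqueue_links(
        selector=".program_table .name a",
        label=Label.DETAIL,
        user_data={"starts_at": "2024-11-02T15:30:00"},  # illustrative value
    )

@router.handler(Label.DETAIL)
async def detail_handler(context: BeautifulSoupCrawlingContext):
    # the user_data attached above travels with the enqueued request
    starts_at = context.request.user_data["starts_at"]
    context.log.info(f"{context.request.url} starts at {starts_at}")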

honzajavorek commented 1 month ago

Cool, thanks! This was my main blocker when developing kino over the weekend. Can't promise getting back to this soon as it's just a hobby thing, but I'll assume this is enough info for me to make it work. Feel free to close this unless you want to turn it into a tracking issue of "this needs more examples in the docs".

vdusek commented 1 month ago

Cool, thanks! This was my main blocker when developing kino over the weekend. Can't promise getting back to this soon as it's just a hobby thing, but I'll assume this is enough info for me to make it work. Feel free to close this unless you want to turn it into a tracking issue of "this needs more examples in the docs".

Great, let us know once you try it. IMO we should add some examples to docs regarding this topic, so we can leave this open.

honzajavorek commented 1 month ago

It works, but it has a surprising quirk: it stringifies all dict values. It's not JSON; it's just as if I had cast str() on each value. And it's not applied to the data I send as a whole, but to the values of the dict I send. This must be a bug. Minimal example:

from collections import defaultdict
from datetime import datetime
from zoneinfo import ZoneInfo

@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    timetable = defaultdict(set)
    for i in range(5):
        url = f"https://example.com/{i}"
        timetable[url].add(datetime(2024, 11, 2, 15, 30, tzinfo=ZoneInfo("Europe/Prague")))
    await context.enqueue_links(
        selector=".program_table .name a",
        user_data={"timetable": timetable},
        label="detail",
    )

@router.handler("detail")
async def detail_handler(context: BeautifulSoupCrawlingContext):
    for url, starts_ats in context.request.user_data["timetable"].items():
        print(starts_ats)  # {datetime.datetime(2024, 11, 2, 15, 30, tzinfo=zoneinfo.ZoneInfo(key='Europe/Prague'))}
        print(type(starts_ats))  # <class 'str'>

I didn't play with it further, so I don't know what else it stringifies this way, but obviously such data is unusable. If this is expected behavior, I'd have to JSON-encode the values and then JSON-decode them. I could use a list instead of a set, that's okay, but I'd have to manually .isoformat() all the dates and then parse them back to get native types. That feels like a lot of unnecessary boilerplate that shouldn't be needed, or, if it is needed because of the framework's architecture, it should happen automatically.
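
Spelled out, the manual boilerplate described above would look roughly like this (a sketch only, reusing the handlers and the timetable dict from the minimal example):

import json
from datetime import datetime

@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    ...  # build the timetable as in the minimal example above
    # sets become lists, datetimes become ISO strings
    timetable_json = json.dumps(
        {url: [dt.isoformat() for dt in starts_ats] for url, starts_ats in timetable.items()}
    )
    await context.enqueue_links(
        selector=".program_table .name a",
        user_data={"timetable": timetable_json},
        label="detail",
    )

@router.handler("detail")
async def detail_handler(context: BeautifulSoupCrawlingContext):
    raw = json.loads(context.request.user_data["timetable"])
    # parse the ISO strings back into native datetime objects
    timetable = {url: {datetime.fromisoformat(s) for s in starts_ats} for url, starts_ats in raw.items()}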

honzajavorek commented 1 month ago

As a separate comment I want to also add that I noticed there is nothing more delicate than enqueue_links(). What I happen to need is to traverse a timetable for times of movie screenings and then scrape individual movie URLs for details about the movie. What would be most natural to me is something like:

# JUST PSEUDOCODE
for item in soup.select(".item"):
    time = datetime.fromisoformat(item.select_one(".screening-time").text)
    link = item.select_one(".movie-link")
    context.enqueue_link(link, user_data={"time": time}, label="detail")

This is the approach I've been used to all my life when creating scrapers, both with one-off scripts without any framework and with Scrapy. I don't believe I'm alone.

As far as I understand, there's currently no way to do that in Crawlee. I'm forced to scrape the whole timetable and then pass it down to the detail handler. In the detail handler, I (want to, if not for the bug above) look up the movie in the timetable by its URL and get the screening times. This feels unnatural.

I'm not sure whether I've been doing it wrong all this time and the way Crawlee does it is the preferred way for some good reasons, or whether it's just bad UX (actually DX).

Because... getting the timetable first and then pairing the movies back takes care of duplicate requests and forces me to think about the situation where the same movie is screened multiple times. Crawlee forced me into an unnatural architecture for my scraper, resulting in a better algorithm!

Should I ask for .enqueue_link() in a feature request, or should I admit that not having it is smart? I don't know. Maybe there are other cases where enqueuing links one by one with data attached would totally make sense. What do you think?

janbuchar commented 1 month ago

As a separate comment I want to also add that I noticed there is nothing more delicate than enqueue_links(). What I happen to need is to traverse a timetable for times of movie screenings and then scrape individual movie URLs for details about the movie. What would be most natural to me is something like:

I'm not sure I understand the whole thought process, but the example you posted is not far from being correct. You could just do this:

for item in soup.select(".item"):
    time = datetime.fromisoformat(item.select_one(".screening-time").text)
    link = item.select_one(".movie-link")
    await context.add_requests([Request.from_url(link["href"], user_data={"time": time}, label="detail")])

Or even better, you could gather the links in a list and then pass a list of Request objects to a single context.add_requests call.
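
A sketch of that batched variant (assuming Request is importable from the top-level crawlee package, and keeping user_data JSON-serializable because of the stringification quirk discussed above):

from datetime import datetime

from crawlee import Request

requests = []
for item in context.soup.select(".item"):
    time = datetime.fromisoformat(item.select_one(".screening-time").text)
    link = item.select_one(".movie-link")
    requests.append(
        Request.from_url(
            link["href"],
            user_data={"time": time.isoformat()},  # ISO string rather than a raw datetime
            label="detail",
        )
    )

# a single call enqueues the whole batch
await context.add_requests(requests)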

Did I understand the question right?

honzajavorek commented 1 month ago

I think you did! I completely overlooked the existence of add_requests() 🤯 And being aware of Request.from_url() will also be quite handy 😄 Thank you! ❤️ I'm sorry if this is documented in some tutorials which I might have skipped 🙏 Awesome. So the only issue is that user_data stringifies parts of the data.

janbuchar commented 1 month ago

I think you did! I completely overlooked the existence of add_requests() 🤯 And being aware of Request.from_url() will also be quite handy 😄 Thank you! ❤️ I'm sorry if this is documented in some tutorials which I might have skipped 🙏 Awesome. So the only issue is that user_data stringifies parts of the data.

This is the relevant documentation - https://crawlee.dev/python/docs/guides/request-storage#request-related-helpers. Feel free to suggest a better place, or if you think that there's a place where a link to this page would be useful.

janbuchar commented 1 month ago

It works, but it has a surprising quirk. It stringifies all dict values. It's not JSON, it's just as if I casted str() around the value. And it's not applied to the data I send, but to the values of a dict I send. This must be a bug.

Short answer is, you need to put JSON-serializable data into user_data. I should be able to validate this so that you get an exception instead of nonsensical output.
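
Until that validation exists in the framework, a minimal user-side guard can provide the same fail-fast behavior (ensure_json_serializable is a hypothetical helper sketched here, not part of Crawlee):

import json

def ensure_json_serializable(user_data: dict) -> dict:
    # json.dumps raises TypeError for datetimes, sets, Decimals, ...
    # so a bad payload fails loudly instead of being silently stringified
    json.dumps(user_data)
    return user_data

# inside a handler:
await context.enqueue_links(
    selector=".program_table .name a",
    user_data=ensure_json_serializable({"starts_at": "2024-11-02T15:30:00"}),
    label="detail",
)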

honzajavorek commented 1 month ago

Feel free to suggest a better place, or if you think that there's a place where a link to this page would be useful.

I think that guide covers it well; I must have just overlooked it. I went through the intro and only noticed enqueue_links(). Then I tried the suggestions in my editor, but I started typing enqueue, looking for methods like enqueue_link(). I think it's just bad luck and nothing particularly wrong on the docs side.

Short answer is, you need to put JSON-serializable data into user_data. I should be able to validate this so that you get an exception instead of nonsensical output.

Uff, so I'll need to do the heavy lifting. Getting an exception is definitely better than just being surprised by the result, but sending only JSON-serializable data is very limiting, especially when working with dates or, e.g., money (= decimals). I can use Pydantic or something to help me with serialization and deserialization, but it feels strange that I have to do that just to pass a dict from one function to another.

honzajavorek commented 1 month ago

My workaround for now:

from datetime import datetime, timedelta
from pydantic import RootModel

TimeTable = RootModel[dict[str, set[datetime]]]

@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    ...
    timetable_json = TimeTable(timetable).model_dump_json()
    await context.enqueue_links(
        selector=".program_table .name a",
        user_data=dict(timetable=timetable_json),
        label="detail",
    )

@router.handler("detail")
async def detail_handler(context: BeautifulSoupCrawlingContext):
    ...
    timetable_json = context.request.user_data["timetable"]
    timetable = TimeTable.model_validate_json(timetable_json).model_dump()

    for starts_at in timetable[context.request.url]:
        ends_at = starts_at + timedelta(minutes=length_min)
        await context.push_data(...)

Couldn't come up with anything more beautiful.

janbuchar commented 1 month ago

I think that guide covers it well; I must have just overlooked it. I went through the intro and only noticed enqueue_links(). Then I tried the suggestions in my editor, but I started typing enqueue, looking for methods like enqueue_link(). I think it's just bad luck and nothing particularly wrong on the docs side.

That part of the docs is also pretty new, so that might have played a part as well...

Short answer is, you need to put JSON-serializable data into user_data. I should be able to validate this so that you get an exception instead of nonsensical output.

Uff, so I'll need to do the heavy lifting. Getting an exception is definitely better than just being surprised by the result, but sending only JSON-serializable data is very limiting, especially when working with dates or, e.g., money (= decimals). I can use Pydantic or something to help me with serialization and deserialization, but it feels strange that I have to do that just to pass a dict from one function to another.

The reason for this requirement is that the request queue needs to be able to handle millions of items, and the local implementation uses the filesystem for that. If you deploy to Apify, the request will be sent as a JSON payload. So there's JSON serialization involved every time - this is no artificial restriction :slightly_smiling_face:

My workaround for now: ... Couldn't come up with anything more beautiful.

Honestly, I think this is fine. You don't need to use model_dump_json and model_validate_json, just model_dump and model_validate should work as well. Maybe it would make sense to have a context helper for loading and storing user data given a Pydantic model :shrug:
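
For illustration, such a helper pair might look roughly like this (hypothetical names, not an existing API; model_dump(mode="json") is used because, as noted further below, a plain model_dump() does not make the values JSON-serializable):

from datetime import datetime

from pydantic import BaseModel

class Screening(BaseModel):
    starts_at: datetime

def dump_user_data(model: BaseModel) -> dict:
    # mode="json" converts datetimes, sets, Decimals, ... into JSON-friendly values
    return model.model_dump(mode="json")

def load_user_data(model_cls: type[BaseModel], user_data) -> BaseModel:
    # validation coerces the JSON-friendly values back into native Python types
    return model_cls.model_validate(dict(user_data))

# usage inside handlers would be along the lines of:
#   await context.enqueue_links(user_data=dump_user_data(screening), label="detail")
#   screening = load_user_data(Screening, context.request.user_data)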

honzajavorek commented 1 month ago

I was able to get Crawlee working for my use case, and I learned a lot about parts I didn't know about. The user_data approach works well together with the Pydantic workaround. Thanks for answering all my questions and clearing up all the uncertainties I had! Apart from the surprising serialization, everything is addressed, so I'll close this issue now.

How surprising, necessary, or convenient that serialization is, is something I'll leave up to you and perhaps aggregated feedback from more users than just me. I'll attach my two cents below, but I don't want to get stuck discussing this further, because I myself think this is just a small ergonomic annoyance and we both probably have better things to do. You work on a framework, where this is just a tiny part, and I have some scrapers to finish 😄

One user's POV on serialization of user_data

I understand the limitation isn't arbitrary, but conditioned by the architecture. However, as someone who wants to focus on writing scrapers, it has implications for my DX:

- I don't expect the limitation. I write a scraper and expect the scraping itself to have pitfalls, not the framework. And the architecture is otherwise opaque: in Crawlee, I can switch from BS to a browser and the code is almost the same. The framework does a great job of hiding the technical details. And suddenly, here it _feels_ like sending data between two functions, but it's limited in an unnatural way. The only place where I remember experiencing a similarly surprising limitation is `multiprocessing.Pool()` and its friends, which give me segmentation faults if I try to send anything other than simple data structures between processes. The authors of that module did their best to hide the underlying machinery, so it feels like sending data between functions, but in the end I must think about the machinery and change my code accordingly.
- In Scrapy I don't have to think about this at all. Not sure how many millions of items it can handle.
- It's awesome that Crawlee scales to millions, but I write many different scrapers, each mostly up to tens, hundreds, or thousands of items scraped. IMHO this is the category most scraper devs fall into, but you may have a better overview of the scraping scene. My point is, each architecture has a trade-off. The advantage I get from serialization being part of the architecture isn't important for my use cases. It's just there, in my way, so I'm getting only the downside. Maybe, if I deployed the scraper to Apify, I would get the advantage of seamless integration.

I'm only trying to provide the mere user's POV on the matter. I can live just fine with Pydantic and model dumps, the same way I can live with other things that annoy me when using Scrapy. It's just that I can see this framework is being built right now, and I care about it, so I'm keen to provide this kind of feedback. I don't run these discussions at the Scrapy repo, not because I love how it's done, but because I don't care that much.

B4nan commented 1 month ago

In Scrapy I don't have to think about this at all. Not sure how many millions of items it can handle.

This feels quite unlikely, I doubt they store everything in memory, which is the only way to do what you want. Or maybe you lose the values if they no longer fit into the memory.

Btw this is not just about the use case of millions of items not fitting into memory; it's also about being able to continue a failed/stopped run, or the infamous migrations on the Apify platform.

janbuchar commented 1 month ago

In Scrapy I don't have to think about this at all. Not sure how many millions of items it can handle.

This feels quite unlikely, I doubt they store everything in memory, which is the only way to do what you want. Or maybe you lose the values if they no longer fit into the memory.

Well, they could be using something like pickle under the hood, which stores Python type information along with the data, but is infamous for not being compatible across Python versions and for breaking when your code changes.

honzajavorek commented 1 month ago

BTW

You don't need to use model_dump_json and model_validate_json, just model_dump and model_validate should work as well.

I don't think so. Haven't tested, but I thought that just turns the model to a dictionary. It doesn't solve the problem with datetimes, for example, does it? I mean, if my data contains stuff like sets, dictionaries, or decimals, then model_dump() doesn't make it JSON-serializable.
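
To illustrate with the TimeTable model from the workaround above (a quick sketch; the expected outputs are shown in the comments):

from datetime import datetime

from pydantic import RootModel

TimeTable = RootModel[dict[str, set[datetime]]]
tt = TimeTable({"https://example.com/1": {datetime(2024, 11, 2, 15, 30)}})

tt.model_dump()
# {'https://example.com/1': {datetime.datetime(2024, 11, 2, 15, 30)}}  <- still native Python types
tt.model_dump_json()
# '{"https://example.com/1":["2024-11-02T15:30:00"]}'  <- a JSON string, safe for user_data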

I didn't need Pydantic in my code. I used it now only to avoid writing my own default= for json.dumps() and polluting my code with that boilerplate, and then having to write something that turns ISO-formatted datetimes back into actual datetimes. Pydantic solves it for me as a workaround, with the added benefit of type checking, but I didn't really need types for such a small scraper.

janbuchar commented 1 month ago

BTW

You don't need to use model_dump_json and model_validate_json, just model_dump and model_validate should work as well.

I don't think so. Haven't tested, but I thought that just turns the model to a dictionary. It doesn't solve the problem with datetimes, for example, does it? I mean, if my data contains stuff like sets, dictionaries, or decimals, then model_dump() doesn't make it JSON-serializable.

You're right, I confused the behavior with jsonable_encoder from FastAPI :facepalm:

I didn't need Pydantic in my code. I used it now only to avoid writing my own default= for json.dumps() and polluting my code with that boilerplate, and then having to write something that turns ISO-formatted datetimes back into actual datetimes. Pydantic solves it for me as a workaround, with the added benefit of type checking, but I didn't really need types for such a small scraper.

Gotcha. If you have any ideas for a better API for user data handling, we're all ears. I'm very reluctant about the likes of pickle though...

honzajavorek commented 1 month ago

I think multiprocessing pickles the data under the hood, hence the segmentation faults, I guess 😂 The problem is that we need to deserialize the types, too. It's not a problem to take whatever the user sends our way and stringify it with an enhanced json.dumps() whose default= handles a few commonly used types. The problem is picking the serialized data back up and re-creating the types.
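
A small sketch of that asymmetry (the default function here is purely illustrative, not anything Crawlee provides):

import json
from datetime import datetime

def default(value):
    # the easy half: map a few commonly used types to JSON-friendly forms
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, set):
        return sorted(value)
    raise TypeError(f"not JSON serializable: {type(value)!r}")

payload = json.dumps({"starts_at": datetime(2024, 11, 2, 15, 30)}, default=default)

# the hard half: after loading, the value is just a string again; without a schema
# (a dataclass, a pydantic model, ...) nothing says it should become a datetime
print(json.loads(payload))  # {'starts_at': '2024-11-02T15:30:00'}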

I don't think there's anything besides pickle or pydantic that is widely used in the ecosystem, rather than experimental or of marginal popularity. Pickle is native but has issues; pydantic would be a dependency. Maybe a safe subset of pickle? Maybe dataclasses? Ideas:

  1. We're happy to accept user data which is either plain JSON-serializable, or something which consists of only certain types we allow (set, date, datetime, decimal...). Then we'd pickle it, otherwise raise an error with information on what is allowed.
  2. We're happy to accept user data which is either plain JSON-serializable, or a dataclass which consists of only certain types we allow (set, date, datetime, decimal...). Then we'll serialize and deserialize it for you, otherwise raise an error with information on what is allowed.

In the second scenario, though, Crawlee would have to take care of the serializing and deserializing logic, basically duplicating at least some of pydantic's job. So it might be simpler to just use pydantic for it and happily accept both dataclasses and pydantic models in the argument, if users are keen to pass them:

  3. We're happy to accept user data which is either plain JSON-serializable, or a dataclass, or a pydantic model. Then we'll serialize and deserialize it for you using pydantic, and if that fails, we just re-raise the error.

That way, unless I pass something convoluted with non-serializable types, I shouldn't need to know there's any serialization at all. I send what I have, and if that fails, the framework just asks me to send it as a dataclass, which is in the standard library. If I'm fancy and keen to read the docs, I can learn that a pydantic model would do as well.
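
In code, the third idea might be sketched roughly like this (names and signatures are made up for illustration, not an existing or planned Crawlee API):

from dataclasses import is_dataclass
from typing import Any

from pydantic import BaseModel, TypeAdapter

def serialize_user_data(user_data: Any) -> Any:
    if isinstance(user_data, BaseModel):
        return user_data.model_dump(mode="json")
    if is_dataclass(user_data) and not isinstance(user_data, type):
        # let pydantic turn datetimes, sets, decimals, ... inside the dataclass into JSON values
        return TypeAdapter(type(user_data)).dump_python(user_data, mode="json")
    # plain data: let JSON serialization fail loudly later if it isn't serializable
    return user_data

def deserialize_user_data(raw: Any, model_cls: type | None = None) -> Any:
    # the catch: the handler still has to say which dataclass/model to validate against
    if model_cls is None:
        return raw
    return TypeAdapter(model_cls).validate_python(raw)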