You probably want to use the `user_data` argument of `enqueue_links`:

```python
@crawler.router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}')
    extracted_data_in_default_handler = context.soup.title.string
    await context.enqueue_links(
        user_data={'extracted_data_in_default_handler': extracted_data_in_default_handler},
    )
```
That sounds about right. Haven't found much about it in the docs, at least using the built-in search. How do I access it in the other handler? Something like `context.user_data`?
It's an attribute of the request, so you should be able to use `context.request.user_data`.
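For instance, a minimal sketch of reading it back in another handler (assuming `enqueue_links` was also called with `label='detail'` so the request gets routed there; not code from this thread):

```python
@crawler.router.handler('detail')
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    # user_data set in enqueue_links() travels along with the request
    value = context.request.user_data['extracted_data_in_default_handler']
    context.log.info(f'Received from the default handler: {value}')
```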
Cool, thanks! This was my main blocker when developing kino over the weekend. Can't promise getting back to this soon as it's just a hobby thing, but I'll assume this is enough info for me to make it work. Feel free to close this unless you want to turn it into a tracking issue of "this needs more examples in the docs".
> Cool, thanks! This was my main blocker when developing kino over the weekend. Can't promise getting back to this soon as it's just a hobby thing, but I'll assume this is enough info for me to make it work. Feel free to close this unless you want to turn it into a tracking issue of "this needs more examples in the docs".
Great, let us know once you try it. IMO we should add some examples to docs regarding this topic, so we can leave this open.
It works, but it has a surprising quirk. It stringifies all `dict` values. It's not JSON, it's just as if I cast `str()` around the value. And it's not applied to the data I send, but to the values of a `dict` I send. This must be a bug. Minimal example:
```python
import zoneinfo
from collections import defaultdict
from datetime import datetime


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    timetable = defaultdict(set)
    for i in range(5):
        url = f"https://example.com/{i}"
        timetable[url].add(datetime(2024, 11, 2, 15, 30, tzinfo=zoneinfo.ZoneInfo(key='Europe/Prague')))
    await context.enqueue_links(
        selector=".program_table .name a",
        user_data={"timetable": timetable},
        label="detail",
    )


@router.handler("detail")
async def detail_handler(context: BeautifulSoupCrawlingContext):
    for url, starts_ats in context.request.user_data["timetable"].items():
        print(starts_ats)  # {datetime.datetime(2024, 11, 2, 15, 30, tzinfo=zoneinfo.ZoneInfo(key='Europe/Prague'))}
        print(type(starts_ats))  # <class 'str'>
```
I didn't play with it further, so I don't know what else it stringifies this way, but obviously such data is unusable. If this is expected behavior, I'd have to JSON-encode those and then JSON-decode. I could use `list` instead of `set`, that's okay, but I'd have to manually `.isoformat()` all the dates and then parse them to get back native types. That feels like a lot of unnecessary boilerplate which shouldn't be needed, or if it's needed because of the framework's architecture, it should happen automatically.
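To illustrate, the boilerplate I have in mind would look roughly like this (just a sketch with made-up data, not code from my scraper):

```python
import json
from datetime import datetime

timetable = {"https://example.com/1": {datetime(2024, 11, 2, 15, 30)}}

# Serialize by hand before handing the data to enqueue_links()...
timetable_json = json.dumps(
    {url: [dt.isoformat() for dt in times] for url, times in timetable.items()}
)

# ...and parse it back by hand in the detail handler.
restored = {
    url: {datetime.fromisoformat(value) for value in values}
    for url, values in json.loads(timetable_json).items()
}
assert restored == timetable
```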
As a separate comment I want to also add that I noticed there is nothing more delicate than `enqueue_links()`. What I happen to need is to traverse a timetable for times of movie screenings and then scrape individual movie URLs for details about the movie. What would be most natural to me is something like:
```python
# JUST PSEUDOCODE
for item in soup.select(".item"):
    time = datetime.fromisoformat(item.select_one(".screening-time").text)
    link = item.select_one(".movie-link")
    context.enqueue_link(link, user_data={"time": time}, label="detail")
```
This is an approach I've been used to when creating scrapers all my life, both with one-off scripts without any framework and with Scrapy. I don't believe I'm alone.
As far as I understand, there's currently no way to do that in Crawlee. I'm forced to scrape the whole timetable and then pass it down to the detail handler. In the detail handler, I (want to, if not for the bug above) look up the movie in the timetable by its URL and get the screening times. This feels unnatural.
I'm not sure whether I've been doing it wrong all along and the way Crawlee does it is the preferred way for some good reasons, or whether it's just bad UX (actually DX).
Because... Getting the timetable first and then pairing the movies back takes care of duplicate requests and forces me to think about the situation when the same movie is screened multiple times. Crawlee forced me into an unnatural architecture for my scraper, resulting in a better algorithm!
Should I ask for `.enqueue_link()` in a feature request, or should I admit that not having it is smart? I don't know. Maybe there are other cases when enqueuing link by link with data attached would totally make sense. What do you think?
> As a separate comment I want to also add that I noticed there is nothing more delicate than `enqueue_links()`. What I happen to need is to traverse a timetable for times of movie screenings and then scrape individual movie URLs for details about the movie. What would be most natural to me is something like:
I'm not sure I understand the whole thought process, but the example you posted is not far from being correct. You could just do this:
```python
for item in soup.select(".item"):
    time = datetime.fromisoformat(item.select_one(".screening-time").text)
    link = item.select_one(".movie-link")
    await context.add_requests([Request.from_url(link, user_data={"time": time}, label="detail")])
```
Or even better, you could gather the links in a list and then pass a list of `Request` objects to a single `context.add_requests` call.
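For illustration, a rough sketch of that batched variant (the selectors follow the pseudocode above; the `Request` import path may differ between Crawlee versions, and the time is passed as plain text so the value stays JSON-serializable):

```python
from crawlee import Request

requests = []
for item in context.soup.select(".item"):
    time_text = item.select_one(".screening-time").text
    link = item.select_one(".movie-link")
    requests.append(
        Request.from_url(
            link["href"],
            user_data={"time": time_text},  # keeping the value JSON-serializable
            label="detail",
        )
    )
await context.add_requests(requests)
```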
Did I understand the question right?
I think you did! I completely overlooked the existence of `add_requests()` and `add_requests()` 🤯 And being aware of `Request.from_url()` will also be quite handy. Thank you! ❤️ I'm sorry if this is documented in some tutorials which I might have skipped. Awesome. So the only issue is that `user_data` stringifies parts of the data.
> I think you did! I completely overlooked the existence of `add_requests()` and `add_requests()` 🤯 And being aware of `Request.from_url()` will also be quite handy. Thank you! ❤️ I'm sorry if this is documented in some tutorials which I might have skipped. Awesome. So the only issue is that `user_data` stringifies parts of the data.
This is the relevant documentation - https://crawlee.dev/python/docs/guides/request-storage#request-related-helpers. Feel free to suggest a better place, or if you think that there's a place where a link to this page would be useful.
> It works, but it has a surprising quirk. It stringifies all `dict` values. It's not JSON, it's just as if I cast `str()` around the value. And it's not applied to the data I send, but to the values of a `dict` I send. This must be a bug.
Short answer is, you need to put JSON-serializable data into `user_data`. I should be able to validate this so that you get an exception instead of nonsensical output.
> Feel free to suggest a better place, or if you think that there's a place where a link to this page would be useful.
I think that guide goes through it well, I must have just overlooked that one. I went through the intro and I only noticed `enqueue_links()`. Then I tried suggestions in my editor, but I started typing `enqueue`, looking for methods like `enqueue_link()`. I think it's just bad luck and nothing particularly wrong on the docs side.
> Short answer is, you need to put JSON-serializable data into `user_data`. I should be able to validate this so that you get an exception instead of nonsensical output.
Uff, so I'll need to do the heavy lifting. Getting an exception is definitely better than just being surprised by the result, but sending just JSON-serializable data is very limiting, especially when working with dates or, e.g., money (= decimals). I can use Pydantic or something to help me with serialization and deserialization, but it feels strange that I have to do it just to pass a dict from one function to another.
My workaround for now:
```python
from datetime import datetime, timedelta

from pydantic import RootModel

TimeTable = RootModel[dict[str, set[datetime]]]


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    ...
    timetable_json = TimeTable(timetable).model_dump_json()
    await context.enqueue_links(
        selector=".program_table .name a",
        user_data=dict(timetable=timetable_json),
        label="detail",
    )


@router.handler("detail")
async def detail_handler(context: BeautifulSoupCrawlingContext):
    ...
    timetable_json = context.request.user_data["timetable"]
    timetable = TimeTable.model_validate_json(timetable_json).model_dump()
    for starts_at in timetable[context.request.url]:
        ends_at = starts_at + timedelta(minutes=length_min)
        await context.push_data(...)
```
Couldn't come up with anything more beautiful.
> I think that guide goes through it well, I must have just overlooked that one. I went through the intro and I only noticed `enqueue_links()`. Then I tried suggestions in my editor, but I started typing `enqueue`, looking for methods like `enqueue_link()`. I think it's just bad luck and nothing particularly wrong on the docs side.
That part of the docs is also pretty new, so that might have played a part as well...
> > Short answer is, you need to put JSON-serializable data into `user_data`. I should be able to validate this so that you get an exception instead of nonsensical output.
>
> Uff, so I'll need to do the heavy lifting. Getting an exception is definitely better than just being surprised by the result, but sending just JSON-serializable data is very limiting, especially when working with dates or, e.g., money (= decimals). I can use Pydantic or something to help me with serialization and deserialization, but it feels strange that I have to do it just to pass a dict from one function to another.
The reason for this requirement is that the request queue needs to be able to handle millions of items, and the local implementation uses the filesystem for that. If you deploy to Apify, the request will be sent as a JSON payload. So there's JSON serialization involved every time - this is no artificial restriction :slightly_smiling_face:
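To illustrate the effect you saw (a guess at the mechanism, the actual implementation may differ, but the observable behavior matches a plain JSON round-trip with something like `default=str`):

```python
import json
from datetime import datetime

user_data = {"timetable": {"https://example.com/1": {datetime(2024, 11, 2, 15, 30)}}}

# Non-JSON-serializable values get coerced to strings on the way in...
payload = json.dumps(user_data, default=str)

# ...and come back as plain strings after deserialization.
restored = json.loads(payload)
print(type(restored["timetable"]["https://example.com/1"]))  # <class 'str'>
```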
> My workaround for now: ... Couldn't come up with anything more beautiful.
Honestly, I think this is fine. You don't need to use `model_dump_json` and `model_validate_json`, just `model_dump` and `model_validate` should work as well. Maybe it would make sense to have a context helper for loading and storing user data given a Pydantic model :shrug:
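Something along these lines, purely hypothetical (neither helper exists in Crawlee today; the names are made up):

```python
from typing import TypeVar

from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


# Hypothetical helpers sketching the idea; not an actual Crawlee API.
def store_user_data(user_data: dict, key: str, model: BaseModel) -> None:
    user_data[key] = model.model_dump_json()


def load_user_data(user_data: dict, key: str, model_class: type[T]) -> T:
    return model_class.model_validate_json(user_data[key])
```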
I was able to get Crawlee working for the use case, and I learned a lot about parts I didn't know about. The `user_data` delivers well together with the Pydantic workaround. Thanks for answering all my questions and clearing up all the uncertainties I had! Apart from the surprising serialization, everything is addressed, so I'll close this issue now.
How surprising or necessary or convenient that serialization is, that is something I'll leave up to you and perhaps aggregated feedback from more users than just me. I'll attach my two cents below, but I don't want to get stuck discussing this further, because I myself think this is just a small ergonomic annoyance and we both probably have better things to do. You work on a framework, where this is just a tiny part, and I have some scrapers to finish.
In Scrapy I don't have to think about this at all. Not sure how many millions of items it can handle.

> In Scrapy I don't have to think about this at all. Not sure how many millions of items it can handle.

This feels quite unlikely, I doubt they store everything in memory, which is the only way to do what you want. Or maybe you lose the values if they no longer fit into memory.
Btw this is not just about the use case of millions of items not fitting into memory, it's also about being able to continue a failed/stopped run, or the infamous migrations on the Apify platform.
> > In Scrapy I don't have to think about this at all. Not sure how many millions of items it can handle.
>
> This feels quite unlikely, I doubt they store everything in memory, which is the only way to do what you want. Or maybe you lose the values if they no longer fit into memory.
Well, they could be using something like `pickle` under the hood, which stores Python type information along with the data, but is infamous for not being compatible across Python versions and breaking when your code changes.
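A quick sketch of the contrast I mean (standard library only):

```python
import json
import pickle
from datetime import datetime

data = {"starts_at": datetime(2024, 11, 2, 15, 30)}

# pickle round-trips the Python types...
assert type(pickle.loads(pickle.dumps(data))["starts_at"]) is datetime

# ...while a JSON round-trip needs custom code in both directions.
payload = json.dumps(data, default=str)
assert type(json.loads(payload)["starts_at"]) is str
```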
BTW

> You don't need to use `model_dump_json` and `model_validate_json`, just `model_dump` and `model_validate` should work as well.

I don't think so. Haven't tested, but I thought that just turns the model into a dictionary. It doesn't solve the problem with datetimes, for example, does it? I mean, if my data contains stuff like sets, dictionaries, or decimals, then `model_dump()` doesn't make it JSON-serializable.
I didn't need Pydantic in my code. I used it now only for the purpose of not having to write my own `default=` for `json.dumps()` and polluting my code with that boilerplate, and then having to write something that turns ISO-formatted datetimes back into actual datetimes. Pydantic solves it for me as a workaround, with the added benefit of type checking, but I didn't really need types for such a small scraper.
> BTW
>
> > You don't need to use `model_dump_json` and `model_validate_json`, just `model_dump` and `model_validate` should work as well.
>
> I don't think so. Haven't tested, but I thought that just turns the model into a dictionary. It doesn't solve the problem with datetimes, for example, does it? I mean, if my data contains stuff like sets, dictionaries, or decimals, then `model_dump()` doesn't make it JSON-serializable.
You're right, I mistook the behavior with `jsonable_encoder` from FastAPI :facepalm:
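For reference, that one does a one-way conversion to JSON-compatible types (e.g. datetimes become ISO strings), which is where the mix-up came from:

```python
from datetime import datetime

from fastapi.encoders import jsonable_encoder

print(jsonable_encoder({"starts_at": datetime(2024, 11, 2, 15, 30)}))
# {'starts_at': '2024-11-02T15:30:00'}
```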
> I didn't need Pydantic in my code. I used it now only for the purpose of not having to write my own `default=` for `json.dumps()` and polluting my code with that boilerplate, and then having to write something that turns ISO-formatted datetimes back into actual datetimes. Pydantic solves it for me as a workaround, with the added benefit of type checking, but I didn't really need types for such a small scraper.
Gotcha. If you had any idea for a better API for user data handling, we're all ears. I'm very reluctant towards the likes of `pickle` though...
I think `multiprocessing` pickles the data under the hood, hence the segmentation faults, I guess. The problem is that we need to deserialize the types, too. It's not a problem to take whatever the user sends our way and stringify it with an enhanced `json.dumps()` with a `default=` which implements a few often-used types. The problem is to pick up the serialized data and re-create the types.
I don't think there's anything other than pickle or Pydantic. I mean, widely used in the ecosystem, not something experimental or with marginal popularity. Pickle is native but has issues; Pydantic would be a dependency. Two ideas: maybe a safe subset of pickle, or maybe dataclasses.
In the second scenario (dataclasses), though, Crawlee would have to take care of the serializing and deserializing logic, basically duplicating at least some of Pydantic's job. So it might be simpler to just use Pydantic for it and happily accept both dataclasses and Pydantic models in the argument, if users are keen to pass them:
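Purely as a sketch of the idea (hypothetical: `Screening` is a made-up model and Crawlee doesn't do any of this today):

```python
from dataclasses import dataclass
from datetime import datetime

from pydantic import TypeAdapter


@dataclass
class Screening:
    movie_url: str
    starts_at: datetime


# Pydantic can round-trip a plain dataclass without any user-written
# serialization code, which is what the framework could do internally.
adapter = TypeAdapter(Screening)
payload = adapter.dump_json(Screening("https://example.com/1", datetime(2024, 11, 2, 15, 30)))
restored = adapter.validate_json(payload)
assert restored.starts_at == datetime(2024, 11, 2, 15, 30)
```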
That way, unless I pass something convoluted with non-serializable types, I shouldn't need to know there's any serialization at all. I send what I have, and if that fails, the framework just asks me to send it as a dataclass, which is standard library. If I'm fancy and keen to read the docs, I can learn that a Pydantic model would do as well.
I came across a situation where I scrape half of the item's data in the listing page handler and the other half in a handler taking care of the detail page. I think this must be quite a common case. I struggle to see how I pass the information down from one handler to another. See the concrete example below:

I need to scrape `starts_at` in the `default_handler`, then add more details to the item on the detail page, and calculate the `ends_at` time according to the length of the film. Even if I changed `enqueue_links` to something more delicate, how do I pass data from one request to another?