apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0

Memory usage average periodically printed from `AutoscaledPool` seems wrong #662

Closed: janbuchar closed this issue 3 days ago

janbuchar commented 1 week ago

This issue was observed on the Apify platform - see for example https://api.apify.com/v2/logs/g7piQaNbAdteXnkFC. The problem is that the average memory usage starts at 0 and suddenly hops to 1.0. I can testify that the increase was way more gradual 🙂

2024-11-06T15:11:50.471Z [crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 1; desired_concurrency = 1; cpu = 0.581; mem = 0.0; event_loop = 0.227; client_info = 0.0
2024-11-06T15:12:50.474Z [crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 1; cpu = 0.491; mem = 0.0; event_loop = 0.1; client_info = 0.0
2024-11-06T15:13:50.487Z [crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 1; cpu = 0.671; mem = 0.0; event_loop = 0.104; client_info = 0.0
2024-11-06T15:14:50.496Z [crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 3; cpu = 0.451; mem = 0.206; event_loop = 0.039; client_info = 0.0
2024-11-06T15:15:50.491Z [crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 1; cpu = 0.242; mem = 1.0; event_loop = 0.0; client_info = 0.0
vdusek commented 1 week ago

@Pijukatel, I assigned you this as it could be an opportunity to look into AutoscaledPool & Snapshotter & EventManager. Let me add a little more context...

Honza noticed this issue on the platform. It means using the PlatformEventManager, which receives CPU and memory usage data from the platform via WebSocket messages. We store these snapshots and compute memory load based on the N latest ones. There is a possibility that we may be computing this incorrectly.
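Purely as an illustration of that idea (the names, the value of N, and the plain-average summary below are assumptions, not the real Crawlee code), snapshot retention could look roughly like this; the exact summary Crawlee actually prints is discussed further down the thread:

```python
from collections import deque

# Hypothetical sketch: keep only the N latest memory snapshots and derive the
# reported load from that window. N and the averaging are illustrative only.
N_LATEST_SNAPSHOTS = 30

memory_snapshots: deque[float] = deque(maxlen=N_LATEST_SNAPSHOTS)


def record_memory_snapshot(memory_ratio: float) -> None:
    """Store a used/max memory ratio received e.g. from a platform event."""
    memory_snapshots.append(memory_ratio)


def memory_load() -> float:
    """Summarize the retained snapshots into a single load figure."""
    if not memory_snapshots:
        return 0.0
    return sum(memory_snapshots) / len(memory_snapshots)
```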

When running locally, we measure these metrics manually using the LocalEventManager (and related components). You can also verify this with a local setup to see if you observe any inconsistencies there as well.

Pijukatel commented 6 days ago

There are a bunch of issues connected to the mentioned "mem" measurement. It works OK-ish, but I found several problems that can make its value wrong in some cases.

Interpretation: The printed mem is not a memory usage ratio. It is the time-weighted ratio of samples in which memory usage was higher than a defined threshold (0.9, I think). That makes it very hard to interpret by eye, as it does not map well to human intuition, especially with "sampling luck", which basically means the measurement can skip some samples depending on timing.

Example (for simplicity, let's assume all samples have the same time weight): if the memory usage ratio was 0.89 for a couple of samples, mem would show 0. Then the memory usage ratio jumps to 0.91, and depending on the "sampling luck" of the print function, mem can jump directly to 1 or land at some intermediate value 0 < mem < 1.

[0.89,0.89,0.89,0.89,0.89,0.89] -> mem=0
[0.89,0.89,0.89,0.91,0.91,0.91] -> mem=0.5
[0.91,0.91,0.91,0.91,0.91,0.91] -> mem=1
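To make the described behaviour concrete, here is a minimal sketch of how such an overload-ratio summary can be computed; the names, the Sample shape, and the 0.9 threshold are illustrative assumptions, and the exact weighting in Crawlee differs in details:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

OVERLOAD_THRESHOLD = 0.9  # illustrative; the actual threshold is "0.9 I think"


@dataclass
class Sample:
    created_at: datetime
    memory_ratio: float  # used_memory / maximum_memory at sampling time


def overloaded_ratio(samples: list[Sample]) -> float:
    """Time-weighted share of intervals spent above the overload threshold.

    This is the kind of number printed as `mem`: it is not the memory usage
    itself, only how much of the observed time the usage exceeded the threshold.
    """
    total = timedelta(0)
    overloaded = timedelta(0)
    for previous, current in zip(samples, samples[1:]):
        duration = current.created_at - previous.created_at
        total += duration
        if current.memory_ratio > OVERLOAD_THRESHOLD:
            overloaded += duration
    return overloaded / total if total else 0.0
```

With equally spaced samples, six values of 0.89 yield 0.0 and six values of 0.91 yield 1.0, which is the kind of jump visible in the log above.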

Sample time order: Some code makes assumptions about the ordering of samples in a snapshot, but I ran a bunch of experiments and the order is not guaranteed! Snapshots are commonly ordered from oldest to newest, but this is not enforced, and sometimes the assumption does not hold. This can lead to calculation errors when the "sample duration" comes out negative, because the previous sample can be newer than the current one.
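If the ordering assumption should be kept, one hypothetical option is to sort defensively first (again using the illustrative Sample type from the sketch above):

```python
def in_chronological_order(samples: list[Sample]) -> list[Sample]:
    # Sorting defensively removes the implicit "oldest to newest" assumption,
    # so a negative "sample duration" can no longer occur downstream.
    return sorted(samples, key=lambda sample: sample.created_at)
```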

Some minor details: Fake time duration for identical timestamps. When two measurement snapshots have the same timestamp, the second one carries no new information, yet a constant gives the pair a "fake duration of 0.001", and such samples can taint further calculations. It is a small number, so I doubt it has a significant impact, but why have it there at all? I think it exists to avoid a division by zero down the line, but there are other ways to handle that, for example:
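(A hypothetical sketch, reusing the illustrative Sample from above; this is not the fix actually applied in the PR.)

```python
def time_weighted_average(samples: list[Sample]) -> float:
    """Average memory ratio weighted by how long each sample was 'current'."""
    weighted_sum = 0.0
    total_seconds = 0.0
    for previous, current in zip(samples, samples[1:]):
        duration = (current.created_at - previous.created_at).total_seconds()
        if duration <= 0:
            # Duplicate or out-of-order timestamps carry no extra information.
            continue
        weighted_sum += current.memory_ratio * duration
        total_seconds += duration
    if total_seconds == 0:
        # All samples share one timestamp (or there is at most one sample):
        # fall back to the latest value instead of dividing by zero.
        return samples[-1].memory_ratio if samples else 0.0
    return weighted_sum / total_seconds
```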

Unnecessary reverse: Even though most of the code expects time-ordered samples, at some point one processing step reverses the order. This makes the ordering inconsistent within the same function, which under different circumstances can return normally ordered samples.

janbuchar commented 6 days ago

@Pijukatel thanks for the detailed report! Could you compare the problematic parts with the JavaScript implementation here?

The printed mem is not a memory usage ratio. It is the time-weighted ratio of samples in which memory usage was higher than a defined threshold (0.9, I think). That makes it very hard to interpret by eye, as it does not map well to human intuition

This is the root of the issue! If something is not understandable to a human, why print it at all? Is there a number that would help the user estimate the current memory load better?

Pijukatel commented 5 days ago

In JS it prints here; that value is populated here, and the memory part is filled in here. So with my limited understanding of JS, I would say it is implemented in the same way.

I think a more readable number for the user would simply be the ratio used_memory/maximum_memory (or some time-processed variant of this easy-to-understand number, e.g. a time-weighted average of the last x samples).
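For illustration only, a local used_memory/maximum_memory ratio could be measured roughly like this, assuming psutil is used for the local metrics; the function name and the max_memory_bytes parameter are made up for this sketch:

```python
import os

import psutil


def memory_usage_ratio(max_memory_bytes: int) -> float:
    """Illustrative used_memory / maximum_memory for the current process tree."""
    main_process = psutil.Process(os.getpid())
    used = main_process.memory_info().rss
    for child in main_process.children(recursive=True):
        try:
            used += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # the child exited between listing and measuring
    return used / max_memory_bytes
```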

janbuchar commented 5 days ago

In JS it prints here; that value is populated here, and the memory part is filled in here. So with my limited understanding of JS, I would say it is implemented in the same way.

Good to hear that.

I think a more readable number for the user would simply be the ratio used_memory/maximum_memory (or some time-processed variant of this easy-to-understand number, e.g. a time-weighted average of the last x samples).

Now that I think of it... it doesn't seem sane to use a summarization that can do e.g. [0.91,0.91,0.91,0.91,0.91,0.91] -> mem=1 as input for autoscaling. Or am I missing some detail? If not, maybe we should rethink this in both implementations.

Any input, @vdusek and @B4nan?

B4nan commented 5 days ago

This was implemented years ago, before I joined, and it was always a source of confusion for me. I can't help with what you have in Python, and I don't have time to deep-dive into the JS version right now either - this endless context switching is killing me this week 😭

Pijukatel commented 5 days ago

Minor fixes for the issues found during the investigation: https://github.com/apify/crawlee-python/pull/692. A proposal for a different variable to show to the user will come in a separate PR.

vdusek commented 5 days ago

Thanks @Pijukatel for the detailed report! First, could you please also update the Snapshotter class docstring within the PR? I would say you can go a little deeper and describe exactly how the metrics are computed, so that next time we have a source other than the code itself.

(I will comment on the rest later.)

Pijukatel commented 5 days ago

The logged variables are tightly connected to the desired-concurrency controller in the autoscaled pool. Showing values that are understandable to the user but not actually used by the controller could cause misinterpretation of the logs as well. Since the current "hard to understand" values are used directly by the controller, I would not recommend changing the logging unless we also change the controller, and that seems out of scope for this task from my point of view.

vdusek commented 5 days ago

The printed mem is not a memory usage ratio. It is the time-weighted ratio of samples in which memory usage was higher than a defined threshold (0.9, I think). That makes it very hard to interpret by eye, as it does not map well to human intuition

Yeah, back then I implemented this in the same way as it was in the TS implementation.

Now that I think of it... it doesn't seem sane to use a summarization that can do e.g. [0.91,0.91,0.91,0.91,0.91,0.91] -> mem=1 as input for autoscaling. Or am I missing some detail? If not, maybe we should rethink this in both implementations.

I would be in favor of updating it to something more intuitive, assuming we are not overlooking anything. Couldn't we simply use a (time-weighted) average of the samples themselves rather than the ratio of overloaded to non-overloaded samples? I believe AutoscaledPool could base its scaling decisions on this time-weighted average as well.
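As a quick sanity check with the illustrative helpers sketched earlier in this thread, the two summaries differ exactly as discussed for the [0.91, ...] example:

```python
from datetime import datetime, timedelta

start = datetime(2024, 11, 6, 15, 11, 50)
samples = [Sample(start + timedelta(minutes=i), 0.91) for i in range(6)]

print(overloaded_ratio(samples))       # 1.0  - every interval is above the 0.9 threshold
print(time_weighted_average(samples))  # 0.91 - the actual memory usage ratio
```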

The logged variables are tightly connected to the desired-concurrency controller in the autoscaled pool. Showing values that are understandable to the user but not actually used by the controller could cause misinterpretation of the logs as well. Since the current "hard to understand" values are used directly by the controller, I would not recommend changing the logging unless we also change the controller, and that seems out of scope for this task from my point of view.

Agreed; let's leave it as is for now. At least in the current state, we are printing the data used by the AutoscaledPool, even if they are not easily understandable. If we decide to improve this, let's do it gradually as a follow-up task.

janbuchar commented 5 days ago

Random thought - maybe we could just not log the number, or at least change the message from mem = 0.42 to something less... suggestive?

B4nan commented 4 days ago

cc @petrpatek since I believe he initially implemented this in the JS version; maybe he remembers why it's like that, but it was a long time ago :]

edit: but based on git blame that must have been ages ago, I can only see Ondra refactoring all of that

Pijukatel commented 3 days ago

Closed by https://github.com/apify/crawlee-python/pull/692

B4nan commented 3 days ago

Next time, add Closes #662 to the PR description so it's linked and the issue gets closed automatically. That way the issue is also referenced in the changelog.

janbuchar commented 3 days ago

I'm not so sure about closing this one, though - maybe we should create a follow-up issue, because the problem still exists (AFAIK).