IgnisDa / ryot

Roll your own tracker!
https://ryot.io
GNU General Public License v3.0

High CPU usage #440

Closed: ellsclytn closed this issue 8 months ago

ellsclytn commented 1 year ago

I've noticed that my Ryot container enters a state of high CPU usage daily.

image

The above shows CPU usage over a 48h period. It seems to recover temporarily around 00:00, which leads me to believe it's something happening within one of the daily background jobs, but I'm just not sure which one as of yet.

The biggest spike seems to happen around 03:23-03:26.

image

I looked at the logs (where I've set RUST_LOG=ryot=trace,sea_orm=debug), but there doesn't seem to be much to look at beyond a lot of SQL queries.

2023-10-25T03:23:39.161607Z DEBUG sea_orm::driver::sqlx_postgres: SELECT "collection_to_entity"."id", "collection_to_entity"."last_updated_on", "collection_to_entity"."collection_id", "collection_to_entity"."metadata_id", "collection_to_entity"."person_id", "collection_to_entity"."metadata_group_id", "collection_to_entity"."exercise_id" FROM "collection_to_entity" WHERE "collection_to_entity"."collection_id" = 2 AND "collection_to_entity"."metadata_id" = 479 LIMIT 1
2023-10-25T03:23:39.161894Z DEBUG sea_orm::driver::sqlx_postgres: UPDATE "collection_to_entity" SET "last_updated_on" = '2023-10-25 03:23:39 +00:00' WHERE "collection_to_entity"."id" = 10 RETURNING "id", "last_updated_on", "collection_id", "metadata_id", "person_id", "metadata_group_id", "exercise_id"
2023-10-25T03:23:39.163085Z DEBUG sea_orm::driver::sqlx_postgres: SELECT "user_to_entity"."id", "user_to_entity"."last_updated_on", "user_to_entity"."user_id", "user_to_entity"."num_times_interacted", "user_to_entity"."metadata_id", "user_to_entity"."exercise_id", "user_to_entity"."metadata_monitored", "user_to_entity"."metadata_reminder", "user_to_entity"."exercise_extra_information" FROM "user_to_entity" WHERE "user_to_entity"."user_id" = 1 AND "user_to_entity"."metadata_id" = 479 LIMIT 1
2023-10-25T03:23:39.164215Z TRACE ryot::background: Job: "AfterMediaSeen", Time Taken: 28ms, Successful = true
2023-10-25T04:00:00.003823Z TRACE ryot::background: Getting data from yanked integrations for all users
2023-10-25T04:00:00.004961Z DEBUG sea_orm::driver::sqlx_postgres: SELECT "user"."id" FROM "user" WHERE "user"."yank_integrations" IS NOT NULL

These are the last few lines of logs between 2023-10-25T03:23:00 and 2023-10-25T04:26:00.

I'm wondering if anyone is experiencing, or has experienced, something similar, or might have some further leads I could explore.

IgnisDa commented 1 year ago

How much memory is used?

ellsclytn commented 1 year ago

image

Not much at all. The highest point is 64.1 MB.

IgnisDa commented 1 year ago
ellsclytn commented 1 year ago

Thanks for helping out, will send across an export soon.

Could you tell me what sort of summary you're after? I.e. just the same stuff I opened the issue with, or something more?

Neofetch:

OS: Ubuntu 22.04.3 LTS x86_64
Host: B550I AORUS PRO AX
Kernel: 5.15.0-86-generic
Uptime: 7 days, 6 hours, 4 mins
Packages: 1014 (dpkg)
Shell: bash 5.1.16
Terminal: /dev/pts/0
CPU: AMD Ryzen 5 5500 (12) @ 3.600GHz
GPU: NVIDIA GeForce GTX 1070
Memory: 13628MiB / 15830MiB

IgnisDa commented 1 year ago

what sort of summary you're after

Just the one on your Ryot dashboard

Also what is your architecture? I remember someone reporting high cpu usage on arm64.

ellsclytn commented 1 year ago

Oh, that dashboard. Of course :grinning:

image

I'm on amd64. I'll also mention I'm on v3.0.0 of Ryot, though this has been happening for me since at least v2.24.0 (possibly earlier, unfortunately I didn't keep track that far back).

IgnisDa commented 1 year ago

Try updating Ryot to the latest version and see if it improves (I doubt it will).

What do you use to monitor your server? I don't have any monitoring on mine, so I don't know what an ideal CPU footprint should look like.

ellsclytn commented 1 year ago

Will do. I'll give it a day at least after updating before making any judgement calls on it. Just emailed you the export, too.

I'm using Grafana with cAdvisor and Prometheus, plus this dashboard template and a config pretty similar to the one in its readme (I may have tweaked it a little for my setup).

IgnisDa commented 1 year ago

Can you tell me the versions of Prometheus, node-exporter, cAdvisor, and Grafana? I suspect the dashboard template is outdated; it uses latest tags.

ellsclytn commented 1 year ago

Grafana: grafana-oss:9.1.1
cAdvisor: 0.47.2
node-exporter: 1.6.1
Prometheus: 2.47.2

I've also put my version of the dashboard (admittedly I've deleted some of the panels I never used) into a gist.

IgnisDa commented 1 year ago

Thanks!

ellsclytn commented 12 months ago

Unfortunately the behavior hasn't changed since updating. I might be a bit slow to get back to this, but I'm thinking of spinning up a separate Ryot instance, importing specific sections one at a time (e.g. movies, shows, books), and observing the results to see if I can narrow it down further.

thespad commented 12 months ago

I'm seeing the same issue with high CPU on v3.2.5. Last 2 days: image

Similarly low memory usage: image

image

Also seeing big CPU spikes on the DB every night for 2 hours image

I'm assuming some kind of scheduled task is running, but 2 hours is a long time to be spiking like that.

IgnisDa commented 12 months ago

Can you set the env variable RUST_LOG=ryot=trace, then go to the miscellaneous settings and recalculate the summary? A log line should appear telling you how long the calculation step took. Does it take 2 hours?

ellsclytn commented 12 months ago

Mine was super quick! 23 seconds.

2023-10-29T03:00:36.036545Z TRACE ryot::background: Started job: "RecalculateUserSummary"
2023-10-29T03:00:59.599454Z TRACE ryot::background: Job: "RecalculateUserSummary", Time Taken: 23562ms, Successful = true

IgnisDa commented 12 months ago

Then it is not the summary calculation that takes up the CPU. I recommend you wait another day, observe the logs around the time the CPU spike happens, and see which job is causing it.

You can also change the timezone (via TZ env var) if you want the job to happen sooner.

IgnisDa commented 12 months ago

Another possible cause: do you have a lot of media in the watchlist and in-progress collections? Also, do you have a lot of explicitly monitored stuff (there's a filter for it on the list page)?

ellsclytn commented 12 months ago

image

Not a huge number there. Similarly for Explicitly Monitored, where the only thing that came up was 7 TV Shows.

I've also imported my export into a separate, v2.0.0 instance of Ryot, just to see if it behaves the same. I'll keep an eye on it and slowly bump the version on that second instance until I start seeing the behavior.

IgnisDa commented 12 months ago

Cool thanks!

IgnisDa commented 12 months ago

I am also fairly certain this usage spike is localized to your instance. Here are the metrics of the public demo instance (which has a lot more media than yours):

image

So there must be something different about your setup that's causing the spikes.

IgnisDa commented 12 months ago

Can you also attach the usage graphs of the database container? Maybe something is blocking there.

ellsclytn commented 12 months ago

I currently share the Postgres instance with a number of other services. I'll spin up a dedicated Postgres instance so I can observe it more accurately. I'll post some screenshots in a day or two once there's a good amount of data.

thespad commented 12 months ago

OK, so here is a week's worth of data. As you can see, the CPU spikes happen every day at 7am (6am today because the clocks went back last night). I've included the DB container too (it's only running the Ryot DB). I'm not sure why Friday and today have higher spikes than the other days.

image

IgnisDa commented 12 months ago

@thespad your usage pattern seems fine to me. Those are just the daily cron jobs running.

thespad commented 12 months ago

Running for 3-18 hours every day? That seems excessive.

IgnisDa commented 12 months ago

I suggest you do this (https://github.com/IgnisDa/ryot/issues/440#issuecomment-1783974800) too and report back the times.

thespad commented 12 months ago

This is now with v3.3.2

ryot  | 2023-10-29T13:20:50.030129Z TRACE ryot::background: Started job: "RecalculateUserSummary"
ryot  | 2023-10-29T13:21:15.916227Z TRACE ryot::background: Job: "RecalculateUserSummary", Time Taken: 25886ms, Successful = true
ryot  | 2023-10-29T13:21:46.044682Z TRACE ryot::background: Started job: "RecalculateCalendarEvents"
ryot  | 2023-10-29T13:22:15.154754Z DEBUG recalculate_calendar_events: ryot::miscellaneous::resolver: Finished deleting invalid calendar events
ryot  | 2023-10-29T13:22:15.156104Z DEBUG recalculate_calendar_events: ryot::miscellaneous::resolver: Finished updating calendar events
ryot  | 2023-10-29T13:22:15.156146Z TRACE ryot::background: Job: "RecalculateCalendarEvents", Time Taken: 29111ms, Successful = true
ryot  | 2023-10-29T13:22:38.020362Z TRACE ryot::background: Started job: "UpdateMetadata"
...
ryot  | 2023-10-29T13:24:38.428239Z TRACE ryot::miscellaneous::resolver: Updating metadata for 331
ryot  | 2023-10-29T13:24:38.533986Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
...

This is still going after 10 minutes; I'll let you know when it finally completes.

thespad commented 12 months ago

Though I'll note that: image

CPU usage (obviously) dropped to almost zero when I recreated the container to update the logging envs, and has not returned to the previous levels while running any of these jobs, which suggests that however long they take, they are not (on their own) responsible for the high CPU usage.

thespad commented 12 months ago

The AssociatePersonWithMetadata job is still running after 1.5 hours but still basically no CPU activity

image

IgnisDa commented 12 months ago

The AssociatePersonWithMetadata job is still running after 1.5 hours but still basically no CPU activity

Pretty sure it's not running, since it is just one HTTP call and one DB call. The logging must have been lost due to an error; I haven't really done good error reporting in the code.

It would be great if you could associate the CPU spike to a specific job/event. Next time you observe a spike, just have a look at the time on the graph and then the corresponding logs of the Ryot container.

thespad commented 12 months ago

To be clear, it's not a single AssociatePersonWithMetadata job but the run of all of them after triggering the metadata update task under Miscellaneous:

ryot  | 2023-10-29T16:32:36.606765Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
ryot  | 2023-10-29T16:32:36.608172Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 1ms, Successful = true
ryot  | 2023-10-29T16:32:36.683708Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
ryot  | 2023-10-29T16:32:36.685218Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 1ms, Successful = true
ryot  | 2023-10-29T16:32:36.755320Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
ryot  | 2023-10-29T16:32:36.756347Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 0ms, Successful = true
ryot  | 2023-10-29T16:32:36.827786Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
ryot  | 2023-10-29T16:32:36.828977Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 1ms, Successful = true
ryot  | 2023-10-29T16:32:36.900988Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
ryot  | 2023-10-29T16:32:36.901980Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 0ms, Successful = true
ryot  | 2023-10-29T16:32:37.036672Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
ryot  | 2023-10-29T16:32:37.037981Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 1ms, Successful = true

IgnisDa commented 12 months ago

I don't understand what you mean. Looks like all the jobs are passing?

thespad commented 12 months ago

I mean the "Update Metadata" process is still running tasks after 3 hours, but there's no corresponding CPU spike. So while it's the only one of the user-triggerable processes that's taken more than a minute to run, it doesn't seem like it can be responsible (on its own) for the high CPU usage that I've been seeing.

IgnisDa commented 12 months ago

Then that job must also be done. Where did you run that job from?

thespad commented 12 months ago

image

And it's still running new jobs (it's 17:40 right now), but there's no indication of overall progress or of what the individual tasks refer to, so I've no insight into how much longer it might be running for. I'm assuming a task for every person, which means just over 30k jobs, which would mean probably 4-8 hours at a guess based on how fast they're actually running (I know it says 1ms but the setup + teardown means the total job time is much longer than that in practice - seems to be about 0.5 seconds each, and 30,000 × 0.5 s is already over 4 hours): image

ryot  | 2023-10-29T17:39:58.012269Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
ryot  | 2023-10-29T17:39:58.013338Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 1ms, Successful = true
ryot  | 2023-10-29T17:39:58.094844Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
ryot  | 2023-10-29T17:39:58.095967Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 1ms, Successful = true
ryot  | 2023-10-29T17:39:58.243872Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
ryot  | 2023-10-29T17:39:58.245019Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 1ms, Successful = true
IgnisDa commented 12 months ago

Yeah, that's a lot of people. I have double your media but half your people; I wonder why. image

It might be because you have a lot of movies/shows and I have a lot of books and audiobooks (which have fewer people involved).

IgnisDa commented 12 months ago

which would mean probably 4-8 hours at a guess based on how fast they're actually running

I don't have any calculations or concrete numbers, but the last time I refreshed my metadata it took around 4.5 hours. Yours will take 8+ then.

I know it says 1ms but the setup + teardown means the total job time is much longer than that in practice - seems to be about 0.5 seconds each

Jobs are rate limited to 1 per second.
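
To illustrate how that plays out, here's a rough sketch of a 1-job-per-second limit (this is not Ryot's actual scheduler code, and the Job type is hypothetical): even a 1ms job occupies a ~1s slot, so the total run time is dominated by the rate limit rather than the per-job work.

```rust
// Rough sketch only, not Ryot's actual scheduler: a queue drained through a
// once-per-second ticker. A 1ms job still occupies a ~1s slot, so ~30k jobs
// take 8+ hours of wall time regardless of how cheap each one is.
// Assumes tokio = { version = "1", features = ["full"] } as a dependency.
use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

#[derive(Debug)]
enum Job {
    // Hypothetical payload, just for the sketch.
    AssociatePersonWithMetadata { person_id: i32 },
}

async fn run_job(job: Job) {
    // Stand-in for the real work (roughly one HTTP call plus a few DB queries).
    println!("running {:?}", job);
}

#[tokio::main]
async fn main() {
    let queue: Vec<Job> = (1..=5)
        .map(|person_id| Job::AssociatePersonWithMetadata { person_id })
        .collect();

    // One tick per second caps throughput at 1 job/s.
    let mut ticker = interval(Duration::from_secs(1));
    ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);

    for job in queue {
        ticker.tick().await;
        run_job(job).await;
    }
}
```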

thespad commented 12 months ago

I suspect it's the number of episodes rather than the total number of movies/shows; in my case it's nearly 8.5k episodes, which will mean a lot of people.

It doesn't seem sustainable to do individual metadata fetch jobs on a regular basis if you've got tens of thousands of people - at a certain size of library the refresh is going to start taking more than 24 hours.

Although the manual run doesn't seem to have caused high CPU, the timeframe does fit with the historic spikes, so it could be a combination of factors when it's running on its schedule.

IgnisDa commented 12 months ago

It doesn't seem sustainable to do individual metadata fetch jobs on a regular basis if you've got tens of thousands of people - at a certain size of library the refresh is going to start taking more than 24 hours.

Yeah, I want to address this via #437, but that will take a long time since it's very complicated and I haven't been able to think of a solution yet.

thespad commented 12 months ago

I've not looked at your DB schema but based on how the tasks are running I would have thought that a relatively straightforward way to reduce the processing time would be to associate the people to the media rather than the media to the people. That way you only have to process <number of media items> jobs instead of <number of people> jobs.
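
Roughly the shape I mean, as a sketch with entirely hypothetical types and helpers (Media, Person, fetch_credits, upsert_people are made up for illustration, not anything from Ryot's actual code):

```rust
// Sketch only: every type and helper here is a hypothetical stand-in, not
// Ryot's real schema or code.
struct Media { id: u32 }
struct Person { id: u32 }

// Hypothetical: one metadata-provider call returning everyone credited on a title.
fn fetch_credits(media: &Media) -> Vec<Person> {
    vec![Person { id: media.id * 10 }, Person { id: media.id * 10 + 1 }]
}

// Hypothetical: write all of one title's person associations in a single batch.
fn upsert_people(media: &Media, people: &[Person]) {
    for person in people {
        println!("media {} <- person {}", media.id, person.id);
    }
}

fn main() {
    let library: Vec<Media> = (1..=3).map(|id| Media { id }).collect();

    // One unit of work per media item instead of one job per person: each
    // credits response already contains every person attached to that title,
    // so the number of fetches scales with the media count, not the people count.
    for media in &library {
        let people = fetch_credits(media);
        upsert_people(media, &people);
    }
}
```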

Alternatively, or also, reduce or remove (or give an option to) the job rate limiting. 1ms update jobs would potentially be 1000 times faster without the 1s rate limiting (though I'm assuming there are reasons you rate limited them).

thespad commented 12 months ago

OK, so jobs just finished

ryot  | 2023-10-29T18:10:43.066644Z TRACE ryot::background: Started job: "AssociatePersonWithMetadata"
ryot  | 2023-10-29T18:10:43.068349Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 1ms, Successful = true

So ~5 hours. Interestingly: image

Pretty much exactly when those jobs stopped, the CPU spiked back up again, and there's nothing in the logs that matches up with it:

ryot  | 2023-10-29T18:10:43.068349Z TRACE ryot::background: Job: "AssociatePersonWithMetadata", Time Taken: 1ms, Successful = true
ryot  | 2023-10-29T18:32:26.457219Z ERROR ryot::routes: Error { message: "Ignoring non matching user \"xxxxxx\"", extensions: None }

IgnisDa commented 12 months ago

Are you aware of any tools that can help me debug this? Admittedly I've never had to debug CPU usage before.

thespad commented 12 months ago

I don't know much about the Rust ecosystem but perhaps something like https://github.com/flamegraph-rs/flamegraph

IgnisDa commented 12 months ago

Alternatively, or also, reduce or remove (or give an option to) the job rate limiting

This is already possible using the SCHEDULER_RATE_LIMIT_NUM env variable (default 5).

thespad commented 12 months ago

I upped the limit to 50, but it doesn't seem to make a difference to this issue (though it does dramatically speed up a manual metadata refresh); in any case, there aren't any running tasks (according to the logs) while the CPU usage is high.

The pattern seems to be that after each night's DB maintenance tasks, the CPU usage starts creeping up again and then suddenly spikes; if whatever it's doing is still going on come the next night, then once the DB maintenance completes, the CPU in the Ryot container resets back down.

image

IgnisDa commented 12 months ago

Is the timezone on this graph synced to your local time?

thespad commented 12 months ago

The times are UTC, which is also my local timezone right now (GMT).

IgnisDa commented 12 months ago

The screenshot shows the spikes, but what number do they reach?

thespad commented 12 months ago

On that graph the DB peaks at 100%, Ryot at 80%, and the lower spike is at 50%. This is an 8-core box, though, so in the grand scheme of things Ryot is only peaking at ~15%, but that's nearly doubling the normal CPU load of the host.

IgnisDa commented 12 months ago

Alright I'll try to find other ways to debug it.