IgnisDa / ryot

Roll your own tracker!
https://ryot.io
GNU General Public License v3.0

High CPU usage #440

Closed · ellsclytn closed this issue 8 months ago

ellsclytn commented 1 year ago

I've noticed that my Ryot container enters a state of high CPU usage daily.

[screenshot: container CPU usage over a 48-hour period]

The above shows CPU usage over a 48h period. It seems to recover temporarily around 00:00, which leads me to believe it's something happening within one of the daily background jobs, but I'm just not sure which one as of yet.

The biggest spike seems to happen around 03:23-03:26.

[screenshot: close-up of the CPU spike around 03:23-03:26]

I looked at the logs (where I've set RUST_LOG=ryot=trace,sea_orm=debug), but there doesn't seem to be much to look at beyond a lot of SQL queries.

2023-10-25T03:23:39.161607Z DEBUG sea_orm::driver::sqlx_postgres: SELECT "collection_to_entity"."id", "collection_to_entity"."last_updated_on", "collection_to_entity"."collection_id", "collection_to_entity"."metadata_id", "collection_to_entity"."person_id", "collection_to_entity"."metadata_group_id", "collection_to_entity"."exercise_id" FROM "collection_to_entity" WHERE "collection_to_entity"."collection_id" = 2 AND "collection_to_entity"."metadata_id" = 479 LIMIT 1
2023-10-25T03:23:39.161894Z DEBUG sea_orm::driver::sqlx_postgres: UPDATE "collection_to_entity" SET "last_updated_on" = '2023-10-25 03:23:39 +00:00' WHERE "collection_to_entity"."id" = 10 RETURNING "id", "last_updated_on", "collection_id", "metadata_id", "person_id", "metadata_group_id", "exercise_id"
2023-10-25T03:23:39.163085Z DEBUG sea_orm::driver::sqlx_postgres: SELECT "user_to_entity"."id", "user_to_entity"."last_updated_on", "user_to_entity"."user_id", "user_to_entity"."num_times_interacted", "user_to_entity"."metadata_id", "user_to_entity"."exercise_id", "user_to_entity"."metadata_monitored", "user_to_entity"."metadata_reminder", "user_to_entity"."exercise_extra_information" FROM "user_to_entity" WHERE "user_to_entity"."user_id" = 1 AND "user_to_entity"."metadata_id" = 479 LIMIT 1
2023-10-25T03:23:39.164215Z TRACE ryot::background: Job: "AfterMediaSeen", Time Taken: 28ms, Successful = true
2023-10-25T04:00:00.003823Z TRACE ryot::background: Getting data from yanked integrations for all users
2023-10-25T04:00:00.004961Z DEBUG sea_orm::driver::sqlx_postgres: SELECT "user"."id" FROM "user" WHERE "user"."yank_integrations" IS NOT NULL

These are the last few lines of logs between 2023-10-25T03:23:00 and 2023-10-25T04:26:00.

I'm wondering if anyone is experiencing or has experienced something similar, or might have some further leads I could explore.

alexk7110 commented 11 months ago

Every night at 00:00, first the ryot container and then the Postgres one spike the CPU to 100%. The first few nights I waited for more than 4 hours for things to calm down, but that never happened. I have a significant collection of People in my database, and I'm wondering if ryot tries to refresh those every night. If so, a setting to disable it would be nice, or some different logic on the search. I happen to host on a NUC that gets really loud, which is what made me wonder what was causing it.

[screenshot: CPU usage spiking at midnight]

P.S. I cannot enter the ryot container with docker exec -it; I know there is no bash in that tiny container. Is there any way to access the container at runtime?
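
(A minimal sketch of ways to inspect a shell-less container from the host, assuming the container is named ryot; nicolaka/netshoot is just one commonly used debug image, not something Ryot ships.)

# Observe the container's processes from the host, no shell required inside the image:
docker top ryot

# Or attach a throwaway debug container that shares ryot's PID and network namespaces:
docker run --rm -it --pid=container:ryot --network=container:ryot nicolaka/netshoot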

IgnisDa commented 11 months ago

@alexk7110 can you try with the RUST_LOG=ryot=trace env variable? Ryot refetches data about people every 30 days, so I don't think that's the problem.
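
For example, a hedged sketch of setting the variable with docker run (the image name and other flags are placeholders for your existing setup, not the documented deployment):

# Start the container with trace-level logging for the ryot crate only
# (replace <ryot-image> and the port mapping with whatever you already use):
docker run -d --name ryot -e RUST_LOG=ryot=trace -p 8000:8000 <ryot-image>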

alexk7110 commented 11 months ago

I'll post the log later tonight when it happens again.

alexk7110 commented 11 months ago

This is how far the log goes, from the moment I switched on the dedicated VM today to the point where it reaches 100% CPU:

2023-11-07T13:58:50.958261Z  INFO ryot: Running version: 3.4.4
2023-11-07T13:58:51.052455Z  INFO ryot: Using timezone: Europe/Athens
2023-11-07T13:58:51.061479Z  INFO ryot: Listening on: [::]:8000
2023-11-07T14:00:00.001469Z TRACE ryot::background: Getting data from yanked integrations for all users
2023-11-07T16:00:00.001088Z TRACE ryot::background: Getting data from yanked integrations for all users
2023-11-07T18:00:00.000967Z TRACE ryot::background: Getting data from yanked integrations for all users
2023-11-07T20:00:00.001533Z TRACE ryot::background: Getting data from yanked integrations for all users
2023-11-07T22:00:00.001675Z TRACE ryot::background: Getting data from yanked integrations for all users
2023-11-07T22:00:00.001710Z TRACE ryot::background: Invalidating invalid media import jobs
2023-11-07T22:00:00.002238Z TRACE ryot::background: Cleaning up user and metadata association
2023-11-07T22:00:00.028416Z TRACE ryot::miscellaneous::resolver: Cleaning up media items without associated user activities
2023-11-07T22:00:01.584229Z TRACE ryot::miscellaneous::resolver: Cleaning up genres without associated metadata
2023-11-07T22:00:01.623301Z TRACE ryot::miscellaneous::resolver: Cleaning up people without associated metadata
2023-11-07T22:00:03.164156Z TRACE ryot::background: Removing old user summaries and regenerating them
2023-11-07T22:05:28.046637Z TRACE ryot::miscellaneous::resolver: Cleaning up partial metadata without associated metadata

Here's a Grafana chart of Postgres commits; after the initial spike there's steady usage that doesn't seem to stop.

[screenshot: Grafana chart of Postgres commits]

alexk7110 commented 11 months ago

I'm not sure if this will help, but what appears to be keeping Postgres busy is a rapid-fire of the following SELECT statement, which I see on the DB after the first 5 minutes of normal updates:

SELECT "partial_metadata"."id", "partial_metadata"."identifier", "partial_metadata"."title", "partial_metadata"."image", "partial_metadata"."lot", "partial_metadata"."source", "partial_metadata"."metadata_id" FROM "partial_metadata"

IgnisDa commented 11 months ago

What do you mean by "rapid-fire" here? Can you tell me at what interval it is running, for how long, and possibly share the ryot logs during that time?

alexk7110 commented 11 months ago

The ryot log is not producing anything after the following line:

TRACE ryot::miscellaneous::resolver: Cleaning up partial metadata without associated metadata

Running the top command on the Postgres container, I see 2 concurrent SELECT statements like the one above that rapidly come and go and keep the system busy. I don't know how to measure the interval; as for the duration, this keeps happening for hours until I manually stop the ryot container because of the fan noise it generates. Anyhow, I've put a systemd timer in place to stop the container at 23:58 and start it again in the morning, just to circumvent the unnecessary power draw. If I'm the only one having this issue, it might be a me problem.
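
One rough way to measure how often those statements hit the database is to sample pg_stat_activity from the host. This is only a sketch; the container name, user, and database name below are assumptions:

# Print the currently active statements once per second to gauge how fast they recur:
while true; do
  docker exec postgres psql -U postgres -d ryot -c \
    "SELECT pid, state, left(query, 80) AS query FROM pg_stat_activity WHERE state = 'active';"
  sleep 1
done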

thespad commented 11 months ago

I'm not seeing any ongoing DB queries once the scheduled jobs complete, just the ryot container CPU usage staying high.

ebiagi commented 11 months ago

The ryot log is not producing anything after the following line:

  • TRACE ryot::miscellaneous::resolver: Cleaning up partial metadata without associated metadata

Running the top command on the Postgres container, I see 2 concurrent SELECT statements like the one above that rapidly come and go and keep the system busy. I don't know how to measure the interval; as for the duration, this keeps happening for hours until I manually stop the ryot container because of the fan noise it generates. Anyhow, I've put a systemd timer in place to stop the container at 23:58 and start it again in the morning, just to circumvent the unnecessary power draw. If I'm the only one having this issue, it might be a me problem.

+1, same problems

DB machine: here's a screenshot of the Postgres transactions per second, skyrocketing around midnight for about half an hour.

[screenshot: Postgres transactions per second]

This behaviour seems to have started only in the last couple of days, jumping from 20/s to 170/s.

[screenshot: transactions per second over the last couple of days]

Docker machine: here's the CPU drop after I restarted the Docker image.

[screenshot: CPU drop after restart]

thespad commented 11 months ago

The midnight DB transaction spikes are expected since it's running its scheduled tasks; it's the sustained CPU usage in the ryot container afterwards that seems to be the anomaly.

IgnisDa commented 11 months ago

I haven't been able to debug this because I have not been able to set up observability on my self-hosted machine. I will look into it once I get time to set up Grafana etc. Until then, I would appreciate whatever insights you have.

thespad commented 11 months ago

As far as I can tell, whatever the issue is, it's triggered by the nightly jobs; you can see below that following each nightly spike on the DB there's a rapid climb in the CPU utilisation of the ryot container:

[screenshot: nightly DB spikes followed by rising ryot container CPU]

Sometimes this continues until the next night's DB jobs, sometimes it finishes sooner. There's never anything in the container logs that indicates ongoing tasks and the DB itself is pretty much idle during this time so it doesn't seem to be related to DB queries.

Restarting the container resets the CPU utilisation until the next night's scheduled jobs run, at which point the process repeats.

IgnisDa commented 11 months ago

During Ryot's "downtime", what is the CPU usage? The scale of the graph you posted says it's 0%, but it should be more.

thespad commented 11 months ago

It's not zero but it's negligible:

[screenshot: ryot container CPU during downtime]

Edit: for context that's from this section of the graph:

[screenshot: the corresponding section of the CPU graph]

ebiagi commented 11 months ago

The midnight DB transaction spikes are expected since it's running its scheduled tasks; it's the sustained CPU usage in the ryot container afterwards that seems to be the anomaly.

That is true, but it's kind of strange that it lasts 30 minutes; that's a lot of time.

[screenshot]

IgnisDa commented 11 months ago

Until this is solved, I would suggest people follow what @alexk7110 does (https://github.com/IgnisDa/ryot/issues/440#issuecomment-1805014041) if the fan noise/power draw caused by Ryot is too much.

Anyhow, I've put a systemd timer in place to stop the container at 23:58 and start it again in the morning, just to circumvent the unnecessary power draw.
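
In shell terms the workaround amounts to the two commands below (container name ryot assumed); alexk7110 schedules them with a systemd timer, but cron would work just as well:

# Run shortly before the midnight jobs kick in, e.g. at 23:58:
docker stop ryot
# Run again in the morning to bring the tracker back up:
docker start ryot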

IgnisDa commented 11 months ago

@ebiagi @thespad What monitoring solutions are you using? It would be great if you could link them. The one @ellsclytn linked is too complicated for me to set up 😅.

thespad commented 11 months ago

At a very basic level, just cadvisor:

cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    container_name: cadvisor
    command:
      - '--docker_only=true'
      - '--disable_metrics=disk,tcp,udp,percpu,sched,process'
      - '--housekeeping_interval=60s'
      - '--store_container_labels=false'
    ports:
      - 8080:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    devices:
      - /dev/kmsg:/dev/kmsg
    restart: unless-stopped
    security_opt:
      - no-new-privileges=true

I do then pull it into Prometheus and display it via a Grafana dashboard like https://grafana.com/grafana/dashboards/193-docker-monitoring/, but you could just use the cadvisor UI (it's a bit clunky).

IgnisDa commented 11 months ago

What does your prometheus config look like?

ebiagi commented 11 months ago

@ebiagi @thespad What monitoring solutions are you using? It would be great if you could link them. The one @ellsclytn linked is too complicated for me to set up 😅.

I'm using a local Netdata instance; with only a little configuration you can monitor a lot of stuff.

thespad commented 11 months ago

Very basic:

scrape_configs:
  - job_name: 'cadvisor'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          group: 'cadvisor'
          instance: '<hostname>'

IgnisDa commented 11 months ago

@ellsclytn thanks for the suggestion, it was really easy to set up.

IgnisDa commented 11 months ago

I have observed the same behavior on my machine, so fortunately I can start debugging on my end.

@thespad Looking at your graphs, are the screenshots you attached for the ryot container or the ryot Postgres one? I am observing the errant usage only on the database. The application container reaches just 8.5% for 2 minutes, while the database one is sustained at 97% for around 2.5 hrs.

Is it the same for you?

thespad commented 11 months ago

My screenshots included both: the big, short spikes are the DB container at midnight each night, typically for around 30-45 minutes. The long, sustained CPU is the ryot container itself, and it can last 24 hours (I assume it would last longer, but whatever scheduled processes trigger at midnight seem to reset it).

ebiagi commented 11 months ago

My screenshots included both: the big, short spikes are the DB container at midnight each night, typically for around 30-45 minutes. The long, sustained CPU is the ryot container itself, and it can last 24 hours (I assume it would last longer, but whatever scheduled processes trigger at midnight seem to reset it).

It lasts longer: the CPU of my Docker machine stayed above 30% usage for 2 days before I stopped the container and scheduled a restart every night.

thespad commented 11 months ago

Last 30 days of CPU for just the ryot container; most of the sustained dips are where the container has been updated (and thus restarted).

[screenshot: last 30 days of ryot container CPU]

jakesmorrison commented 10 months ago

I am also seeing some elevated CPU usage.

[screenshot: elevated CPU usage]

IgnisDa commented 10 months ago

Notes to self:

[screenshot]

IgnisDa commented 10 months ago

Can you all look at your graphs again for the past few days? I have noticed that memory usage on mine has dropped over the past few days.

thespad commented 10 months ago

Last 7 days of CPU and memory (the higher memory is the DB container). The two big drops are container upgrades/restarts.

[screenshot: last 7 days of CPU and memory]

IgnisDa commented 10 months ago

I removed the background cleanup jobs completely in v3.5.4. They were kinda unnecessary. Please upgrade your instances and report back if the issue is fixed.
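
For a compose-based deployment, upgrading is roughly the following (the service name ryot is an assumption about your compose file):

# Pull the new image and recreate only the ryot service:
docker compose pull ryot
docker compose up -d ryot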

thespad commented 10 months ago

So, DB CPU usage has dropped right down, but the ryot container still starts to spike at the same time and continues onward at the same level.

[screenshot: CPU usage after upgrading]

Close-up of that point:

[screenshot: close-up of the spike]

Whatever is happening is definitely being triggered by the same scheduled tasks that cause the DB CPU and memory to spike (which makes sense if they're suddenly running a bunch of queries), but then the ryot process just keeps slowly consuming CPU time. It never goes above 100%, which I'm assuming is because it's single-threaded? Although it's weird that it seems to initially cap out at 50% and then jumps to 100% the next night.
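
A quick way to sanity-check the single-threaded guess from the host is to look up the container's main PID and its thread count; this is only a sketch, with the container name ryot assumed:

# NLWP is the number of threads in the process; PCPU is its current CPU usage.
ps -o pid,nlwp,pcpu,comm -p "$(docker inspect -f '{{.State.Pid}}' ryot)"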

thespad commented 10 months ago

As an addendum, looking at a full 30 days, it looks like each "cycle" is causing an increase in RAM use by the ryot container too.

[screenshot: 30-day RAM usage of the ryot container]

Though it's not such a big issue because the base memory footprint is so small, which is why I hadn't noticed it on shorter timescales.

ellsclytn commented 10 months ago

Similar for me: CPU usage on the DB has dropped away, but Ryot itself still suffers from high CPU. Memory usage on Ryot seems good to me though (less than 40 MB).

ellsclytn commented 10 months ago

I haven't had much time to investigate this recently, but I have at least isolated it to the v2.19.0 release as the first to introduce the issue for me.

IgnisDa commented 10 months ago

I have started working on https://github.com/IgnisDa/ryot/tree/partial-metadata, which is the first step towards fixing the issue. Not sure how long this issue will take to fix since there are a lot of changes required.

Drakon74 commented 10 months ago

Sorry for bringing this back up, but I wanted to note that even the idle CPU usage is rather high.

From what I see here, the ryot container alone takes up 2-3% of CPU when just running idle. When you actually do things it can go as high as 5-10%. And that is measured on the host system.

The idle figure is what's throwing me off: I also host a couple of other applications in the same Docker environment, like a Gitea instance, FoundryVTT, HomeAssistant etc., which, when idle, usually take up 0.1-0.4%. I think Gitea is the highest at 0.8%.

So I'm not entirely sure why this particular container takes up a good chunk more CPU when idling.

However, kudos for the super low memory usage: it takes up only about 12-14 MB, which is barely anything lol. Though I would trade a slightly higher memory usage for lower CPU, to be fair :)
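
For reference, per-container idle figures like the ones quoted above can be captured with docker stats on the host; the format string below is just one way to slice the output:

# One-shot snapshot of CPU and memory usage for every running container:
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"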

IgnisDa commented 10 months ago

I already have a fix in https://github.com/IgnisDa/ryot/pull/518, but I cannot merge it because it is blocked by upstream issues. I contacted the author of remix-pwa; they do not have a date for when they will fix this.

I'm sorry, but for the time being, the only thing to do is wait this out.

IgnisDa commented 9 months ago

For everyone following this issue, we released v4.0.8-beta.2, which changes the build to use glibc. Please test with this to see if the issue is mitigated.

IgnisDa commented 9 months ago

[screenshot: CPU usage after switching to glibc]

@vnghia Switching to glibc has definitely helped. Earlier it would spike up at midnight and stay there. Now the CPU usage spikes up for ~4 hours every night and then falls back down to acceptable levels.

thespad commented 9 months ago

It was looking promising until the most recent release (v4.0.13)

[screenshot: CPU and RAM after v4.0.13]

But it seems that not only is the CPU spiking again, it's now also matched by a huge RAM spike.

ellsclytn commented 9 months ago

Unfortunately the shape is still broadly similar for me on v4.0.8-beta.2.

[screenshots: resource usage graphs on v4.0.8-beta.2]

Something else I noticed is that Ryot is performing a disproportionate number of DNS lookups, almost entirely to api.themoviedb.org.

[screenshot: Adguard DNS query statistics]

I was able to identify this because I have my Ryot container set to use my Adguard server via its own (container) IP. The 98.07% figure reflects Ryot producing 98.07% of the DNS requests across my entire home network + server stack.

IgnisDa commented 9 months ago

Please upgrade to the latest Ryot since these changes have already been included in stable. I'll keep looking for a fix.

thespad commented 9 months ago

Unfortunately the standard Linux DNS resolvers are bad and dumb and don't cache requests, so if you make 500 calls to a site, you're also making 500 DNS requests, and it soon adds up. My workaround in the past has been a cron job to fetch the DNS records and add them to the container's hosts file so that it doesn't have to do a lookup every time. Something like:

#! /bin/bash

# Refresh the api.themoviedb.org entries in the hosts file so repeated lookups are served locally.
cp /etc/hosts /tmp/hosts.new && sed -i '/api.themoviedb.org/d' /tmp/hosts.new && /usr/bin/dig +short api.themoviedb.org | while read -r line; do echo "$line api.themoviedb.org" >> /tmp/hosts.new; done && cp /tmp/hosts.new /etc/hosts

echo "Last Updated $(date +'%Y-%m-%d %T')" > /config/dnsupdate.log

IgnisDa commented 9 months ago

Hmm this looks interesting @thespad. Do you think I should include this in the docker image?

thespad commented 9 months ago

It's tricky because it's the kind of thing people have Very Strong Opinions about, but personally my view is that if you know you're going to be making a ton of connections to a given domain, it's probably wise to try and reduce the network load it generates.

The TTL on the api.themoviedb.org record is the (IMO) very silly AWS default of 60 seconds, but in practice you could run the cron job every ~15 minutes and probably be safe. Even if you ran it every minute you'd still only generate 1440 DNS queries a day instead of potentially millions.
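
For reference, the record's TTL can be checked directly with dig; the second column of each answer line is the remaining TTL in seconds:

# Show only the answer section, including TTLs, for the API host:
dig +noall +answer api.themoviedb.org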

ellsclytn commented 9 months ago

[screenshot: CPU usage after the recent releases]

Looks like the recent releases have improved the resource usage quite a bit, nice work! It still sits around 14% CPU once midnight passes, but that's absolutely a huge improvement.

IgnisDa commented 9 months ago

Yep, I'm pretty sure I know which query is hogging the remaining CPU. I should get around to it by mid-Feb.

IgnisDa commented 8 months ago

@ellsclytn Could you upgrade to the latest version and see if this has been fixed?

ellsclytn commented 8 months ago

[screenshot]

It doesn't look to have made a huge impact, unfortunately. I also noticed I seem to be unable to regenerate summaries, which makes me wonder if the application is getting stuck in a loop somewhere.