denoland / deno

A modern runtime for JavaScript and TypeScript.
https://deno.com
MIT License

100% CPU usage #23033

Open oscarotero opened 7 months ago

oscarotero commented 7 months ago

I have a server running Deno, and every few days the server gets stuck at 100% CPU usage:

[image]

Restarting the process fixes the problem. I found several bug reports about Deno using 100% CPU, so I'm not sure whether Deno is intended to be used as a server in production. Is there any way to debug this to find the cause?

According to the server graphs, it's not something gradual: the CPU usage jumps suddenly, and there isn't anything suspicious in other graphs like disk, network, etc.

[image]

Version: Deno 1.41.0

jtoppine commented 7 months ago

It's a long shot, but does your server happen to be on AWS?

Asking because I had the exact same problem for a couple of days about a week ago (same symptoms, that is). Deno had been running very reliably for months, then suddenly there were these 100% CPU freezes once or twice a day. Indeed, such an issue is very hard, or perhaps impossible, to debug.

I have never encountered such problems with Deno before - maybe with LSP, but never in a server application.

I enabled some additional debug-level logging in my server app, and the conclusion was... that there was no conclusion. The freezes did not coincide with any particularly complex or demanding requests, high load, weird spam requests, or anything else; it appeared as if the issue triggered completely randomly. That makes me suspect there might have been something wonky happening in the AWS (Lightsail) virtual environment that triggered the issue. Like, I don't know, maybe there were some internal network config changes or something on the cloud provider's side that could have somehow put the running Deno process into an unstable state. Just guessing, really.

In any case, it hasn't happened for about a week now, so I'm hopeful the issue might have fixed itself, fingers crossed. But I also upgraded Deno through all three 1.41.x minor releases during this episode, which further complicates attempts to get to the root of the issue, if the issue still exists at all, that is.

Not very helpful, I know. But it's something :)

lucsoft commented 7 months ago

Are you sure it's really a Deno issue? We had cases where a library had bad code in it and had silent errors in an internal cleanup job, resulting in a hidden while loop. Yeah, that was fun to debug.

Maybe try to leave the debugger open and when it happens again just connect via the debugger?
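To make the "hidden while loop" failure mode concrete, here is a minimal sketch of how a silently swallowed error in a cleanup job can turn into a busy loop. Everything in it (`Task`, `drainBuggy`, `drainSafe`, the `maxIterations` cap) is hypothetical and invented for illustration; it is not taken from any library mentioned in this thread.

```typescript
// Hypothetical cleanup job: every name here is invented for illustration.
type Task = () => void;

// Buggy variant: a failing task is never removed from the queue and the
// error is swallowed, so the loop retries the same task immediately.
// Without the maxIterations cap (added only so this demo terminates),
// the loop would spin forever at 100% CPU and log nothing.
function drainBuggy(queue: Task[], maxIterations: number): number {
  let iterations = 0;
  while (queue.length > 0 && iterations < maxIterations) {
    iterations++;
    try {
      const task = queue[0]!;
      task();
      queue.shift(); // only reached when the task succeeds
    } catch {
      // Silent error: nothing is logged and nothing is removed.
    }
  }
  return iterations;
}

// Safer variant: always remove the task first, then report any failure.
function drainSafe(queue: Task[], onError: (err: unknown) => void): number {
  let iterations = 0;
  while (queue.length > 0) {
    iterations++;
    const task = queue.shift()!;
    try {
      task();
    } catch (err) {
      onError(err);
    }
  }
  return iterations;
}

// A task that always throws, standing in for the broken cleanup step.
const failing: Task = () => {
  throw new Error("cleanup failed");
};
```

With `failing` at the head of the queue, `drainBuggy` uses up every iteration it is allowed, while `drainSafe` visits each task once and hands the error to the caller. If a process is stuck in a loop like the buggy one, attaching a debugger and pausing execution should land inside that loop.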

jtoppine commented 7 months ago

> Are you sure it's really a Deno issue? We had cases where a library had bad code in it and had silent errors in an internal cleanup job, resulting in a hidden while loop. Yeah, that was fun to debug.

Here's a thought: since OP and I probably ran into the same issue, what if we checked whether we have some libraries in common? That could help identify those dependencies as possible suspects.

Here's what I have outside of private code or Deno std, it's not much:

npm:mongodb@6.5.0
npm:deepl-node@1.12.0
https://deno.land/x/s3@0.5.0/
https://deno.land/x/xml2js@1.0.0/
https://deno.land/x/sqlite@v3.8/

oscarotero commented 7 months ago

In my case, I'm using systemd to run the Deno server. I just increased the number of file descriptors from 1024 (the default value) to 65535, hoping that this was the cause of the issue.

I'll report here if it happens again.

lucsoft commented 7 months ago

This sounds like pure snake oil :D Try to actually measure it. If you see it going to 100% CPU again, capture a process checkpoint or, as I said, use a debugger.

guy-borderless commented 7 months ago

Did you rule out garbage collection causing the spike?
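One way to gather evidence for or against the garbage collection theory is to log memory figures periodically and compare them with the time of the next spike. The sketch below is only an assumption about how that could be done: `MemorySample`, `formatSample`, and `startMemoryLog` are hypothetical names, and the memory reader is passed in as a function so the snippet does not depend on any particular runtime. In Deno, `Deno.memoryUsage()` returns an object with these fields.

```typescript
// Hypothetical helper names; only the field names mirror Deno.memoryUsage().
interface MemorySample {
  rss: number; // resident set size, in bytes
  heapTotal: number; // total heap size, in bytes
  heapUsed: number; // heap actually in use, in bytes
}

// Turn one sample into a single log line with sizes in megabytes.
function formatSample(sample: MemorySample, now: Date): string {
  const mb = (bytes: number): string => (bytes / (1024 * 1024)).toFixed(1);
  return `${now.toISOString()} rss=${mb(sample.rss)}MB ` +
    `heapUsed=${mb(sample.heapUsed)}MB heapTotal=${mb(sample.heapTotal)}MB`;
}

// Log a sample every intervalMs milliseconds. Returns the timer handle so
// the caller can stop logging with clearInterval().
function startMemoryLog(
  read: () => MemorySample,
  intervalMs: number,
  log: (line: string) => void,
): ReturnType<typeof setInterval> {
  return setInterval(() => log(formatSample(read(), new Date())), intervalMs);
}

// Possible wiring in a Deno server (not executed here):
//   startMemoryLog(() => Deno.memoryUsage(), 60_000, console.log);
```

If the heap figures climb steadily before each freeze, memory pressure becomes a plausible suspect; if they stay flat, it becomes less likely. Note that the timer cannot fire while the event loop is blocked, so the log would simply go quiet during a freeze; the last lines before the silence are the interesting ones.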

oscarotero commented 7 months ago

It happened again yesterday, so the cause is not file descriptors. I'm not an SRE or anything like that, so I have no idea how to profile this bug.

My server uses a file watcher to detect changes in the files. I don't know if this could be related, although no changes happened before the spike.

lucsoft commented 7 months ago

@oscarotero did you look into attaching a debugger to it?

oscarotero commented 7 months ago

@lucsoft how can I do that? My server was configured with this script: https://github.com/lumeland/cms-deploy/blob/main/install.sh

But I'm planning to change to this one: https://github.com/lumeland/cms-deploy/blob/main/install-caddy.sh that uses Caddy as a reverse proxy.

Right now I'm working on a cron job to restart the process if the CPU usage is above 95%.

bartlomieju commented 5 months ago

Hi @oscarotero, do you still experience this problem? Is there some reproduction code that we could try on our end? As for profiling, I think the best approach here would be to run with the --inspect flag. Then you can connect Chrome DevTools to your running instance and take some performance profiles. This article gives a good rundown on taking such profiles.

oscarotero commented 5 months ago

@bartlomieju Profiling is not easy, because it's a script running forever on a server.

My use case is a VPS running a Deno script that opens a web server. It works fine for days (or weeks), but suddenly the CPU goes to 100% and the server stops responding until I restart the script. The script not only starts a server but also uses Deno.watchFs to watch for changes and reload the site, so maybe the issue is related to the watcher.

I worked around it by creating a cron job that restarts the service if the CPU usage reaches 99% (you can see it here).
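The decision logic behind a watchdog like this can be sketched as follows. This is not the linked cron script: `shouldRestart` and its parameters are hypothetical, and both collecting the CPU samples and restarting the service are deliberately left out.

```typescript
// Hypothetical watchdog rule: restart only when the most recent `required`
// CPU samples (in percent) are all at or above the threshold.
function shouldRestart(
  samples: number[],
  thresholdPercent: number,
  required: number,
): boolean {
  // Not enough history yet (or a nonsensical `required`): do nothing.
  if (required <= 0 || samples.length < required) return false;
  return samples.slice(-required).every((s) => s >= thresholdPercent);
}

// Example: three samples in a row at or above 95% trigger a restart,
// but a single spike followed by a normal reading does not.
const stuck = shouldRestart([10, 99, 100, 99], 95, 3); // true
const spike = shouldRestart([99, 100, 20], 95, 3); // false
```

Requiring several consecutive high samples is a design choice that avoids restarting the server during a short, legitimate burst of work. A restart of this kind hides the symptom but says nothing about the cause.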