Azure / Azurite

A lightweight server clone of Azure Storage that simulates most of the commands supported by it with minimal dependencies
MIT License

Extension causes high cpu load #939

Open Sti2nd opened 3 years ago

Sti2nd commented 3 years ago

Azurite.azurite-unresponsive.cpuprofile.txt

Find more details here: https://github.com/microsoft/vscode/wiki/Explain-extension-causes-high-cpu-load

blueww commented 3 years ago

@Sti2nd

Thanks for reporting the issue! Would you please share what operation you have done with Azurite to repro this?

Sti2nd commented 3 years ago

@blueww I just opened VS Code! I simply had the Azurite extension installed; I didn't do anything with the extension at the time 😅

edwin-huber commented 3 years ago

Hi, it looks to me like it was spending quite a bit of time with module loading. I have a couple of clarifying questions, if that is OK?

  1. Were there any symptoms, i.e. was the system slow to respond while things loaded, or was this just reported by VSCode?

  2. Were you doing other work, or was this in a virtualized machine with limited IO (network or disk)?

Sti2nd commented 3 years ago

Hi, sorry for the delay, here are some answers:

  1. It could be that the entire system was slow at that time, yes. And yes, this was reported by VS Code. No other extensions were reported.
  2. I was not doing any other work, as far as I can remember. Other programs were open, but I don't think they were using many resources (apart from the usual resource stealers like antivirus and Windows Update). This was VS Code running directly on my machine, no virtualization or remote.

edwin-huber commented 3 years ago

Thanks for getting back to us.
I am fairly sure we aren't, but we need to validate that we are not loading an overly large number of modules.
As this is not a systematic issue, I assume it is a symptom of a system under load from somewhere else, and that slow IO is why VS Code thinks module loading is the problem.
We shall put our heads together and see if there is a way to get more certainty on this.

ddzialak commented 2 years ago

I have observed a similar issue with a Docker image we use for testing. For daily usage it works perfectly, but after the weekend (we run a lot of semi-long tests over the weekend) the container gets into trouble: requests start failing (usually due to timeouts) and one CPU runs at 100%. Our tests usually do basic operations: create a new bucket, create a few files (a few tests create a bit more; one creates around 250 files and another 10k files), then read back what was stored, sometimes overwrite data, and finally delete the bucket.

azurite version: 3.17.1
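
Roughly, one test cycle looks like the sketch below. This is illustrative only, not our actual test code; it assumes the @azure/storage-blob SDK (our tests may use a different client) and Azurite's default devstoreaccount1 blob endpoint.

import { BlobServiceClient } from "@azure/storage-blob";

// Well-known Azurite dev account connection string (assumed default port 10000).
const connectionString =
  "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;" +
  "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;" +
  "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;";

async function runTestCycle(containerName: string, fileCount: number): Promise<void> {
  const service = BlobServiceClient.fromConnectionString(connectionString);
  const container = service.getContainerClient(containerName);

  await container.createIfNotExists();                      // create new bucket (container)

  for (let i = 0; i < fileCount; i++) {
    const blob = container.getBlockBlobClient(`file-${i}.txt`);
    const payload = `payload ${i}`;
    await blob.upload(payload, Buffer.byteLength(payload)); // create file
    await blob.downloadToBuffer();                          // read back what was stored
    await blob.upload(payload, Buffer.byteLength(payload)); // sometimes overwrite
  }

  await container.delete();                                 // after all, delete the bucket
}

runTestCycle("test-bucket", 250).catch(console.error);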

We configured Azurite with debug logging and found suspicious log entries about "operations.length":

2022-07-03T07:54:46.754Z d4589bc5-9168-48ed-9466-0d54216433ad debug: OperationQueue:execute() Current runningConcurrency:49 maxConcurrency:50 operations.length:1
2022-07-03T07:54:46.754Z d4589bc5-9168-48ed-9466-0d54216433ad debug: OperationQueue:execute() Current runningConcurrency:0 maxConcurrency:100 operations.length:0
2022-07-03T07:54:46.786Z 3a5772db-1673-4d19-9428-8531765d84dc debug: OperationQueue:execute() Current runningConcurrency:0 maxConcurrency:100 operations.length:1
2022-07-03T07:54:46.786Z 3a5772db-1673-4d19-9428-8531765d84dc debug: OperationQueue:execute() Current runningConcurrency:49 maxConcurrency:50 operations.length:0
2022-07-03T07:54:46.786Z 3a5772db-1673-4d19-9428-8531765d84dc debug: OperationQueue:execute() Current runningConcurrency:0 maxConcurrency:100 operations.length:0
2022-07-03T07:54:46.822Z 6809184a-5559-4682-abf1-f4f5be37ade4 debug: OperationQueue:execute() Current runningConcurrency:49 maxConcurrency:50 operations.length:1
2022-07-03T07:54:46.825Z 70f19825-ceaf-488d-977f-2cf74ae0f6eb debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:1
2022-07-03T07:56:15.565Z 0a4c459c-166b-40e3-af12-d4792dcf4e87 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:2
2022-07-03T07:56:34.296Z 454908b5-2163-43a6-87e4-c5ca168c61c8 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:3
2022-07-03T07:56:45.214Z 502b6861-a016-4781-8436-0b961fe85e73 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:4
2022-07-03T07:56:58.062Z cf28e6b5-6eba-422c-a6c7-d1eb8f17ca19 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:5
2022-07-03T07:57:02.891Z 1ba4442e-81a5-4e07-8b3c-d143f0136cb4 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:6
2022-07-03T07:57:07.694Z 2f486ea5-6a16-4306-ae26-510cefcd6cb2 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:7
2022-07-03T07:57:13.067Z a1f967c5-69d5-48a1-9694-d90d62f36f86 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:8
2022-07-03T07:57:27.160Z 6f7e2d9d-b109-4349-8d86-af44d1cd6c60 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:9
2022-07-03T07:57:30.706Z e805954e-b0a6-4a2a-8b3b-d78c32273c82 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:10
2022-07-03T07:57:42.692Z 276de80b-5a3c-4f76-8354-bb65fbbc4f2a debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:11
2022-07-03T07:57:53.850Z 0517248f-8266-430c-a447-9d9d8c524b2b debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:12
2022-07-03T07:58:01.482Z 2d5d403b-5c41-4c11-a37e-cf1d90de1ebc debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:13

2022-07-04T08:11:16.420Z a64c0b52-1d92-4331-8d88-489edea6e3f0 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:14606

The time when it started to increase corresponds to the time when one CPU started to run at 100%:

[image]

docker stop azurite and docker start azurite bring CPU usage back to normal... but the next 2 weekends caused the same situation.
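
For what it's worth, the pattern in these logs looks like a bounded-concurrency queue whose in-flight slots never free up. Below is a simplified model of that failure mode; it is not Azurite's actual OperationQueue implementation, just an illustration of why operations.length only grows once runningConcurrency is pinned at maxConcurrency.

type Operation = () => Promise<void>;

class BoundedQueue {
  private running = 0;                        // "runningConcurrency" in the logs
  private readonly pending: Operation[] = []; // "operations.length" in the logs

  constructor(private readonly maxConcurrency: number) {}

  execute(op: Operation): void {
    this.pending.push(op);
    this.drain();
  }

  private drain(): void {
    // Once `running` is pinned at maxConcurrency and those promises never
    // settle (e.g. a write piping to a dead socket), nothing ever decrements
    // `running` again, so `pending.length` can only grow.
    while (this.running < this.maxConcurrency && this.pending.length > 0) {
      const op = this.pending.shift()!;
      this.running++;
      op()
        .catch(() => { /* errors are irrelevant for this illustration */ })
        .finally(() => {
          this.running--;
          this.drain();
        });
    }
  }
}

With healthy operations the slot is released in finally() and the backlog drains; with a hung operation it never is.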

edwin-huber commented 2 years ago

Thanks for the detailed information. The longer your tests run, the larger the database / files used for persistence, and the larger the debug log. A large amount of data in Azurite means that more CPU and memory are required to serve the database and indexes used to persist data for the API emulation. This can lead to contention for resources and to symptoms like 100% CPU on some systems. Recommendations:

ddzialak commented 2 years ago

To clarify some previous information:

edwin-huber commented 2 years ago

Thanks for the clarifications. The recommendations stand for all readers at the time of writing, even if not all of them apply to you 😺

Could you explain "more hosts execute them"? Does this mean that you have a single host running Azurite, to which multiple hosts / client processes are connecting?

edwin-huber commented 2 years ago

I can't really see how you are processing these items. If the filesystem cannot keep up, this could be blocking IO and causing the issue.

To properly help determine the root cause, we would need a more complete debug log of the time when requests are building up, as well as more information about the system used to execute / host the Azurite container.

If it is an issue with IO, we could potentially disable persistent storage, and thereby avoid using the file system to maximize throughput.

ddzialak commented 2 years ago

"more hosts execute them"

We have a bunch of hosts for our tests. From Monday to Friday these hosts execute tests for pull requests (around 25 tests use Azurite) and it works fine. At the weekend, all hosts execute all tests in a cycle, so the number of executions increases significantly.

Does this mean that you have a single host running Azurite, to which multiple hosts / client processes are connecting?

Yes, one host runs a Docker container with Azurite, and each testing host uses the same endpoint. It worked for quite a long time, but after an upgrade (I'm not sure what the previous Azurite version was) it started to get into this "bad state".

If the filesystem cannot keep up

No, persistent storage is just a Docker bind mount to the host (a standard ext4 partition; it was like that before the Azurite upgrade as well).

If it is an issue with IO, we could potentially disable persistent storage, and thereby avoid using the file system to maximize throughput.

It does not seem to be related to IO performance:

[image]

ddzialak commented 2 years ago

Part of the Azurite logs from when the problem first happened: az.log.gz. The operations.length started to grow around 2022-07-03T07:56:34, and a few hours(?) later CPU usage jumped to 100% (see the previous graph from 3rd July).

ddzialak commented 2 years ago

According to the stats, CPU usage jumped to 100% around 14:25 GMT, so here is the next part of the logs: az-2022-07-03T10-15.log.gz

edwin-huber commented 2 years ago

Thanks for the data, it will take a while to sift through...

ddzialak commented 2 years ago

It happened one more time, but this time I (probably) found the root cause. Due to some network issues (which mainly caused packets to be dropped by the system while reading from a socket), the node process started to keep more and more open file descriptors, and the aforementioned operations.length started to grow and did not decrease even after the network issue was gone.

So it looks like a network issue may trigger the "bad state" without any interesting error message in the Azurite logs, most probably because the service doesn't know that the client timed out and/or closed the connection, in which case the service should also close the connection(?). My gut says that setting a reasonable socket timeout on the server side should fix the issue.
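
Something along these lines on the server side is what I have in mind (plain Node http API, timeout values are placeholders; I don't know how Azurite wires up its server internally):

import * as http from "http";
import type { Socket } from "net";

const server = http.createServer((req, res) => {
  // ... normal request handling would go here ...
  res.end("ok");
});

// Destroy sockets that go silent instead of keeping them (and whatever work
// is queued behind them) open forever. Values are illustrative, not tuned.
server.setTimeout(120_000);            // inactivity timeout per socket
server.on("timeout", (socket: Socket) => {
  socket.destroy();                    // free the fd and let the pending operation fail fast
});
server.headersTimeout = 60_000;        // max time to receive the request headers
server.requestTimeout = 300_000;       // max time to receive the entire request
server.keepAliveTimeout = 5_000;       // close idle keep-alive connections

server.listen(10000);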

edwin-huber commented 2 years ago

Thanks for following up. This matches what I find for the first requests which are "getting stuck", where the last entries show:

2022-07-03T07:54:46.823Z 6809184a-5559-4682-abf1-f4f5be37ade4 debug: FSExtentStore:appendExtent() Start writing to extent ea46d88e-d61b-484f-a37c-8e7c20b6c517 
2022-07-03T07:54:46.823Z 6809184a-5559-4682-abf1-f4f5be37ade4 debug: FSExtentStore:streamPipe() Start piping data to write stream

edwin-huber commented 2 years ago

We see that the write operation is not completing for that operation.
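
As a generic illustration of that failure mode (this is not the actual FSExtentStore code): an unguarded .pipe() only resolves on "finish", so if the source request stream stalls and is never destroyed, the promise never settles and the operation keeps its slot in the queue.

import { createWriteStream } from "fs";
import { pipeline } from "stream/promises";
import type { Readable } from "stream";

// Unguarded variant: resolves only on "finish". If `source` stalls and is never
// destroyed, this promise never settles and the operation stays "running".
function writeUnguarded(source: Readable, path: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const dest = createWriteStream(path);
    dest.on("finish", () => resolve());
    dest.on("error", reject);
    source.pipe(dest);
  });
}

// Guarded variant: pipeline() rejects (and destroys both streams) if either side
// errors or closes prematurely. A socket that stalls silently with no close event
// still needs a server-side inactivity timeout to turn the stall into an error.
async function writeGuarded(source: Readable, path: string): Promise<void> {
  await pipeline(source, createWriteStream(path));
}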

ddzialak commented 2 years ago

And it happened again... It seems that Azurite after the upgrade is not ready for long-running use, because something may cause it not to release some network resources :-(

[image]

This time Azurite was running without debug logs, so I have no debug logs from that time; after a docker restart it is functional as usual.