Sti2nd opened this issue 3 years ago
@Sti2nd
Thanks for reporting the issue! Would you please share what operations you performed with Azurite to repro this?
@blueww I just opened VS Code! I simply had Azurite installed and opened VS Code; I didn't do anything with the extension at the time 😅
Hi, it looks to me like it was spending quite a bit of time with module loading. I have a couple of clarifying questions, if that is OK?
Were there any symptoms, i.e. was the system slow to respond while things loaded, or was this just reported by VSCode?
Were you doing other work, or was this in a virtualized machine with limited IO (network or disk)?
Hi, sorry for the delay, here are some answers:
Thanks for getting back to us.
I am fairly sure that we don't, but we need to validate that we are not loading an overly large number of modules.
As this is not a systematic issue, I assume that it is a symptom of a system under load from somewhere else, and the slow IO is why VS Code thinks that the module load is the problem.
We shall put our heads together and see if there is a way to get more certainty on this.
I have observed a similar issue with a Docker image we use for testing. For daily usage it works perfectly, but after the weekend (we run a lot of semi-long tests over the weekend) the container gets into trouble: requests start failing (usually due to timeouts) and one CPU runs at 100%. Our tests mostly do basic operations: create a new bucket, create a few files (a few tests create more; one creates around 250 files and another 10k files), read back what was stored, sometimes overwrite data, and finally delete the bucket.
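For reference, a minimal sketch of that kind of test cycle (hypothetical container and blob names, not the actual test code), using @azure/storage-blob against Azurite's well-known development connection string:

```ts
import { BlobServiceClient } from "@azure/storage-blob";

// Azurite's publicly documented development account and key.
const connectionString =
  "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;" +
  "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;" +
  "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;";

async function runTestCycle(): Promise<void> {
  const service = BlobServiceClient.fromConnectionString(connectionString);
  const container = service.getContainerClient("test-run-" + Date.now());

  await container.create();                                // create a new bucket
  for (let i = 0; i < 250; i++) {                          // create a few hundred files
    const blob = container.getBlockBlobClient(`file-${i}.txt`);
    const data = `payload ${i}`;
    await blob.upload(data, Buffer.byteLength(data));
  }
  const first = container.getBlockBlobClient("file-0.txt");
  await first.downloadToBuffer();                          // read back what was stored
  await first.upload("overwritten", "overwritten".length); // sometimes overwrite data
  await container.delete();                                // and finally delete the bucket
}

runTestCycle().catch(console.error);
```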
azurite version: 3.17.1
We configured azurite with debug logging enabled and found suspicious log entries about "operations.length":
2022-07-03T07:54:46.754Z d4589bc5-9168-48ed-9466-0d54216433ad debug: OperationQueue:execute() Current runningConcurrency:49 maxConcurrency:50 operations.length:1
2022-07-03T07:54:46.754Z d4589bc5-9168-48ed-9466-0d54216433ad debug: OperationQueue:execute() Current runningConcurrency:0 maxConcurrency:100 operations.length:0
2022-07-03T07:54:46.786Z 3a5772db-1673-4d19-9428-8531765d84dc debug: OperationQueue:execute() Current runningConcurrency:0 maxConcurrency:100 operations.length:1
2022-07-03T07:54:46.786Z 3a5772db-1673-4d19-9428-8531765d84dc debug: OperationQueue:execute() Current runningConcurrency:49 maxConcurrency:50 operations.length:0
2022-07-03T07:54:46.786Z 3a5772db-1673-4d19-9428-8531765d84dc debug: OperationQueue:execute() Current runningConcurrency:0 maxConcurrency:100 operations.length:0
2022-07-03T07:54:46.822Z 6809184a-5559-4682-abf1-f4f5be37ade4 debug: OperationQueue:execute() Current runningConcurrency:49 maxConcurrency:50 operations.length:1
2022-07-03T07:54:46.825Z 70f19825-ceaf-488d-977f-2cf74ae0f6eb debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:1
2022-07-03T07:56:15.565Z 0a4c459c-166b-40e3-af12-d4792dcf4e87 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:2
2022-07-03T07:56:34.296Z 454908b5-2163-43a6-87e4-c5ca168c61c8 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:3
2022-07-03T07:56:45.214Z 502b6861-a016-4781-8436-0b961fe85e73 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:4
2022-07-03T07:56:58.062Z cf28e6b5-6eba-422c-a6c7-d1eb8f17ca19 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:5
2022-07-03T07:57:02.891Z 1ba4442e-81a5-4e07-8b3c-d143f0136cb4 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:6
2022-07-03T07:57:07.694Z 2f486ea5-6a16-4306-ae26-510cefcd6cb2 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:7
2022-07-03T07:57:13.067Z a1f967c5-69d5-48a1-9694-d90d62f36f86 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:8
2022-07-03T07:57:27.160Z 6f7e2d9d-b109-4349-8d86-af44d1cd6c60 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:9
2022-07-03T07:57:30.706Z e805954e-b0a6-4a2a-8b3b-d78c32273c82 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:10
2022-07-03T07:57:42.692Z 276de80b-5a3c-4f76-8354-bb65fbbc4f2a debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:11
2022-07-03T07:57:53.850Z 0517248f-8266-430c-a447-9d9d8c524b2b debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:12
2022-07-03T07:58:01.482Z 2d5d403b-5c41-4c11-a37e-cf1d90de1ebc debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:13
2022-07-04T08:11:16.420Z a64c0b52-1d92-4331-8d88-489edea6e3f0 debug: OperationQueue:execute() Current runningConcurrency:50 maxConcurrency:50 operations.length:14606
and the time when it started to increase corresponds to the time when one CPU started working at 100%:
docker stop azurite
and docker start azurite
cause the CPU usage to go back to normal... but the next 2 weekends caused the same situation.
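For context on those log lines, here is a minimal sketch of the kind of bounded-concurrency queue the log suggests (illustrative only, not Azurite's actual OperationQueue implementation). Once runningConcurrency reaches maxConcurrency, new operations sit in operations.length until a running one completes; if a running operation never settles, the queue only grows:

```ts
type Operation = () => Promise<void>;

class OperationQueue {
  private operations: Operation[] = [];
  private runningConcurrency = 0;

  constructor(private readonly maxConcurrency: number) {}

  execute(op: Operation): void {
    this.operations.push(op);
    console.log(
      `OperationQueue:execute() Current runningConcurrency:${this.runningConcurrency} ` +
        `maxConcurrency:${this.maxConcurrency} operations.length:${this.operations.length}`
    );
    this.drain();
  }

  private drain(): void {
    // Nothing is dequeued while runningConcurrency == maxConcurrency, so if a
    // running operation never completes (e.g. a write whose client socket
    // silently died), operations.length grows without bound.
    while (this.runningConcurrency < this.maxConcurrency && this.operations.length > 0) {
      const op = this.operations.shift()!;
      this.runningConcurrency++;
      op().finally(() => {
        this.runningConcurrency--;
        this.drain();
      });
    }
  }
}
```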
Thanks for the detailed information. The longer your tests run, the larger the database / files used for persistence, and the larger the debug log. A large amount of data in Azurite means more CPU and memory are required to serve the database and indexes used to persist data for the API emulation. This can lead to contention for resources and to symptoms like 100% CPU usage on some systems. Recommendations:
To clarify some previous information: operations.length never decreases.
Thanks for the clarifications. The recommendations stand for all readers at the time of writing, even if not all apply to you 😺
Could you explain "more hosts execute them"? Does this mean that you have a single host running Azurite, to which multiple hosts / client processes are connecting?
I can't really see how you are processing these items. If the filesystem cannot keep up, this could be blocking IO and causing the issue.
To properly help determine the root cause, we would need a more complete debug log of the time when requests are building up, as well as more information about the system used to execute / host the Azurite container.
If it is an issue with IO, we could potentially disable persistent storage, and thereby avoid using the file system to maximize throughput.
"more hosts execute them"
We have a bunch of hosts for our tests. From Monday to Friday these hosts execute tests for pull requests (around 25 tests use azurite) and it works fine. At the weekend all hosts execute all tests in a cycle, so the number of executions increases significantly.
Does this mean that you have a single host running Azurite, to which multiple hosts / client processes are connecting?
Yes, one host runs a Docker container with azurite, and each testing host uses the same endpoint. It had been working for quite a long time, but after an upgrade (I'm not sure what the previous version of azurite was) it started to get into a "bad state".
If the filesystem cannot keep up
No, persistent storage is just a Docker bind volume to the host (a standard ext4 partition; it was set up like that before the azurite upgrade as well).
If it is an issue with IO, we could potentially disable persistent storage, and thereby avoid using the file system to maximize throughput.
It doesn't seem to be related to IO performance:
Part of the azurite logs from when the problem happened the first time:
az.log.gz
The operations.length started to grow around 2022-07-03T07:56:34, and a few hours(?) later CPU usage jumped to 100% (see the previous graph from 3rd July). According to the stats, the CPU usage jumped to 100% around 14:25 GMT, so here is the next part of the logs: az-2022-07-03T10-15.log.gz
Thanks for the data, it will take a while to sift through...
It happened one more time, but this time I (probably) found the root cause.
Due to some network issues (which mainly caused some packets to be dropped by the system while reading from a socket), the node process started to keep more and more open file descriptors, and the mentioned operations.length started to grow and wasn't decreasing even after the network issue was gone.
So it looks like a network issue may trigger the "bad state", without any interesting error message in the azurite logs, most probably because the service doesn't know that the client timed out and/or closed the connection, and the service may/should also close the connection(?). My gut says that setting a reasonable socket timeout on the server side should fix the issue.
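For what it's worth, a minimal sketch of that idea on a plain Node.js HTTP server (illustrative only, not a patch against Azurite's actual server setup; the timeout values are arbitrary examples):

```ts
import * as http from "http";

const server = http.createServer((req, res) => {
  // ... handle the request ...
  res.end("ok");
});

// If no activity happens on a socket for 2 minutes, destroy it so the
// associated operation can fail and finish instead of hanging forever.
server.setTimeout(120_000, (socket) => {
  socket.destroy();
});

// Newer Node versions also expose coarser guards:
server.requestTimeout = 300_000; // max time to receive the entire request
server.headersTimeout = 60_000;  // max time to receive the request headers

server.listen(10000);
```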
Thanks for following up. This matches what I find for the first requests which are "getting stuck", where the last entries show:
2022-07-03T07:54:46.823Z 6809184a-5559-4682-abf1-f4f5be37ade4 debug: FSExtentStore:appendExtent() Start writing to extent ea46d88e-d61b-484f-a37c-8e7c20b6c517
2022-07-03T07:54:46.823Z 6809184a-5559-4682-abf1-f4f5be37ade4 debug: FSExtentStore:streamPipe() Start piping data to write stream
We see that the write never completes for that operation.
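That would fit the socket-timeout theory above: if the request stream stops delivering data and never emits 'end' or 'error', the pipe never finishes and the operation never leaves the queue. A minimal defensive sketch (illustrative only, not Azurite's actual FSExtentStore code) that makes such a pipe settle when the source dies:

```ts
import { pipeline } from "stream";
import { createWriteStream } from "fs";
import type { IncomingMessage } from "http";

// Pipe the request body to a file, but make sure the promise settles even if
// the client connection dies mid-upload.
function streamPipe(req: IncomingMessage, path: string, timeoutMs = 120_000): Promise<void> {
  return new Promise((resolve, reject) => {
    const dest = createWriteStream(path);

    // Abort the transfer if no data arrives for timeoutMs.
    req.setTimeout(timeoutMs, () => req.destroy(new Error("request body timed out")));

    // pipeline() propagates errors and premature closes from either side and
    // invokes the callback exactly once, so the operation always completes.
    pipeline(req, dest, (err) => (err ? reject(err) : resolve()));
  });
}
```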
And it happened again... it seems that azurite after the upgrade is not ready for long-running use, because something can cause it not to release some network resources :-( This time azurite was running without debug logs, so I have no debug logs from that period; after a docker restart it is functional as usual.
Azurite.azurite-unresponsive.cpuprofile.txt

Performance
- Extension: azurite 3.13.1
- OS: Windows_NT x64 10.0.19042
- VS Code: 1.58.2

Find more details here: https://github.com/microsoft/vscode/wiki/Explain-extension-causes-high-cpu-load