Azure / Azurite

A lightweight server clone of Azure Storage that simulates most of the commands supported by it with minimal dependencies
MIT License

500 Errors - Error: EMFILE: too many open files #1967

Open blperf opened 1 year ago

blperf commented 1 year ago

Which service(blob, file, queue, table) does this issue concern?

Blob

Which version of the Azurite was used?

3.20.1

Where do you get Azurite? (npm, DockerHub, NuGet, Visual Studio Code Extension)

Visual Studio

What's the Node.js version?

n/a

What problem was encountered?

```
error: ErrorMiddleware: ErrorName=Error ErrorMessage=EMFILE: too many open files, open 'c:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\Extensions\Microsoft\Azure Storage Emulator\__blobstorage__\74924663-2a08-4213-8c52-74637ab9dd77' ErrorStack="Error: EMFILE: too many open files, open 'c:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\Extensions\Microsoft\Azure Storage Emulator\__blobstorage__\74924663-2a08-4213-8c52-74637ab9dd77'"
```

Steps to reproduce the issue?

From debug log:

```
2023-04-19T00:09:59.105Z 831912d4-1317-445d-b45d-e9f6e62e8566 error: ErrorMiddleware: Received an error, fill error information to HTTP response
2023-04-19T00:09:59.105Z 831912d4-1317-445d-b45d-e9f6e62e8566 error: ErrorMiddleware: ErrorName=Error ErrorMessage=EMFILE: too many open files, open 'c:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\Extensions\Microsoft\Azure Storage Emulator\__blobstorage__\74924663-2a08-4213-8c52-74637ab9dd77' ErrorStack="Error: EMFILE: too many open files, open 'c:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\Extensions\Microsoft\Azure Storage Emulator\__blobstorage__\74924663-2a08-4213-8c52-74637ab9dd77'"
2023-04-19T00:09:59.105Z 831912d4-1317-445d-b45d-e9f6e62e8566 error: ErrorMiddleware: Set HTTP code: 500
```

Have you found a mitigation/solution?

Killing and restarting Azurite resolves the issue immediately, but I'm looking for a better solution to avoid this in the first place. Is there anything that can be done to prevent this? SAS durations etc? Any way to monitor how many files it has open to see when the limit is being approached?

I still don't fully understand the issue and didn't have much luck finding any existing documentation on it; any insight into best practices and how to avoid this would be much appreciated!

blueww commented 1 year ago

@blperf

Azurite maintains a cache of open file handles; it seems the cached item count has exceeded the OS limit. To continue the investigation, would you please share the whole Azurite debug log, from Azurite startup through a repro of this issue? (Please hide any credentials.)
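For context, a handle cache of the kind described could look roughly like the sketch below (purely illustrative, not Azurite's actual implementation); if cached handles are never evicted and closed, `open()` eventually rejects with EMFILE:

```typescript
import { open, FileHandle } from "fs/promises";

// Illustrative sketch only, not Azurite's code: a naive cache of open file
// handles. If nothing ever calls evict(), descriptors accumulate until the
// OS limit is reached and open() rejects with EMFILE.
class FileHandleCache {
  private handles = new Map<string, FileHandle>();

  async get(path: string): Promise<FileHandle> {
    let handle = this.handles.get(path);
    if (handle === undefined) {
      handle = await open(path, "r"); // the call that fails with EMFILE
      this.handles.set(path, handle);
    }
    return handle;
  }

  // Eviction must close the handle, otherwise the descriptor leaks.
  async evict(path: string): Promise<void> {
    const handle = this.handles.get(path);
    if (handle !== undefined) {
      this.handles.delete(path);
      await handle.close();
    }
  }
}
```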

blperf commented 1 year ago

@blueww

Thank you for looking into this, I appreciate the context on the cache. I've just emailed a copy of the debug log to you directly with the PII manually scrubbed. I wanted to email it as opposed to posting it publicly out of an abundance of caution just in case there was any PII I missed. If there is a better way to get the file to you directly, please let me know and I would be happy to do so.

blueww commented 1 year ago

@blperf

Thanks for the update! We have received the debug log, and will need time to look into it.

As we have never seen this issue in other customers' feedback, could you share more information for the investigation:

  1. The repro machine: OS, CPU, memory, disk
  2. The steps to repro the issue. (From the debug log, it looks like you are repeatedly downloading a big block blob; is that your repro scenario?)

Aside from that, I have tried repeatedly downloading a ~1GB block blob with the same Azurite.exe version, but I can't repro this issue.

blperf commented 1 year ago

Hi @blueww

Thank you very much for taking the time to look into this and for providing such a prompt update, I greatly appreciate it.

Yes, that is correct. I've noticed that with a ~1GB file it takes many more attempts to cause the 500 error than with a >2GB file. It seems the larger the file, the fewer downloads it takes before the 500 occurs.

I have been able to reproduce this issue consistently on two different physical machines now. Here is the information for both machines.

Machine 1
  - OS: Windows 10
  - CPU: AMD Ryzen 5 3600 (x64)
  - Memory: 16GB DDR4
  - Disk: 1TB SATA III SSD (Silicon Power)

Machine 2
  - OS: Windows 11
  - CPU: AMD Ryzen 5 5600G (x64)
  - Memory: 16GB DDR4
  - Disk: 1TB NVMe M.2 SSD (Samsung 970 EVO Plus)

It's worth noting that I'm also using metadata properties on the blob; I'm not sure whether this has any impact.

On both machines, I am using the same version of Azurite, installed directly from the Visual Studio 2022 service dependency. Can I ask which installation method of Azurite you are using, or whether you're building directly from source? I'm happy to try different installation methods (e.g. installing via npm, running from a Docker container, etc.) to see if there is a short-term workaround I can use while further investigation is ongoing.

If you have any recommendations for what possible methods might be best to start with testing to avoid the OS limitation, please let me know and I can test them out and report back.

blueww commented 1 year ago

@blperf

I can repro the issue now. It looks to be caused by downloading a big blob (I reproduced with a 3 GB blob) with a single request. We will continue to look into this issue, but it might take time to find the root cause and a fix.

Until the issue is fixed, would you consider downloading the blob range by range (e.g. each request downloads only 4MB) instead of downloading very long content with a single request? (Many Azure Storage SDKs/tools can download a blob range by range.) It's also generally not recommended to download a big blob with only one request: it fails easily, and the cost of resuming is high.
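For illustration, a range-by-range download with the JavaScript storage SDK could look like the sketch below (a sketch under assumptions: the SAS URL is a placeholder, and 4MB is just the range size suggested above):

```typescript
import { BlobClient } from "@azure/storage-blob";

const RANGE_SIZE = 4 * 1024 * 1024; // 4 MB per request, as suggested above

// Download a blob in fixed-size ranges instead of one long-lived request.
// sasUrl is a placeholder for a blob URL carrying a valid SAS token.
async function downloadInRanges(sasUrl: string): Promise<Buffer> {
  const blobClient = new BlobClient(sasUrl);
  const totalSize = (await blobClient.getProperties()).contentLength ?? 0;
  const chunks: Buffer[] = [];

  for (let offset = 0; offset < totalSize; offset += RANGE_SIZE) {
    const count = Math.min(RANGE_SIZE, totalSize - offset);
    // Each call issues one short GET for the range [offset, offset + count).
    chunks.push(await blobClient.downloadToBuffer(offset, count));
  }
  return Buffer.concat(chunks);
}
```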

blperf commented 1 year ago

@blueww

Happy to hear you were able to reproduce the issue; oftentimes that is the most difficult part. Thank you for continuing to look into this.

I will attempt to test some range options in terms of how I am pulling down the file. Currently, I am using the `BlobClient.GenerateSasUri` method from the C# SDK with `expiresOn` set to `DateTimeOffset.UtcNow.AddMinutes(30)`. I then pass the generated URI as the source of an HTML5 video tag. I'm not sure there is any way to control the range headers the browser sends when downloading the blob (at least none that I have found so far).

Do you have any other suggestions for how customers typically stream large .mp4 files? I'm certainly willing to try another approach if there is a better method for doing this. However, I don't seem to have this issue when doing things the same way against a cloud Azure Storage account, only when using the Azurite emulator, which currently makes local development difficult.

blueww commented 1 year ago

@blperf

To download a blob range, you can add an "x-ms-range" header, e.g. "x-ms-range: bytes=12582912-16777215". See these for more details: https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob?tabs=azure-ad#request-headers and https://learn.microsoft.com/en-us/rest/api/storageservices/specifying-the-range-header-for-blob-service-operations
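As an illustration (the SAS URL below is a placeholder, and Node 18+ provides `fetch` globally), a single ranged GET could look like:

```typescript
// Fetch one slice of a blob using the x-ms-range header described above.
// The SAS URL is a placeholder for a real Azurite blob URL with a SAS token.
const sasUrl = "http://127.0.0.1:10000/devstoreaccount1/mycontainer/video.mp4?<sas>";

async function downloadSlice(): Promise<ArrayBuffer> {
  const response = await fetch(sasUrl, {
    headers: { "x-ms-range": "bytes=12582912-16777215" }, // the end offset is inclusive
  });
  if (response.status !== 206) {
    throw new Error(`expected 206 Partial Content, got ${response.status}`);
  }
  return response.arrayBuffer();
}
```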

However, I am not so familiar with sending requests from a browser and streaming large .mp4 files, so I have not yet found a way to set this header on requests the browser sends.

We might need more time for investigation.

blperf commented 10 months ago

Hi @blueww, have you made any further progress on this issue? I was able to reduce how often this happens by changing the encoding of the .mp4 file, but the issue still eventually arises no matter what. Even with smaller blobs, after a blob has been opened enough times the error appears, and the only way to fix it is to restart Azurite. It seems like some kind of fundamental issue with the way Azurite interacts with the filesystem. Is there any other workaround for this? Would running on Linux prevent it from happening? Why doesn't this issue present itself with regular Azure Storage?

Thanks again for your help, it's greatly appreciated!

blueww commented 10 months ago

@blperf

I can't repro this issue with the latest Azurite now. I could repro it before by downloading a 3GB blob with a single request 1-2 times, but currently with Azurite 3.26.0 (run as the VS Code extension), I have downloaded a 3GB blob with a single request 10+ times and still can't repro it. I am using Node v18.16.0, and the OS is Windows 10 Enterprise.

Are you still able to repro it with the latest Azurite?

blperf commented 10 months ago

Hi @blueww

I have confirmed that the issue is still easily reproducible in the latest version (3.26.0). I did a fresh install of Azurite 3.26.0 with `npm install -g azurite`, then uploaded two large files (~2GB each). After reading each of the files a couple of times, I get the same 500 error. I am using the latest Node LTS as of the time of writing (v18.18.2), also on Windows 10, so a very similar environment to yours. I have emailed a complete debug.log from a first-time startup showing a complete repro of this issue in version 3.26.0. The log covers the first startup after install: creating a single container, uploading two large files (with some metadata), then reading each a couple of times until the error occurs. I started Azurite for this log with `azurite --skipApiVersionCheck -d <PathToDebugLog>`.

This thread seems very similar to the issue occurring in Azurite; it looks like the read stream created by `createReadStream` in FSExtentStore.ts is not cleaning up after itself properly: https://github.com/nodejs/node-v0.x-archive/issues/6041

Based on the logs, I suspect this may be where the issue originates in the Azurite code: https://github.com/Azure/Azurite/blob/62ba685b54ff68aff147597a45e6e947267de097/src/common/persistence/FSExtentStore.ts#L343C1-L355C12
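To make the suspected failure mode concrete, here is a distilled sketch of the pattern from that Node thread (not Azurite's actual code; the file path and port are placeholders). Piping a read stream into a response leaks the stream's descriptor when the client disconnects, unless something destroys the stream:

```typescript
import fs from "fs";
import http from "http";

// Distilled sketch of the suspected leak pattern, not Azurite's actual code.
http
  .createServer((req, res) => {
    const stream = fs.createReadStream("/path/to/large-blob"); // placeholder path
    stream.pipe(res);

    // Without a handler like this, a client that disconnects mid-download
    // leaves the read stream (and its file descriptor) open. Enough aborted
    // downloads and fs.open() starts failing with EMFILE.
    res.on("close", () => {
      stream.destroy();
    });
  })
  .listen(8080);
```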

blueww commented 10 months ago

@EmmaZhu
Would you please help look into this issue and see if you can repro it?

blperf commented 10 months ago

Hi @EmmaZhu

Please let me know if there are any additional details I can provide to help, or if you need me to forward you the log I emailed to @blueww. For reference, I am using the following C# API calls. Accessing a >2GB .mp4 file several times on this version (installed with npm) should reproduce the issue. If it makes a difference, I am accessing the file with an HTML5 video element whose `src` is set to the generated `videoUri`, but I believe you should be able to repro by simply accessing the file.

```csharp
BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient(yourContainerName);
BlobClient blobClient = containerClient.GetBlobClient(blobs.FirstOrDefault().Name);
Uri videoUri = blobClient.GenerateSasUri(BlobSasPermissions.Read, DateTimeOffset.UtcNow.AddMinutes(30));
```

Please let me know if you still can't repro after trying these steps and I will see what I can do to help repro on your end.

blperf commented 10 months ago

Hi @EmmaZhu and @blueww,

I was able to fix this issue, and have created a PR with the fix here: https://github.com/Azure/Azurite/pull/2267. Please feel free to modify this fix as appropriate for the coding standards/best practices of this repo. It would be really great to get this fix or some version of it merged so that others may benefit.

To summarize the issue: file descriptors were not getting closed after `createReadStream` was called. This happens whenever a connection is terminated by the client, so the server's response object is closed before the file descriptor is cleaned up, resulting in a leak that eventually hits the OS limit and throws the EMFILE error.

The fix is simply to catch the response object being closed in the middleware. This is a known issue, referenced here: https://github.com/nodejs/node-v0.x-archive/issues/6041#issuecomment-22644347. The fix in this PR is essentially the known fix from that thread, adapted to work with the middleware used in the Azurite codebase.

The res.on("close") is caught in ExpressMiddlewareFactory.ts. This fires every time a client terminates a connection (e.g. reloading a webpage, closing a tab etc with the link set to an Azurite SAS uri). From there, a cleanUpBlob function is introduced in BlobHandler.ts, which passes the request on to FSExtentStore in the cleanStreams function. FSExtentStore maintains a simple dictionary of existing streams according to their contextId. When a connection is terminated, the streams are then cleaned up by calling destroy, preventing the leak.

Perhaps there is an alternative to the dictionary approach, but any true fix for this issue will need to somehow clean up the streams created in FSExtentStore whenever `res.on("close")` fires. The dictionary per `contextId` seemed like the most appropriate option for this, but other approaches are welcome; a simplified sketch of the bookkeeping follows.
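A minimal sketch of that registry, with simplified hypothetical signatures (the real PR threads this through the existing middleware and handler classes):

```typescript
import { Readable } from "stream";

// Simplified sketch of the per-request stream registry described above.
class StreamRegistry {
  private streams = new Map<string, Readable[]>();

  // Called when a read stream is created for a request.
  register(contextId: string, stream: Readable): void {
    const list = this.streams.get(contextId) ?? [];
    list.push(stream);
    this.streams.set(contextId, list);
  }

  // Called from the res.on("close") handler: destroy any stream still
  // associated with the terminated request so its descriptor is released.
  cleanStreams(contextId: string): void {
    for (const stream of this.streams.get(contextId) ?? []) {
      stream.destroy();
    }
    this.streams.delete(contextId);
  }
}
```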

blperf commented 6 months ago

Hi @blueww,

Wanted to circle back on this; I haven't heard back from @EmmaZhu in a while on the comments in the PR. It seems the original fix in this PR made the situation much better, but it is not quite a perfect fix. Is @EmmaZhu still the right person to work with on this, or is there anyone else who may be available to help get this resolved? I'm doing my best to contribute where I can, but I'm not too familiar with this codebase, so I'm learning as I go. Now that we've isolated the issue, this should hopefully be a quick fix for someone who already has the right experience and context; perhaps someone on the team can help quickly squash this bug :blush:

Thank you for your help so far, I greatly appreciate it!

blueww commented 6 months ago

@blperf

Thanks for your contribution! @EmmaZhu is still working on the fix PR https://github.com/Azure/Azurite/pull/2331. However, the unit tests fail with the current changes, and we still haven't found the root cause or how to fix it. Still under investigation.

blperf commented 6 months ago

> @blperf
>
> Thanks for your contribution! @EmmaZhu is still working on the fix PR #2331. However, the unit tests fail with the current changes, and we still haven't found the root cause or how to fix it. Still under investigation.

@blueww Thank you for the update. I like the approach of the attempted fix in @EmmaZhu's PR with the LazyReadStream; I think something like this could work. Unfortunately, I pulled these changes down and was still able to reproduce the same "EMFILE: too many open files" error. It seems that even with these changes, there is still a leak somewhere. So aside from the unit test issues blocking this getting merged, I don't think this PR fully resolves the bug either 😢

EmmaZhu commented 5 months ago

Hi @blperf ,

Could you share the test case you used? I'll try to reproduce it in my environment and try to find where it leaks...

Thanks a lot for any help.

blperf commented 5 months ago

Hi @EmmaZhu,

Thanks for continuing to get to the bottom of this issue.

The repro I have for this is simple: I upload a large .mp4 file as a blob and set the `src` of an HTML video element to the URL of the blob. Simply refreshing a basic webpage with this video element enough times will cause the leak in Azurite. Eventually, it runs out of memory and Azurite crashes.

To automate this, I installed a simple browser extension that refreshes the page automatically. After about fifteen minutes or so (refreshing the page every 10 seconds), it consistently causes Azurite to crash.

This should hopefully be simple to recreate in your environment; a script-based variant is sketched below. I hope this helps!
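For what it's worth, the browser refreshing can also be approximated with a small script that starts downloads and aborts them mid-transfer (a rough sketch, not a polished test; the SAS URL is a placeholder, and Node 18+ provides `fetch` and `AbortController`):

```typescript
// Rough repro harness: repeatedly start downloading a large blob and abort
// each request mid-transfer, simulating a browser tab being refreshed/closed.
const sasUrl = "http://127.0.0.1:10000/devstoreaccount1/mycontainer/video.mp4?<sas>";

async function abortedDownload(): Promise<void> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 500); // cut the connection after 500 ms
  try {
    const response = await fetch(sasUrl, { signal: controller.signal });
    await response.arrayBuffer(); // throws once the request is aborted mid-body
  } catch {
    // Expected: the abort tears down the connection on the server side too.
  } finally {
    clearTimeout(timer);
  }
}

async function main(): Promise<void> {
  for (let i = 0; i < 1000; i++) {
    await abortedDownload(); // each aborted request can leave one descriptor open
  }
}

main();
```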

blperf commented 3 months ago

Hi @EmmaZhu and @blueww,

I was finally able to put together a complete working fix for this issue. The solution is built on top of @EmmaZhu's original PR; I added a circular buffer to clean up the streams. I tried several other implementations, but within the constraints of Node and the existing codebase, this ended up being the cleanest. I've tested it in my repro environment, and it seems to have alleviated the issue.

The latest PR for this can be found here: https://github.com/Azure/Azurite/pull/2404

This PR is based on the latest from main. I hope we can get this merged at some point so others may benefit as well; feel free to make any changes on top of this PR required to get it merged. A rough sketch of the circular-buffer idea follows.
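For readers following along, the circular-buffer idea is roughly the following (a sketch of the concept only, not the PR's exact code): keep the N most recently created read streams in a fixed-size ring, and destroy whatever stream occupied a slot before reusing it, which bounds the number of open descriptors.

```typescript
import { Readable } from "stream";

// Conceptual sketch: a fixed-size ring of recent read streams. Reusing a slot
// destroys the stream that was there, reclaiming its file descriptor.
class StreamRingBuffer {
  private slots: (Readable | undefined)[];
  private next = 0;

  constructor(capacity: number) {
    this.slots = new Array<Readable | undefined>(capacity).fill(undefined);
  }

  track(stream: Readable): void {
    const evicted = this.slots[this.next];
    if (evicted !== undefined && !evicted.destroyed) {
      evicted.destroy(); // release the descriptor held by the oldest stream
    }
    this.slots[this.next] = stream;
    this.next = (this.next + 1) % this.slots.length;
  }
}
```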