Azure / azure-functions-python-worker

Python worker for Azure Functions.
http://aka.ms/azurefunctions
MIT License

MemoryErrors Began Recently, Breaking our Python Function App #394

Closed ericdrobinson closed 5 years ago

ericdrobinson commented 5 years ago

Investigative information

Function App Overview

We built a system that processes audio files. Each input file is in a compressed format (m4a and mp3 are the most common). The main processing function uses ffmpeg, which is included as a static binary, to decompress the audio file into raw samples for processing. While the system was able to handle unpacking 90 minutes of audio, the time required for complete unpacking was unreasonable, causing the entire function to take ~4 minutes to process. Running this process locally showed that the Python process was responsible for allocating a maximum of ~1.5-2GB of memory.

We were able to reduce this dramatically by streaming the decompression. However, streaming carries enough processing overhead that "shorter" audio files actually took longer to process overall with the streaming approach, so we determined that ~10 minutes would be a fine length at which to switch modes.

To make the most of each Function App instance, we make use of the Python multiprocessing module (via ProcessPoolExecutor). We use a single Process for processing audio data with ffmpeg (which itself is multi-threaded). We then use numCPUs - 1 Processes to handle downloading audio files to prepare for the processing phase. In this way, we can download multiple audio files to the function at a time while one audio file is being processed. Attempting to process multiple files simultaneously results in worse performance overall (we suspect that the multiple instances of ffmpeg have threads that fight each other for time on the CPU cores).
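Roughly, the layout looks like this (a minimal sketch; download_audio and process_with_ffmpeg are hypothetical stand-ins, not our actual code):

import os
from concurrent.futures import ProcessPoolExecutor, as_completed

def download_audio(url: str) -> str:
    """Hypothetical stand-in: download one audio file, return its local path."""
    ...

def process_with_ffmpeg(path: str) -> None:
    """Hypothetical stand-in: decode and process one file with ffmpeg."""
    ...

def run_pipeline(urls):
    n_cpus = os.cpu_count() or 1
    # One pool process runs ffmpeg at a time; the other CPUs prefetch files.
    with ProcessPoolExecutor(max_workers=max(1, n_cpus - 1)) as downloaders, \
         ProcessPoolExecutor(max_workers=1) as processor:
        downloads = [downloaders.submit(download_audio, u) for u in urls]
        for done in as_completed(downloads):
            # Serialize processing: simultaneous ffmpeg runs fight for cores.
            processor.submit(process_with_ffmpeg, done.result()).result()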

Things Broke On Us

This system worked with no issues as of March 24th, 2019: we could queue up as many audio files of whatever size we liked and the system would churn through them. Happy with the system, we moved on to another task. As our Function App is on a Linux Consumption Plan, we left everything "as-is", expecting to return to it at a later date for integration into a larger system.

That later date was yesterday and we were surprised to find that something changed in the intervening month and broke our Function App. All of a sudden our Function App would stop responding to requests. Looking into the Monitor and Live Metrics Stream views, we noticed several things:

Previously we could rely on the Monitor section to show us logs of both successful and unsuccessful runs of the functions in our Function App. This is no longer the case.

Previously we could use the Live Metrics Stream to show when more than one server instance had spawned to handle the request load. Unless that capability is currently gone, this feature is broken.

What We've Discovered

Luckily for us, the Live Metrics Stream appears to work and provided us with a sampling of log messages that we could use to attempt to identify the source of the issue. By watching this stream and inspecting the Errors, we found the following:

  1. The first error to occur is always caused by a MemoryError exception in Python. This comes from a call to np.ascontiguousarray, which is used to convert the audio samples from 16-bit integers to 32-bit float values (a sketch of this step follows the list). For large audio files, the array here can be hundreds of MiB in size (if not over 1GiB).
  2. Every subsequent run of the function fails to start the ffmpeg process. It looks like the system is getting an OSError, indicating that it was "trying to execute a non-existent file."
  3. Once the Function App was shut down (from a timeout) and restarted by adding a request, the instances would again be able to access ffmpeg and process happily (when provided audio files at "the right size").
  4. If lots of audio file processing requests hit the Function App at once and the system breaks due to the MemoryError, the timeout occurs ~10 minutes after the break and messages may begin to process again (usually succeeding without issue).

Certain audio file sizes can trigger the MemoryError to occur 100% of the time. In some circumstances, it seems that smaller audio files can trigger the MemoryError, though it is likely related to the memory state caused by simultaneously downloading files.
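To put rough numbers on the conversion step, here's an illustrative sketch (to_float32 is not our actual function; the size math assumes a 10-minute stereo file at 44.1kHz):

import logging

import numpy as np

def to_float32(samples: np.ndarray) -> np.ndarray:
    """Convert an int16 PCM accumulator to a contiguous float32 array."""
    # 10 minutes of 44.1kHz stereo is ~52.9M samples: ~101MiB as int16 and
    # ~202MiB as float32, and both arrays are live during the copy below.
    try:
        return np.ascontiguousarray(samples, dtype=np.float32)
    except MemoryError:
        logging.error("float32 copy needs ~%d MiB free",
                      samples.size * 4 // 2**20)
        raise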

Questions

This leads us to the following questions:

  1. What changed with memory handling in Function Apps [possibly Python-specific] to cause this issue?
  2. How do we fix/work around this issue?

Repro steps

Reproduction is proprietary, but we have two ways to do it:

  1. Unpack a ~10-minute stereo audio file into 32-bit samples inside a function (specifically: with ffmpeg via the audioread module; see the sketch after this list).
  2. Use our Function App to process a queue of multiple (~50) smaller (~3:30) audio files.
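Step 1 amounts to something like the following (a sketch based on audioread's documented buffer iteration; decode_to_float32 is illustrative, not our production code):

import audioread
import numpy as np

def decode_to_float32(path: str) -> np.ndarray:
    """Decode a compressed audio file into float32 samples via ffmpeg/audioread."""
    chunks = []
    with audioread.audio_open(path) as f:
        # audioread yields raw 16-bit signed little-endian PCM buffers.
        for buf in f:
            chunks.append(np.frombuffer(buf, dtype='<i2'))
    samples = np.concatenate(chunks)  # the int16 "accumulator"
    # The int16 -> float32 copy is where the MemoryError fires: it needs a
    # second contiguous array twice the accumulator's size.
    out = np.ascontiguousarray(samples, dtype=np.float32)
    out /= 32768.0  # scale in place to [-1.0, 1.0)
    return out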

Expected behavior

The Function App continues to process audio files without interruption.

Actual behavior

The Function App fails due to a MemoryError and suddenly goes into a "Broken" state where the included ffmpeg static binary "goes missing".

Known workarounds

No known workarounds at this time.

Related information

Triggers/Bindings Used:

maiqbal11 commented 5 years ago

There's nothing that's changed on the worker side with respect to memory handling. @fabiocav, I believe this shouldn't have anything to do with functions runtime level changes either, though I do know that there are some perf improvements that might already have landed. @ahmelsayed, @balag0, @divyagandhii, the largest change that landed during this timeframe is LCOW. Is it likely that shared memory across different containers might be causing this? If so, how can we go about investigating?

ericdrobinson commented 5 years ago

I've managed to narrow the memory error down to a specific amount of memory that causes the issue. In a freshly booted Function App instance, if I add a message to the Queue that contains a request to process an audio file whose unpacked size is ~100MiB, then the MemoryError will trigger when the underlying Python code attempts to allocate a new ndarray of equivalent size (it is copying an "accumulator" array using numpy.ascontiguousarray).

ericdrobinson commented 5 years ago

Another update with more information. Back in March our tests showed that our Function App was running on a "system" with 4 [logical] CPUs (via os.cpu_count()). When running today, we have access to only 2 CPUs.

ericdrobinson commented 5 years ago

Another update. We emulated what audioread does to run ffmpeg (subprocess.Popen) to determine precisely which error was causing the ffmpeg run to fail.

We were able to verify that the Popen never completes. The Traceback points to this line of code and reports the following error:

OSError: [Errno 12] Cannot allocate memory

So. We definitely have a memory handling issue here. We'll do some more tests with the psutil library to see what the memory status of the system is, but the fact that everything breaks after allocating ~100MiB of data is a really bad sign.
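The emulation was along these lines (a sketch; the exact ffmpeg arguments are an assumption modeled on what audioread does):

import logging
import subprocess

def spawn_ffmpeg(path: str) -> subprocess.Popen:
    """Launch ffmpeg roughly the way audioread does."""
    try:
        return subprocess.Popen(
            ['ffmpeg', '-i', path, '-f', 's16le', '-'],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as exc:
        # On the broken instances this is errno 12 (ENOMEM): the fork cannot
        # commit memory for the child. The binary isn't actually missing.
        logging.error("ffmpeg failed to start: %s", exc)
        raise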

balag0 commented 5 years ago

@eric Thanks for all the information. While this may or may not be applicable to this app, for the consumption SKU a limit of 1.5 GB of memory applies. There is some more information here: https://github.com/Azure/azure-functions-host/issues/2501. We will investigate and provide an update/mitigation soon.

ericdrobinson commented 5 years ago

@balag0 I'm aware of the 1.5GB memory limitation, yes. However, isn't there something like 10GB in storage space? Wouldn't large memory allocations simply bump some memory out into swap?

This is separate from the 100MiB issue I've outlined above, of course: 100MiB doesn't even come close to the 1.5GB limit...

[Also, I'm @ericdrobinson... Apologies to the lucky Eric who snagged that username - probably gets a lot of this type of misattribution 😉]

ericdrobinson commented 5 years ago

I'm back with more information.

I decided to take a look at what's going on with the memory in the function environment. I added the psutil module and wrote a simple informational function:

import logging
import os

import psutil


def log_context_stats():
    """
    Retrieves and logs certain stats about the operating context for
    this process.
    """

    proc = psutil.Process(os.getpid())
    mem_info = proc.memory_info()

    stats = f"""Function Invocation Stats:
            Num CPUs: {os.cpu_count()}
            Virt Mem: {psutil.virtual_memory()}
            Swap Mem: {psutil.swap_memory()}
            Proc Mem: rss={mem_info.rss}, vms={mem_info.vms}
            """

    logging.info(stats)

I then called this function in two locations:

  1. Immediately upon function invocation.
  2. Immediately after catching the initial MemoryError.

Here is the output* from location 1:

Function Invocation Stats:
Num CPUs: 2
Virt Mem: svmem(
    total=1,039,429,632,
    available=240,713,728,
    percent=76.8,
    used=641,794,048,
    free=75,522,048,
    active=683,995,136,
    inactive=177,491,968,
    buffers=15,876,096,
    cached=306,237,440,
    shared=253,952,
    slab=67,407,872)
Swap Mem: sswap(total=0, used=0, free=0, percent=0, sin=0, sout=0)
Proc Mem:
    rss=144,080,896,
    vms=1,122,570,240

And here is the output* from location 2:

Function Invocation Stats:
Num CPUs: 2
Virt Mem: svmem(
    total=1,039,429,632, 
    available=39,194,624, 
    percent=96.2, 
    used=893,681,664, 
    free=79,187,968, 
    active=860,884,992, 
    inactive=22,732,800, 
    buffers=413,696, 
    cached=66,146,304, 
    shared=253,952, 
    slab=40,701,952) 
Swap Mem: sswap(total=0, used=0, free=0, percent=0, sin=0, sout=0) 
Proc Mem: 
    rss=304,037,888, 
    vms=1,570,119,680

(* Newlines and thousands separators added to numbers for readability.)

First, I should note that the swap_memory is showing 0 for everything. This might be due to the fact that the function is run in a Docker container and special settings are needed to access that information in these environments. Alternatively, it could be that someone turned off swap.

Second, what I found extremely startling is that a fresh run of the function showed only ~240MB of virtual_memory available. When the MemoryError was hit, the available memory had dropped by almost exactly 200MB (to ~39MB). Similarly, the process' virtual memory starts at ~1.1GB and is already at ~1.57GB by the time the app runs out of memory.
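For anyone else investigating, the container's enforced memory ceiling can also be read straight from the cgroup filesystem (a sketch assuming cgroup v1; under cgroup v2 the file is memory.max instead):

def cgroup_memory_limit_mib() -> float:
    """Read the container's cgroup v1 memory limit in MiB."""
    with open('/sys/fs/cgroup/memory/memory.limit_in_bytes') as f:
        return int(f.read()) / 2**20

# e.g.: logging.info("cgroup memory limit: %.0f MiB", cgroup_memory_limit_mib())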

It is perhaps valuable to recall at this point that the audio file being processed is just over the 100MiB line that causes the memory error to trigger when fully decompressed. I have no idea if the audio file is just large enough to run us out of available memory with array allocations/reallocations or if there is something otherwise special about the ~100MiB number.

Regardless, there is much about this state of affairs that is disconcerting.

While having no swap memory might help explain the situation we're experiencing, it is pretty shocking to see such high memory usage numbers on a clean start of a relatively simple function app instance.

I hope this information is helpful in tracking down the source of the issue.

balag0 commented 5 years ago

Thanks for the detailed information @ericdrobinson (got the name right this time :) ). This was very helpful.

Turns out this was due to a bad configuration setting on our side which limited the amount of memory allocated to the container. Unfortunately, it will be some time before we can fix this.

Once the configuration is fixed, you should start seeing more available memory on function startup.

As for the memory usage being so high at startup: I see the following values with a different (maybe simpler) function app: Proc Mem: rss=33562624, vms=664834048. So it is possible the additional memory usage you see is due to the nature of the function app itself.

Even with paged memory, the app still cannot exceed the 1.5 GB limit. So swap memory being 0 shouldn't matter as long as you get 1.5 GB (less memory taken by Functions runtime process).

ericdrobinson commented 5 years ago

@balag0 I'm glad to hear that a culprit was found!

On Misconfiguration

Turns out this was due to a bad configuration setting on our side which limited the amount of memory allocated to the container. Unfortunately, it will be some time before we can fix this.

Once the configuration is fixed, you should start seeing more available memory on function startup.

@balag0 Any indication as to how much memory was being withheld by this misconfiguration? This information will help set expectations so we can best plan our mitigation strategy.

On Base Memory Usage

As for the memory usage being so high at startup: I see the following values with a different (maybe simpler) function app: Proc Mem: rss=33562624, vms=664834048. So it is possible the additional memory usage you see is due to the nature of the function app itself.

I would not be surprised if yours was simpler. Our function application has two functions. Each of those uses the azure.storage.blob package. The only other packages we specifically install are librosa and audioread. Librosa brings numpy (and scipy) along with it and we make direct use of numpy in one of our functions.

In theory, none of these should account for the additional 400MB of vms that we see over your example.

We are using the ProcessPoolExecutor which will have an impact on memory as it forks the Python process when one of our function instances begins so that it can work without gumming up the works, but the number of processes we spawn is capped by the number of CPUs we have access to (2). What's more, running the function [synchronously] on a local macOS machine shows that the Python process eats all of 80MB of memory.

@balag0 During your investigation, were you able to profile a basic Python Function App to determine how much of the 1.5GB of memory is chewed up by the base Function App software stack itself? If the foundation takes up 650MB of memory then it might be a wee bit disingenuous to say that you can expect access to 1.5GB of memory when using Azure Functions with [language].

On Swap Memory / App-Local Storage

Even with paged memory, the app still cannot exceed the 1.5 GB limit. So swap memory being 0 shouldn't matter as long as you get 1.5 GB (less memory taken by Functions runtime process).

@balag0 Wait, what? That's a pretty sizable limitation there. How much app-local storage space do we have access to with a function app?

balag0 commented 5 years ago

We expect to have an additional ~600MB available to the container.

I will try to add a wiki page to the host repo once I have more information. The limits will be configured such that the function app will be able to use all of the 1.5 GB (minus whatever the functions runtime takes). We will account for the 400MB and adjust the overall limit accordingly if required.

The limit of 1.5 GB is just for the commit size, which is the same limit we already have on the Windows consumption SKU. Is your concern that there is 0 swap space available? Even if we enabled swap space, the overall limit of 1.5 GB would apply to that as well.

On Windows SKUs, apps should have at least 500MB of temp storage available to them, as documented at https://github.com/projectkudu/kudu/wiki/Understanding-the-Azure-App-Service-file-system. You can expect the same limits for Linux as well.

ericdrobinson commented 5 years ago

On Additional Memory

We expect to have an additional ~600MB available to the container.

Excellent. Thanks very much!

@balag0 Was the change that caused this issue in response to something? Was it the first time a proper limitation was actually set on Azure Functions for Linux (or in general)? I'm really curious to understand how we were able to decompress a ~2GB file without hitting memory errors back in March.

On Documenting and Handling Memory Limits

I will try to add a wiki page to the host repo once I have more information. The limits will be configured such that the function app will be able to use all of the 1.5 GB (minus whatever the functions runtime takes). We will account for the 400MB and adjust the overall limit accordingly if required.

I would love to know as soon as such a wiki page exists. Information on memory (RAM/swap) is exceedingly sparse in the current documentation. This will be extremely helpful!

On the 1.5GB Commit Size Limitation

The limit of 1.5 GB is just for the commit size, which is the same limit we already have on the Windows consumption SKU. Is your concern that there is 0 swap space available? Even if we enabled swap space, the overall limit of 1.5 GB would apply to that as well.

My concern isn't so much that there isn't swap space available, but that my default expectation was that there would be swap space available. We're an apps/software team and are used to working in environments where swap is a given. That we didn't previously run into memory issues somewhat reinforced this: "Well, I guess the memory is being swapped to disk as necessary."

It may be worthwhile mentioning the lack of swap in the documentation.

By the way, being able to watch Memory/CPU usage in the various metrics frontends would be extremely useful. Spitting out memory usage debug messages every once in a while is a pain, especially when they show up so infrequently in the Sample Telemetry output (and frequently never make it to the functions' log Monitor).

j08lue commented 5 years ago

We expect to have an additional ~600MB available to the container.

Please notify us, @balag0, when the memory allowance has been increased to 1.5 GB and Azure Functions becomes usable for us again. Or why not make it 3 GB, like λ has? ;-)

balag0 commented 5 years ago

Update - The updated memory config is available in East Asia now. It will be deployed to the rest of the regions tomorrow.

Currently the consumption SKU doesn't support > 1.5 GB workloads. App Service plan function apps should, though.

rquintino commented 5 years ago

Hi @balag0, we're experiencing a similar issue, which we think may also be related to maxing out the memory (TensorFlow, so very probably). We're using the consumption plan.

But the main question is: what is the expected behavior when this happens? What we are seeing, as described above, is that the app stops responding for very long periods of time. We don't see any error, these requests are not even counted, and they never return, so the client app stays blocked forever.

Thoughts? Thanks!

balag0 commented 5 years ago

@rquintino The expected behaviour is an out-of-memory exception if the app runs out of memory. Please open a new issue and provide details of the site name and an approximate timestamp. You can also follow https://github.com/Azure/azure-functions-host/wiki/Sharing-Your-Function-App-name-privately if required. We can investigate further. I suspect the problem lies elsewhere.