dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Rfc2898DeriveBytes severe memory spike on Android #102406

Closed borrrden closed 4 months ago

borrrden commented 5 months ago

Description

Using any form of Rfc2898DeriveBytes on .NET Android results in a severe memory spike that gets worse with each call. Here is the obnoxious part: it only happens with release builds (AOT enabled) on an aarch64 emulator (M2 Mac). I don't have an aarch64 device to see how it compares. Debug builds don't show this behavior, and neither debug nor release builds appear to show it on x86_64 emulators.

I see an almost instant jump of nearly 300 MiB of RAM. I have some tests that repeatedly call Rfc2898DeriveBytes to derive a password using PBKDF2, and the tests often die because the OS kills it due to memory pressure. I do not see this spike when running the same tests on Windows.

The odd part is that after the tests finish running the memory instantly goes back down to normal. It is reported by dumpsys meminfo as "Lost RAM" while the tests are running.

Steps

I will work on getting a project, but all I need to do to trigger the problem is call Rfc2898DeriveBytes.Pbkdf2("foo", Enumerable.Repeat((byte)1, 16).ToArray(), 64000, HashAlgorithmName.SHA, 32); in fairly rapid succession.

Configuration

I tried both .NET 8.0.105 and 8.0.300. Built on macOS 14 on an M2 Mac (arm64) and executed on a vanilla emulator with 1 GiB of memory and a 1 GiB SD card.

Regression?

I am not sure about this since I only recently started running on M2. x86_64 never did, and still does not, show the issue.

Data

I'm happy to gather any data that I can but I'm lost as to how to do it on Android.

Analysis

Without data I am unable to reason about any of this. All I can see is what I get out of dumpsys meminfo and the fact that my tests crash in xharness.

dotnet-policy-service[bot] commented 5 months ago

Tagging subscribers to this area: @dotnet/area-system-security, @bartonjs, @vcsjones. See info in area-owners.md if you want to be subscribed.

borrrden commented 4 months ago

Actually I got this to happen on x86_64 emulator as well using the attached project (compiled in release mode with AOT):

DeriveBytes.zip

Just spam click "Click Me" 3 or 4 times, then check the memory usage. On x86_64 it is reported as Used RAM instead of Lost RAM but the symptoms are the same. The more times you press the button in rapid succession, the more the memory usage goes up and stays pegged:

Example flow:

# After Launch
 Free RAM:   881,651K (  161,939K cached pss +    66,496K cached kernel +   653,216K free)
 Used RAM:   580,518K (  484,378K used pss +    96,140K kernel)
 Lost RAM:    68,487K

# Press button once
 Free RAM:   819,647K (  159,275K cached pss +    66,396K cached kernel +   593,976K free)
 Used RAM:   643,413K (  547,177K used pss +    96,236K kernel)
 Lost RAM:    67,596K

# Press button twice quickly
 Free RAM:   783,257K (  159,293K cached pss +    66,444K cached kernel +   557,520K free)
 Used RAM:   678,057K (  581,725K used pss +    96,332K kernel)
 Lost RAM:    69,342K

# Press button three times quickly
 Free RAM:   600,852K (  159,228K cached pss +    66,476K cached kernel +   375,148K free)
 Used RAM:   862,387K (  765,715K used pss +    96,672K kernel)
 Lost RAM:    67,417K

# Press button four times quickly
 Free RAM:   579,750K (  141,446K cached pss +    48,756K cached kernel +   389,548K free)
 Used RAM:   882,731K (  785,875K used pss +    96,856K kernel)
 Lost RAM:    68,175K

# Press button five times quickly
 Free RAM:   541,689K (    7,129K cached pss +    48,764K cached kernel +   485,796K free)
 Used RAM:   922,509K (  832,717K used pss +    89,792K kernel)
 Lost RAM:    66,458K

# After kill app
 Free RAM:   799,174K (    7,118K cached pss +    63,296K cached kernel +   728,760K free)
 Used RAM:   653,616K (  563,908K used pss +    89,708K kernel)
 Lost RAM:    77,866K

This RAM usage never goes back down, even after waiting several minutes with no button presses.

borrrden commented 4 months ago

Maybe I'm wrong about all this and it's just not garbage collecting in the app, but press the button enough times and the app crashes so that's not ideal.

vitek-karas commented 4 months ago

@jkurdek @simonrozsival could one of you please take a look if we can repro it?

simonrozsival commented 4 months ago

I am able to replicate the issue, although it takes around 30 "concurrent" calls to the method to crash the app for me (running Android API 33 emulator on Mac M1). The adb logcat output contains this message so clearly the problem is high memory usage:

scudo   : Scudo OOM: The process has exhausted 256M for size class 192.

When I run the methods one by one for a few minutes, I don't observe any crash and so it doesn't seem there's a memory leak. It appears that Rfc2898DeriveBytes.Pbkdf2(..., iterations: 64_000, ...) takes a lot of memory on its own and running multiple of them might exceed the OS memory limit for the process.

We have a specific managed implementation of this algorithm that we use only on Android and Browser. On the other platforms we use custom platform-specific code (Windows, Mac, OpenSSL). It is possible that the managed implementation is more memory hungry than the native ones and it is easier to hit the memory allocator limit?

vcsjones commented 4 months ago

> It is possible that the managed implementation is more memory hungry than the native ones and it is easier to hit the memory allocator limit?

It does allocate; however, I would not expect this to put an intense amount of memory pressure on an Android application. Most of the time it is writing to stack buffers, and the requested key in the example is only 32 bytes.

I can take a look at this and profile it over the next few days.

borrrden commented 4 months ago

Let me know if I can help in any way. I realize that differences in AVD definition could affect the results here. I have my AVD set to API 22 with 1024 MiB of memory for the main settings (I created it on the command line with avdmanager). I have no idea how profiling works since Xamarin died and its profiler was always enterprise only, and I don't see a replacement in the ".NET digit era". If I knew how to profile I would happily try to get some results.

The symptoms do seem to fit with large managed memory allocation, since eventually it's going to ask for "another" large allocation, the GC won't run in time to satisfy Android, and the app will get killed. It also fits that an unmanaged implementation would be less affected, since it would immediately release its memory.

filipnavara commented 4 months ago

> I have no idea how profiling works since Xamarin died and its profiler was always enterprise only, and I don't see a replacement in the ".NET digit era".

OT: I have long been contemplating doing a write-up or a conference session on that. The gist is that the standard dotnet-trace (for sampled performance profiling) and dotnet-gcdump (for memory profiling) do work on mobile .NET platforms. There's another tool, dotnet-dsrouter, that facilitates the communication between the standard tools and the mobile device / simulator. The MAUI wikis host tutorials on how to use the tools. Additionally, the awesome .NET Meteor extension for VS Code automates the bulk of the setup and launches the appropriate viewers for the data (speedscope.app and dotnet-heapview).

vcsjones commented 4 months ago

After looking at this a bit, nothing is leaking; the PBKDF2 implementation we had on Android was just inefficient.

Recall that PBKDF2 works by using HMAC. A lot of HMAC. In the example above, that is 64,000 HMAC invocations per PBKDF2 invocation (only one block is derived, assuming SHA-256 is the hash function).
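To make the "a lot of HMAC" point concrete, here is a minimal Java sketch of the textbook PBKDF2 loop built on javax.crypto.Mac. It is not the runtime's implementation; the class name Pbkdf2Sketch and the hmacCalls counter are made up for illustration, and SHA-256 is assumed as above. It only shows that the number of HMAC invocations is iterations times the number of output blocks.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of dotnet/runtime.
class Pbkdf2Sketch {
    // Number of HMAC invocations made by the most recent derive() call.
    static long hmacCalls;

    static byte[] derive(byte[] password, byte[] salt, int iterations, int keyLen) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(password, "HmacSHA256"));
        int hLen = mac.getMacLength();               // 32 bytes for HMAC-SHA-256
        int blocks = (keyLen + hLen - 1) / hLen;     // ceil(keyLen / hLen)
        byte[] out = new byte[keyLen];
        hmacCalls = 0;
        for (int i = 1; i <= blocks; i++) {
            // U1 = HMAC(password, salt || INT_32_BE(i))
            mac.update(salt);
            mac.update(new byte[] {(byte) (i >>> 24), (byte) (i >>> 16), (byte) (i >>> 8), (byte) i});
            byte[] u = mac.doFinal();
            hmacCalls++;
            byte[] t = u.clone();
            // Uj = HMAC(password, Uj-1), XORed into the running block value
            for (int j = 1; j < iterations; j++) {
                u = mac.doFinal(u);                  // returns a fresh array on every call
                hmacCalls++;
                for (int k = 0; k < hLen; k++) {
                    t[k] ^= u[k];
                }
            }
            int offset = (i - 1) * hLen;
            System.arraycopy(t, 0, out, offset, Math.min(hLen, keyLen - offset));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[16];
        Arrays.fill(salt, (byte) 1);
        // Same parameters as the repro: 64,000 iterations, 32-byte key, so one block.
        byte[] key = derive("foo".getBytes(StandardCharsets.UTF_8), salt, 64000, 32);
        System.out.println(key.length + " bytes derived with " + hmacCalls + " HMAC calls");
    }
}
```

Asking for more than 32 bytes adds another full pass of iterations per extra block, which is why the block count matters.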

Each HMAC invocation on Android goes from .NET to JNI, to the Mac class provided by the Android SDK. The Mac class design does not let us implement it efficiently by writing it to a Span. Rather, it must write to a Java array, and we must copy it back to .NET. So for each HMAC invocation, we were creating a 32-byte array in the JVM, so 64,000 Java arrays.

We have the same issue getting data into HMAC with Update. We create a Java array, copy the .NET memory to it, then update the MAC. So that is another 64,000 Java arrays. We can probably look at using NIO here, which I will do in a separate pull request.

So, all told, we are creating at least iterations * 2 Java arrays, each the size of the PBKDF2 block (the output size of the HMAC function) and we are doing a bunch of copying from .NET to JNI and back.
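Plugging the repro's numbers into that lower bound gives a feel for the scale. The sketch below is plain arithmetic; the class and method names are hypothetical, and it counts only array payload bytes, not per-object headers or JNI bookkeeping, which this thread does not quantify.

```java
// Hypothetical helper for the "iterations * 2" lower bound described above.
class JavaArrayEstimate {
    // At least two Java arrays per HMAC invocation: one for input, one for output.
    static long minArrays(long iterations) {
        return 2 * iterations;
    }

    // Payload bytes only; excludes object headers and any JNI overhead.
    static long minPayloadBytes(long iterations, int blockSize) {
        return minArrays(iterations) * blockSize;
    }

    public static void main(String[] args) {
        long iterations = 64_000;  // from the repro
        int blockSize = 32;        // HMAC-SHA-256 output size, as assumed above
        System.out.println(minArrays(iterations) + " arrays, "
                + minPayloadBytes(iterations, blockSize) + " payload bytes per PBKDF2 call");
    }
}
```

That is 128,000 short-lived arrays and about 4 MB of payload for a single call, so several concurrent calls multiply the churn accordingly.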

So, what can we do about it? For the one-shot, we can keep everything in Java. https://github.com/dotnet/runtime/pull/103016 does this, and brings the allocations down to < 5, total, regardless of the number of iterations. It also eliminates all of the copying between .NET and Java with the exception of the key. This averages out to about a 2x speedup, and memory usage drops significantly.
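As a rough illustration of what "keep everything in Java" buys, the sketch below derives the key with a single Java-side call, so a caller on the other side of JNI would hand over the inputs once and receive the key once. This is not what the pull request does internally; it uses the JDK's built-in SecretKeyFactory with the standard "PBKDF2WithHmacSHA256" algorithm name as a stand-in, and the class name OneShotPbkdf2 is hypothetical. Whether that algorithm is available on a given Android API level is an assumption to check.

```java
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import java.util.Arrays;

// Hypothetical illustration class, not part of dotnet/runtime.
class OneShotPbkdf2 {
    // One call in, one key out: all iterations run on the Java side.
    static byte[] derive(char[] password, byte[] salt, int iterations, int keyLenBytes) throws Exception {
        // PBEKeySpec takes the key length in bits.
        PBEKeySpec spec = new PBEKeySpec(password, salt, iterations, keyLenBytes * 8);
        try {
            SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
            return factory.generateSecret(spec).getEncoded();
        } finally {
            spec.clearPassword();  // wipe the spec's internal copy of the password
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[16];
        Arrays.fill(salt, (byte) 1);
        byte[] key = derive("foo".toCharArray(), salt, 64000, 32);
        System.out.println("derived " + key.length + " bytes in one call");
    }
}
```

The design point is the boundary, not the algorithm: the per-iteration HMAC outputs never leave Java, so nothing has to be copied back and forth 64,000 times.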

I will also be spending some time looking elsewhere where we can improve the crypto throughput in Android.