celzero / rethink-app

DNS over HTTPS / DNS over Tor / DNSCrypt client, WireGuard proxifier, firewall, and connection tracker for Android.
https://rethinkfirewall.com/
Apache License 2.0

Large memory consumption #572

Closed · Artim96 closed this issue 7 months ago

Artim96 commented 2 years ago

Hey, I was wondering whether it's normal for the app to use really large amounts of system memory. In developer settings under Memory, Android shows values between 1.7 and 2.5 GB of used memory (depending on the interval I set, with 1.7 GB showing at the 24 h interval and 2.5 GB at the 3 h interval). After force-closing it, it goes back to about 400 MB, which would be a much saner amount. So the question is: is that a problem with the app or with the Android version I use?

I'm currently on Beta 1 of the quarterly Feature Drop of Android 13 on the Pixel 6. Maybe someone on non-beta Android 13 can take a look. If it's a problem with Android, it would of course be ideal if Google could fix it before the version goes live; if it's a problem with the app, a fix would be appreciated too. I can't say how it was before the beta, though, since I only recently switched from a similar app that became unusable for me.

afontenot commented 2 years ago

N.B. the Pixel 6 has 8 GB of RAM.

I can confirm this issue on Android 12 (LineageOS). In fact it seems to be responsible for most of the crashes I've been seeing. My device only has 4 GB of RAM.

An example crash:

Log excerpt:

09-22 21:07:45.465  1562 29913 I ActivityManager: Low on memory:
09-22 21:07:45.466  1562 29913 I ActivityManager:   fg     TOP   118644: org.thoughtcrime.securesms (pid 26506) top-activity
09-22 21:07:45.466  1562 29913 I ActivityManager:   vis    FGS  1869523: com.celzero.bravedns (pid 29259) service
09-22 21:07:45.466  1562 29913 I ActivityManager:                        com.celzero.bravedns/.service.BraveVPNService<=Proc{1562:system/1000}
09-22 21:07:45.466  1562 29913 I ActivityManager:   Free RAM:   258,356K
09-22 21:07:45.466  1562 29913 I ActivityManager:   Used RAM: 3,637,306K
09-22 21:07:45.466  1562 29913 I ActivityManager:   Lost RAM:   372,080K
09-22 21:10:26.680  1562  3056 I ActivityManager: Process com.celzero.bravedns (pid 29259) has died: prcp FGS

I started a call in Signal around 2 minutes prior to this. It's the fg app you can see in the log above. A lot of random background tasks get killed until finally RethinkDNS dies just under 3 minutes into the low memory condition. (There's nothing else in logcat to indicate the cause of the process death.)

Here is the process exit as recorded by adb bugreport:

        ApplicationExitInfo #1:
          timestamp=2022-09-22 21:10:26.682
          pid=29259
          realUid=10272
          packageUid=10272
          definingUid=10272
          user=0
          process=com.celzero.bravedns
          reason=2 (SIGNALED)
          subreason=3 (TOO MANY EMPTY PROCS)
          status=9
          importance=125
          pss=0.88GB
          rss=750MB
          description=null
          state=empty
          trace=null

750MB RSS seems extremely high, probably indicating a leak. My guess is that Rethink gets killed on my phone well before it can hit the 2 GB level seen by the bug reporter.

ignoramous commented 2 years ago

My device only has 4 GB of RAM.

Android is guilty of taking up half that... Then there are so many other System / Privileged apps that are exempt from OOM... 4G is just too... stringent ;)

750MB RSS seems extremely high

Yep, it was 190M in v044; then moved up to 300M in one of the v05x releases. 700M+ sounds ridiculous. Checking...

TOO MANY EMPTY PROCS indicates Rethink's being torpedoed by ActivityManager, for having "empty processes"? https://cs.android.com/android/platform/superproject/+/master:frameworks/base/services/core/java/com/android/server/am/OomAdjuster.java;drc=7d7ee57b4dbbe4723c19cf5308eee10d6c6cf2a6;l=1099

ignoramous commented 2 years ago

On my OnePlus, the resident set size (RSS) is 135M and the app has been running for 2 days non-stop with DNS + Firewall.

    138,884K: com.celzero.bravedns (pid 16630 / activities)

I wonder if you folks have local blocklists enabled, which may have a leak (the blocklists are around 60M in-memory, and if there were a leak, I can see how it'd reach 800M+). If so, can you check whether disabling local blocklists (and/or using remote blocklists instead) results in lower RAM use that stays that way?

Btw, if you're so inclined, you can examine the heap on https://ui.perfetto.dev/ and share the traces.

Artim96 commented 2 years ago

I am using the On-device block lists. And yes, I just killed the app again with way over 1 GB usage over the last 3 hours.

What exactly do you mean by remote block lists?

And I've never used perfetto. Depending on how self-explanatory it is, I can see if I can share traces.

ignoramous commented 2 years ago

[W]hat exactly do you mean by remote block lists?

RethinkDNS+ (our resolver), but one can deploy their own or use a service like NextDNS, AdGuard DNS, ControlD, etc.

And I've never used perfetto. Depending on how self-explanatory it is, I can see if I can share traces.

You'd need a debug build of the app. I am looking, but perfetto's heap_profile and java_heap_dump have been unhelpful so far. From adb shell dumpsys meminfo com.celzero.bravedns it is clear that with local-blocklists enabled, after changing the DNS upstream a few times, or pausing / resuming the Rethink firewall a few times, the app's RSS balloons past 1G.

afontenot commented 2 years ago

it is clear that with local-blocklists enabled, after changing the DNS upstream a few times, or pausing / resuming the Rethink firewall a few times, the app's RSS balloons past 1G.

I have also discovered that simply disabling on-device blocklists causes the extra memory to be GCed immediately without making any other changes. PSS went from 1.3G → 0.2G.

Also, not sure if it's relevant, but on my device I usually see much higher PSS than RSS. This doesn't make sense, because PSS is supposed to be lower. The explanation seems to be that PSS counts "system" memory in its usage and RSS doesn't. I don't know what type of memory usage this is, exactly, but I'm reproducing the meminfo below:

In my testing so far, I have only seen OOM when there's also high RSS.

 App Summary
                       Pss(KB)                        Rss(KB)
                        ------                         ------
           Java Heap:     8536                          21992
         Native Heap:    10660                          11828
                Code:     4320                          35088
               Stack:      724                            732
            Graphics:     1616                           1616
       Private Other:   113992
              System:  1147686
             Unknown:                                  116232

           TOTAL PSS:  1287534            TOTAL RSS:   187488       TOTAL SWAP PSS:  1145221
ignoramous commented 2 years ago

Interesting, on my OnePlus, ~776M shows up in RSS:

 App Summary                                                                    
                       Pss(KB)                        Rss(KB)                   
                        ------                         ------                   
           Java Heap:    32656                          49248                   
         Native Heap:    40524                          41820                   
                Code:    42524                         138016                   
               Stack:     1732                           1740                   
            Graphics:    11236                          11236                   
       Private Other:   772976                                                  
              System:    13731                                                  
             Unknown:                                  776780                   

           TOTAL PSS:   915379            TOTAL RSS:  1018840       TOTAL SWAP PSS:      252

Also, the code-segment in my sample is high at 130M.

So, what I am doing is:

  1. Kill Rethink
  2. Start it back up with local-blocklists enabled.
  3. Pause the app, and resume after 2 secs.
  4. Flip DNS between System DNS and any other DNS.
  5. Enable/Disable Allow Bypass twice or thrice.
  6. adb shell dumpsys meminfo com.celzero.bravedns
afontenot commented 2 years ago

Pause the app, and resume after 2 secs.

Even just pausing and unpausing a few times bloats it for me, RSS 193M (with local blocklists enabled) → 813M.

This may not surprise you, as you're already familiar with the code, but I tried setting some breakpoints and found that the memory seems to be leaked with the call to Dnsx.newBraveDNSLocal().

What I found surprising was that this function also gets called when you press pause, not just when you press the stop button on the pause screen (or otherwise activate local DNS). It actually sometimes gets called multiple times, for reasons that aren't clear to me.

At any rate, every time it gets called, after waiting a few seconds for some GC to happen, the amount of memory the profiler detects as "unknown" grows by a very consistent 77.7 MB. That is obviously very close to the ~68 MB current size of the trie+rank files.

I'm not really qualified to debug further, but it looks a bit strange to me that NewBraveDNSLocal() is rebuilding the internal trie object every time rather than reusing it. trie.Build() is not that fast and uses a lot of memory, actually. You can see the spike when profiling before it gets GCed a second or two later.

Let me know if there's anything I can help debug.

ignoramous commented 2 years ago

I was half-expecting that the leak would be with the trie impl. It is not clear why it leaks, though, because the older trie is unset here before a new one is created with Dnsx#newBraveDNSLocal:

https://github.com/celzero/rethink-app/blob/0e2339d38317145070747b3ed8891cdca8c25aea/app/src/main/java/com/celzero/bravedns/net/go/GoVpnAdapter.kt#L144

And the golang runtime invokes GC before sending the new trie obj across to the kotlin-land:

https://github.com/celzero/firestack/blob/9da081d30e21060087bf2c0e8a7b9bec46a6f103/intra/dnsx/bravedns.go#L328

It actually sometimes gets called multiple times, for reasons that aren't clear to me.

local-blocklists are set up every time the tunnel is "updated" (DNS changes, VPN restarts, network changes, etc.).

I'm not really qualified to debug further, but it looks a bit strange to me that NewBraveDNSLocal() is rebuilding the internal trie object every time rather than reusing it. trie.Build() is not that fast

Inefficient, but it isn't supposed to be expensive on the CPU or RAM (the trie holds 5M blocklist entries), or so we thought, because the file is a stream of bytes that can be seeked into as-is... but now that I look at the impl, that isn't what's happening: the file is being read into a u16 byte-array rather than being mmap'ed.
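
For context, a minimal sketch of the mmap approach alluded to here, using golang.org/x/sys/unix. This is illustrative only, not the firestack implementation; the package and function names are made up.

    package blocklist

    import (
        "os"

        "golang.org/x/sys/unix"
    )

    // mmapFile maps a blocklist file (e.g. the trie or rank blob) read-only
    // into the address space. The returned slice is backed by the kernel's
    // page cache rather than the Go heap, so the data is not copied into
    // memory up front and clean, file-backed pages can be dropped by the
    // kernel under memory pressure.
    func mmapFile(path string) (data []byte, unmap func() error, err error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, nil, err
        }
        defer f.Close() // the mapping stays valid after the fd is closed

        fi, err := f.Stat()
        if err != nil {
            return nil, nil, err
        }

        data, err = unix.Mmap(int(f.Fd()), 0, int(fi.Size()),
            unix.PROT_READ, unix.MAP_SHARED)
        if err != nil {
            return nil, nil, err
        }
        return data, func() error { return unix.Munmap(data) }, nil
    }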

...the amount of memory the profiler detects as "unknown" grows by a very consistent 77.7 MB. That is obviously very close to the ~68 MB current size of the trie+rank files.

Might be golang obj overhead, though; the perfetto UI wouldn't show me anything.

Let me know if there's anything I can help debug.

Thanks. The cause of the memory leak is clear (the trie), but the why isn't... I'll let you know how it goes.

ignoramous commented 2 years ago

After a few VPN restarts with local blocklists enabled, Android Studio's Memory Profiler (for Android Q+) shows that there are hundreds of MBs sitting in the queue waiting to be finalized (categorized as Other). As mentioned above, these allocs belong to the Dnsx.BraveDNS objects created on the golang side of things.

What we've tried so far that has made no difference:

  1. Setting runtime/debug.SetMemoryLimit() to 100M (here); see the sketch below this list.
  2. Removing the need for kotlin-land to retrieve dnsx.BraveDNS (local blocklists) proxy obj via tunnel.GetBraveDNS or have a use for Dnsx.NewBraveDNSLocal.
  3. Converting the trie and rank vars to use pointers (this was never going to make a difference, as Go slices already carry a pointer to their backing array).
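
Regarding item 1, a minimal, self-contained sketch of what the soft memory limit does and why it can't help here (Go 1.19+; the 100M figure comes from the list above, everything else is illustrative):

    package main

    import (
        "fmt"
        "runtime"
        "runtime/debug"
    )

    // retained simulates long-lived data (like a loaded trie) that stays
    // reachable, so no amount of garbage collection can shrink it.
    var retained [][]byte

    func main() {
        // Soft limit of 100 MiB (Go 1.19+). The GC runs more aggressively
        // as the heap approaches the limit, but it only reclaims garbage;
        // live, reachable memory stays put, which is why this made no
        // difference for the leak discussed above.
        debug.SetMemoryLimit(100 << 20)

        for i := 0; i < 200; i++ { // retain ~200 MiB of live data
            b := make([]byte, 1<<20)
            for j := range b {
                b[j] = 1 // touch the pages so they are actually committed
            }
            retained = append(retained, b)
        }

        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("limit=100 MiB, live heap ~%d MiB\n", m.HeapAlloc>>20)
    }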

What we have noticed is that the size of "leaked" memory (or, memory waiting to be finalized) is higher on devices with more RAM. On a device with 4 GB of RAM, the app seemed to have settled at 700M. So, we wonder if the JVM GC settings can be tightened to make sure the finalizer queue doesn't grow as big?

It is unclear if the way objs are exchanged across the GoMobile/JVM boundary is the root cause of this. But if it is (which is likely), then a rewrite is needed of how on-device (local) blocklists, and the related interaction across language boundaries, are implemented... and that's going to take a while. In the short term, for v053k, we are inclined to implement a solution that creates dnsx.BraveDNS just once (this will break a few cases; for example, newer blocklists downloaded while the app is already active will not override the existing ones until the app process is restarted).
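
A minimal sketch of that create-once idea (hypothetical names; not the actual change that shipped in v053k):

    package blocklist

    import "sync"

    // BraveDNS stands in for the gomobile-exported local-blocklist object.
    type BraveDNS struct {
        trie []byte // serialized trie blob
        rank []byte // accompanying rank/metadata blob
    }

    var (
        once   sync.Once
        shared *BraveDNS
    )

    // GetOrCreate builds the trie-backed blocklist exactly once per process;
    // every later call returns the same object, so pausing/resuming the VPN
    // or switching the upstream DNS no longer allocates a fresh trie. The
    // trade-off, as noted above, is that a newly downloaded blocklist is
    // only picked up after the app process restarts.
    func GetOrCreate(trieBlob, rankBlob []byte) *BraveDNS {
        once.Do(func() {
            shared = &BraveDNS{trie: trieBlob, rank: rankBlob}
        })
        return shared
    }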

This "leak" also exists in v053g (the earliest version we tested), though less severe for reasons unknown.

afontenot commented 2 years ago

What we have noticed is that the size of "leaked" memory (or, memory waiting to be finalized) is higher on devices with more RAM. On a device with 4 GB of RAM, the app seemed to have settled at 700M. So, we wonder if the JVM GC settings can be tightened to make sure the finalizer queue doesn't grow as big?

This isn't true in my experience. In testing, by pausing and unpausing the firewall I was able to get up to 1.7 GB and probably beyond on real hardware with 4G RAM, as shown in the Android Studio profiler. (You reach a point where Android starts having problems, so I haven't pushed it to the limits yet.)

In the short term, for v053k, we are inclined to implement a solution that creates dnsx.BraveDNS just once (this will break a few cases; for example, newer blocklists downloaded while the app is already active will not override the existing ones until the app process is restarted).

Strikes me as a pretty reasonable short term fix, at least until the underlying cause can be worked out. Maybe, in this case, it would make sense to prompt the user to restart the app (along with a button to do so)?

ignoramous commented 2 years ago

via Oliver (@Uldiniad),

A related golang thread (though the issue mentioned therein has since been fixed): https://groups.google.com/g/golang-nuts/c/-7ehtvFzJj0

My experimental method using the test app I linked:

  • Allocate and initialize 300MB within the Go runtime.
  • Free the 300MB
  • See how much can be allocated using the native runtime before the OS kills the app. This only got to about 100MB.

In an independent run where I just allocate native memory the OS kills the app at around 400MB, which would indicate that the 300MB of Go memory was counting against the total.

One thing to note here is that I'm both allocating and initializing the memory in Go. If I just allocate the memory (an empty array for example) then the memory does seem to stay clean. Once it's been initialized and then freed it seems to stay dirty though.


Hi Eric. Are you building for arm (32-bit) or arm64? Sadly, because of golang.org/issue/9993, we currently don't release memory back to the OS on arm64.

Ah! Issue 9993 explains a lot. I am indeed running on arm64 and x86_64 (simulator on OS X). I've seen it return memory to the OS in the simulator before, but I think I've only seen it do so successfully one time. Other times I've been unable to get the resident memory to drop.

I gave gctrace=1 a shot, and it does indeed seem to be trying to release memory ("scvg-1: 70MB released" for example), but consistent with my previous findings, this doesn't lower the resident memory of the process. This was in the simulator, which is running on x86_64. I can try on-device as well (arm64), but that's been even more stubborn about returning memory and I suspect it's the same.

If we really can't reclaim memory from the Go runtime that might be a blocker to using Go Mobile. Any timing info you can provide on when this might be addressed (or options for patches, etc) is helpful.


iOS test app: https://github.com/efirestone/GoMobileMemoryTester
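
For anyone who wants to reproduce the quoted experiment in a plain Go program, here's a minimal sketch (illustrative only; the original test app is the iOS project linked above):

    package main

    import (
        "fmt"
        "runtime"
        "runtime/debug"
    )

    func report(label string) {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("%-10s heapAlloc=%4d MiB heapIdle=%4d MiB heapReleased=%4d MiB\n",
            label, m.HeapAlloc>>20, m.HeapIdle>>20, m.HeapReleased>>20)
    }

    func main() {
        report("start")

        // Allocate and *initialize* ~300 MiB, mirroring the quoted test:
        // touching the pages is what makes them dirty.
        blob := make([]byte, 300<<20)
        for i := range blob {
            blob[i] = byte(i)
        }
        report("allocated")

        // Drop the reference and ask the runtime to hand memory back to
        // the OS. Whether the process's resident set actually shrinks is
        // platform-dependent; per golang.org/issue/9993 (since fixed),
        // arm64 builds at the time did not release pages back to the OS.
        blob = nil
        debug.FreeOSMemory()
        report("freed")
    }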

ignoramous commented 2 years ago

v053k has a workaround where the memory for the blocklist is allocated just once for the entire duration of the app's lifetime. This means that if a user updates blocklists while the app is still running, the update won't take effect until the app is restarted (as much is communicated to the user in a toast msg post-download).

ignoramous commented 1 year ago

We've fixed a bunch of memory copies by doing some terrible hacks. May the software gods have mercy on us.

Mygod commented 1 year ago

It looks like I am seeing the same issues. (dumpsys activity exit-info)

  package: com.celzero.bravedns
    Historical Process Exit for uid=10400
        ApplicationExitInfo #0:
          timestamp=2023-09-26 18:24:42.378 pid=1668 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=16MB rss=64MB description=null state=empty trace=null
        ApplicationExitInfo #1:
          timestamp=2023-09-26 18:01:31.542 pid=8662 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=13 (OTHER KILLS BY SYSTEM) subreason=3 (TOO MANY EMPTY PROCS) status=0
          importance=400 pss=87MB rss=104MB description=too many empty state=empty trace=null
        ApplicationExitInfo #2:
          timestamp=2023-09-25 12:50:54.347 pid=28017 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=10 (USER REQUESTED) subreason=21 (FORCE STOP) status=0
          importance=400 pss=295MB rss=235MB description=stop com.celzero.bravedns due to from pid 12182 state=empty trace=null
        ApplicationExitInfo #3:
          timestamp=2023-09-24 17:01:50.659 pid=7215 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=129MB rss=178MB description=null state=empty trace=null
        ApplicationExitInfo #4:
          timestamp=2023-09-24 16:06:57.143 pid=16998 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=5 (APP CRASH(NATIVE)) subreason=0 (UNKNOWN) status=6
          importance=125 pss=173MB rss=209MB description=crash state=empty trace=null
        ApplicationExitInfo #5:
          timestamp=2023-09-23 21:58:05.053 pid=13753 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=0.00 rss=0.00 description=null state=empty trace=null
        ApplicationExitInfo #6:
          timestamp=2023-09-23 21:21:41.185 pid=21084 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=5 (APP CRASH(NATIVE)) subreason=0 (UNKNOWN) status=6
          importance=125 pss=73MB rss=118MB description=crash state=empty trace=null
        ApplicationExitInfo #7:
          timestamp=2023-09-23 20:53:23.134 pid=18584 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=15MB rss=84MB description=null state=empty trace=null
        ApplicationExitInfo #8:
          timestamp=2023-09-23 20:44:59.323 pid=12858 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=20MB rss=85MB description=null state=empty trace=null
        ApplicationExitInfo #9:
          timestamp=2023-09-23 20:34:46.850 pid=4225 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=226MB rss=187MB description=null state=empty trace=null
        ApplicationExitInfo #10:
          timestamp=2023-09-18 15:58:36.281 pid=4377 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=0.00 rss=0.00 description=null state=empty trace=null
        ApplicationExitInfo #11:
          timestamp=2023-09-18 15:29:55.077 pid=29759 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=32MB rss=89MB description=null state=empty trace=null
        ApplicationExitInfo #12:
          timestamp=2023-09-18 15:20:18.261 pid=8372 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=55MB rss=104MB description=null state=empty trace=null
        ApplicationExitInfo #13:
          timestamp=2023-09-18 13:54:55.960 pid=13862 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=104MB rss=147MB description=null state=empty trace=null
        ApplicationExitInfo #14:
          timestamp=2023-09-18 12:22:22.764 pid=10463 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=17MB rss=72MB description=null state=empty trace=null
        ApplicationExitInfo #15:
          timestamp=2023-09-18 12:03:18.922 pid=4395 realUid=10400 packageUid=10400 definingUid=10400 user=0
          process=com.celzero.bravedns reason=3 (LOW_MEMORY) subreason=0 (UNKNOWN) status=0
          importance=400 pss=26MB rss=89MB description=null state=empty trace=null
ignoramous commented 7 months ago

High memory use with on-device blocklists has been fixed from v055b (2 March 2024) onwards.

Please feel free to re-open if that's not the case.

Artim96 commented 7 months ago

Thanks. Right now it's sitting at around 700 MB with a maximum of 1.2 GB. My guess would be that that's expected, so case closed for me too. If someone experiences much higher usage, this can serve as a comparison point, with 13 blocklists active (running Android 14 QPR3 Beta 2).

ignoramous commented 7 months ago

Well, 700 MB is too high, but the fix should at least stop the memory leak every time the on-device blocklist is turned on/off in cycles. Previously, doing so increased memory use by 100 MB+ per cycle.