ankidroid / Anki-Android-Backend

JNI bridge between AnkiDroid and Anki Desktop's rust code
GNU General Public License v3.0

Remove MemoryHeavySQLiteCursor #6

Closed david-allison closed 3 years ago

david-allison commented 3 years ago

This must happen before publishing the first release - I expect this will likely crash low-memory devices for some operations.

rslib does not currently stream data from SQL: all rows are loaded into memory, encoded into JSON, and sent to the Python layer.

https://github.com/ankitects/anki/blob/65d3a1393cbd111861221774f200f51b6ab3e89c/pylib/rsbridge/lib.rs#L81-L93

Rust can afford to avoid streaming SQL because its allocations are not bound by the Java heap limit. Java doesn't have that luxury, so we should ideally stream the data across the JNI boundary.
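The Java-side ceiling mentioned above can be inspected at runtime; a minimal sketch using only the JDK (no Android APIs assumed):

```java
public class HeapLimit {
    public static void main(String[] args) {
        // The ART/JVM heap cap that all Java-side allocations must fit under.
        // Native (Rust) allocations made over JNI live outside this limit,
        // which is why rslib can get away without streaming but Java can't.
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("max heap MiB: " + (maxHeapBytes / (1024 * 1024)));
    }
}
```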

david-allison commented 3 years ago
Source to change:

- https://github.com/david-allison-1/Anki-Android-Backend/blob/e4d101964398e3474833ef4689513db49dd1da99/rsdroid/src/main/java/net/ankiweb/rsdroid/database/MemoryHeavySQLiteCursor.java#L32-L40
- https://github.com/david-allison-1/Anki-Android-Backend/blob/55cdd62496e9882e962be09672450d415d25a79d/rsdroid/src/main/java/net/ankiweb/rsdroid/BackendV1Impl.java#L138-L175
- https://github.com/david-allison-1/Anki-Android-Backend/blob/5b75b4c4070f39f772d1b0c72aa339d684db34f5/rslib-bridge/src/lib.rs#L119-L145
Expected Java Error

```
java.lang.OutOfMemoryError: Failed to allocate a 553047794 byte allocation with 4194304 free bytes and 118MB until OOM
	at java.lang.String.<init>(String.java:233)
	at java.lang.String.<init>(String.java:149)
	at java.lang.String.<init>(String.java:119)
	at net.ankiweb.rsdroid.BackendV1Impl.fullQueryInternal(BackendV1Impl.java:162)
	at net.ankiweb.rsdroid.BackendV1Impl.fullQuery(BackendV1Impl.java:142)
	at net.ankiweb.rsdroid.BackendMutex.fullQuery(BackendMutex.java:83)
	at net.ankiweb.rsdroid.database.Session.fullQuery(Session.java:127)
	at net.ankiweb.rsdroid.database.MemoryHeavySQLiteCursor.<init>(MemoryHeavySQLiteCursor.java:34)
	at net.ankiweb.rsdroid.database.RustSupportSQLiteDatabase.query(RustSupportSQLiteDatabase.java:127)
	at net.ankiweb.rsdroid.database.RustSupportSQLiteDatabase.query(RustSupportSQLiteDatabase.java:122)
	at net.ankiweb.rsdroid.BackendSlowTests.memoryIsBounded(BackendSlowTests.java:103)
	at java.lang.reflect.Method.invoke(Native Method)
	at java.lang.reflect.Method.invoke(Method.java:372)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at androidx.test.internal.runner.junit4.statement.RunBefores.evaluate(RunBefores.java:80)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.junit.runners.Suite.runChild(Suite.java:128)
	at org.junit.runners.Suite.runChild(Suite.java:27)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
	at androidx.test.internal.runner.TestExecutor.execute(TestExecutor.java:56)
	at androidx.test.runner.AndroidJUnitRunner.onStart(AndroidJUnitRunner.java:395)
	at android.app.Instrumentation$InstrumentationThread.run(Instrumentation.java:1837)
```
david-allison commented 3 years ago

During testing, Rust hit an OOM which was not handled by catch_unwind. This is disappointing, but appears to be Rust's default OOM behaviour: instead of a catchable exception, we receive a SIGABRT.

We don't get many OOMs, but the inability to catch them from Java is frustrating: ACRA doesn't handle them, and Google Play typically won't give appropriate detail (no logs, and it will be hard to recover a Java method signature from the report).

It also did not seem that the Rust OOM limit was far from the Java OOM limit. EDIT: We had a large VM Heap on the emulator which explains this.

All in all, this calls into question the assumption that Rust will be fine because it has no heap limit. Additional real-world testing will need to be performed, and we may need to discuss with dae streaming data inside rslib (on top of streaming over the JNI boundary, which will happen in this ticket).

If this is an issue, it will get better over time. All of the functionality which is likely to be expensive (imports, maybe syncing) can be translated to Rust, and we can use this after #1 is completed.

EDIT: Pagination (LIMIT/OFFSET) on the Java side might also be possible. It would need significant testing, but could solve this for us at a performance cost.
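A rough illustration of the LIMIT/OFFSET idea, with a fake in-memory backend standing in for the native fullQuery call (all names and the page size are hypothetical):

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class PagedQueryDemo {
    static final int PAGE_SIZE = 2; // illustrative; a real page would be much larger

    // Stream rows page by page: only PAGE_SIZE rows are materialised on the
    // Java side at any moment. `runQuery` stands in for the JNI fullQuery call.
    static int forEachRow(String sql, Function<String, List<String>> runQuery,
                          Consumer<String> onRow) {
        int count = 0;
        for (int offset = 0; ; offset += PAGE_SIZE) {
            List<String> page =
                runQuery.apply(sql + " LIMIT " + PAGE_SIZE + " OFFSET " + offset);
            for (String row : page) { onRow.accept(row); count++; }
            if (page.size() < PAGE_SIZE) return count; // short page: no more rows
        }
    }

    public static void main(String[] args) {
        List<String> rows = List.of("r0", "r1", "r2", "r3", "r4");
        // Fake backend honouring the LIMIT/OFFSET suffix we appended.
        Function<String, List<String>> fake = sql -> {
            int offset = Integer.parseInt(sql.substring(sql.lastIndexOf(' ') + 1));
            int end = Math.min(offset + PAGE_SIZE, rows.size());
            return offset >= rows.size() ? List.<String>of() : rows.subList(offset, end);
        };
        int n = forEachRow("SELECT v FROM t", fake, r -> {});
        System.out.println(n); // prints 5
    }
}
```

Note dae's caveat below still applies: if the underlying query can't be served from an index, each page repeats the scan/sort work.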

Rust Error

```
I/DEBUG: *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
I/DEBUG: Build fingerprint: 'generic_x86/sdk_google_phone_x86/generic_x86:5.0.2/LSY66K/6695550:eng/test-keys'
I/DEBUG: Revision: '0'
I/DEBUG: ABI: 'x86'
I/DEBUG: pid: 27721, tid: 27737, name: roidJUnitRunner >>> net.ankiweb.rsdroid.instrumented <<<
I/DEBUG: signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
I/DEBUG: eax 00000000 ebx 00006c49 ecx 00006c59 edx 00000006
I/DEBUG: esi af3ffdb8 edi 0000000c
I/DEBUG: xcs 00000073 xds 0000007b xes 0000007b xfs 0000005f xss 0000007b
I/DEBUG: eip b76f1ea6 ebp 00006c59 esp af3feba0 flags 00200282
I/DEBUG: backtrace:
I/DEBUG: #00 pc 00073ea6 /system/lib/libc.so (tgkill+22)
I/DEBUG: #01 pc 00021b1b /system/lib/libc.so (pthread_kill+155)
I/DEBUG: #02 pc 00023394 /system/lib/libc.so (raise+36)
I/DEBUG: #03 pc 0001b874 /system/lib/libc.so (abort+84)
I/DEBUG: #04 pc 00ad1d34 /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so
I/DEBUG: #05 pc 00abd7e4 /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so (std::process::abort::h065e619536580e75+20)
I/DEBUG: #06 pc 00ac34b5 /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so (rust_oom+53)
I/DEBUG: #07 pc 00aec02f /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so (alloc::alloc::handle_alloc_error::hd089a51289db2e22+31)
I/DEBUG: #08 pc 002d22cd /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so
I/DEBUG: #09 pc 002a718e /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so
I/DEBUG: #10 pc 0028320c /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so
I/DEBUG: #11 pc 002b8c9c /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so
I/DEBUG: #12 pc 00278e48 /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so
I/DEBUG: #13 pc 00321c41 /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so (anki::backend::Backend::db_command::hb9cda1c640b22898+209)
I/DEBUG: #14 pc 00321d0d /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so (anki::backend::Backend::run_db_command_bytes::hbd3014586c1b3ec5+45)
I/DEBUG: #15 pc 001b0f29 /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so
I/DEBUG: #16 pc 001b038e /data/app/net.ankiweb.rsdroid.instrumented.test-2/lib/x86/librsdroid.so (Java_net_ankiweb_rsdroid_NativeMethods_fullDatabaseCommand+62)
I/DEBUG: #17 pc 0037bd1e /data/dalvik-cache/x86/data@app@net.ankiweb.rsdroid.instrumented.test-2@base.apk@classes.dex
I/DEBUG: Tombstone written to: /data/tombstones/tombstone_02
I/Zygote: Process 27721 exited due to signal (6)
W/ActivityManager: Error in app net.ankiweb.rsdroid.instrumented running instrumentation ComponentInfo{net.ankiweb.rsdroid.instrumented.test/androidx.test.runner.AndroidJUnitRunner}:
W/ActivityManager: Native crash
W/ActivityManager: Native crash: Aborted
D/AndroidRuntime: Shutting down VM
I/art: Debugger is no longer active
I/ActivityManager: Force stopping net.ankiweb.rsdroid.instrumented appid=10066 user=0: finished inst
I/ActivityManager: Killing 27721:net.ankiweb.rsdroid.instrumented/u0a66 (adj 0): stop net.ankiweb.rsdroid.instrumented
W/libprocessgroup: failed to open /acct/uid_10066/pid_27721/cgroup.procs: No such file or directory
W/ActivityManager: Spurious death for ProcessRecord{f2bc55 27721:net.ankiweb.rsdroid.instrumented/u0a66}, curProc for 27721: null
2072-2087/com.google.android.gms.persistent I/art: Background sticky concurrent mark sweep GC freed 19248(909KB) AllocSpace objects, 2(32KB) LOS objects, 4% free, 20MB/21MB, paused 25.469ms total 401.130ms
D/AndroidRuntime: >>>>>> AndroidRuntime START com.android.internal.os.RuntimeInit <<<<<<
```
Arthur-Milchior commented 3 years ago

It's not clear to me: is there a way to know all the places where such an OOM is realistic, and in that case, whether it would be possible to just add checks in those parts? For example, for SQL data, limiting the amount of data loaded at once; this is what was done for the big JSON in collections (note types and decks).

david-allison commented 3 years ago

I've added:

I'm not too happy with the LimitOffsetSQLiteCursor::getCount() method, but this seems to avoid the problem.
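For context on why getCount() is awkward for a paged cursor: it can't know the total row count without a separate pass. One common approach (a hypothetical sketch, not necessarily what the commit does) is to wrap the original SQL in a COUNT query:

```java
public class CountWrapper {
    // Hypothetical helper: derive a COUNT(*) query from the original SQL so
    // getCount() can answer without materialising every row. The cost is one
    // extra execution of the (possibly expensive) inner query.
    static String countQuery(String sql) {
        return "SELECT COUNT(*) FROM (" + sql + ")";
    }

    public static void main(String[] args) {
        System.out.println(countQuery("SELECT id, flds FROM notes"));
        // prints SELECT COUNT(*) FROM (SELECT id, flds FROM notes)
    }
}
```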

dae commented 3 years ago

re https://github.com/david-allison-1/Anki-Android-Backend/commit/0ee773d6bde3b73669d904a1d27a70e55e99121f#diff-ae1a973a64affaa5b18ac55b7af4e7b61d2509b657274051310a3d10cbb94c58R103

That seems like quite a high allocation for ~60 bytes * 5000 * 10. I'd be curious to know the length of the byte[] you're getting back from the backend there, as the raw data + json shouldn't be much over 3-4MB. Perhaps it will help if you wrap the bytes in an InputStream and get the array via https://stleary.github.io/JSON-java/org/json/JSONTokener.html, so you can avoid having to allocate a string just to parse the JSON.
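The suggestion sketched with plain JDK types (JSON-java's `JSONTokener` accepts an `InputStream` or `Reader` directly; a bare `Reader` stands in here so the sketch runs without the library):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class StreamedParseSketch {
    // Wrap the raw bytes from the backend in a stream instead of calling
    // new String(bytes): the String copy roughly doubles peak memory
    // (byte[] plus char[]) before parsing even starts.
    static Reader asReader(byte[] backendBytes) {
        return new InputStreamReader(
            new ByteArrayInputStream(backendBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "[[1,\"a\"]]".getBytes(StandardCharsets.UTF_8);
        // With JSON-java this would be:
        //   new JSONArray(new JSONTokener(asReader(payload)))
        Reader r = asReader(payload);
        System.out.println((char) r.read()); // prints [
    }
}
```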

But even having said that, it seems like there might be something else going on to get that much memory consumed. Are these tests being run in parallel, or are you running this test by itself? If the issue exists even when executing the test by itself, I wonder how much memory is being consumed by the act of adding the data. If Android offers a way to dump the current memory usage of a process and/or force a gc, it would be interesting to see how it changes over the course of that test.

Limit+offset may tank performance in cases where queries can't be serviced from an index - if the query does a full table scan or needs to sort afterward (eg 'order by' in the search queries), that cost is going to be multiplied by the number of pages you need to serve, so you may need to apply this more selectively if you go down this route.

One possible alternative would be to dump the JSON records one per line into a temp file if they exceed a certain count in the backend DB method. The frontend could then read the JSONL back in on demand as the cursor is iterated. No added overhead for smaller queries, and you'd avoid having to scan or order multiple times. It wouldn't solve the abort issue you got though. To reduce memory usage on the backend side, you could try reducing the default 40MB sqlite cache upper bound (https://github.com/david-allison-1/anki/blob/a148304fb3fac4f0042b01bd23cdbbefc3b1d151/rslib/src/storage/sqlite.rs#L45), and/or add a separate method that spools directly into a file instead of building a vec first.
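A minimal sketch of the frontend half of that JSONL idea, using only JDK file APIs (class and method names are made up, and the "records" are plain strings standing in for JSON rows):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class JsonlSpoolSketch {
    // Backend side (hypothetical): spill one JSON record per line to a temp
    // file once a row-count threshold is exceeded.
    static Path spool(List<String> jsonRecords) throws IOException {
        Path tmp = Files.createTempFile("rows", ".jsonl");
        Files.write(tmp, jsonRecords);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = spool(List.of("[1,\"a\"]", "[2,\"b\"]"));
        try (BufferedReader in = Files.newBufferedReader(tmp)) {
            // Frontend side: the cursor reads records back lazily, so only
            // one line is resident in memory at a time.
            String line;
            int n = 0;
            while ((line = in.readLine()) != null) n++;
            System.out.println(n); // prints 2
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```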

david-allison commented 3 years ago

re 0ee773d#diff-ae1a973a64affaa5b18ac55b7af4e7b61d2509b657274051310a3d10cbb94c58R103

> That seems like quite a high allocation for ~60 bytes * 5000 * 10. I'd be curious to know the length of the byte[] you're getting back from the backend there, as the raw data + json shouldn't be much over 3-4MB. Perhaps it will help if you wrap the bytes in an InputStream and get the array via https://stleary.github.io/JSON-java/org/json/JSONTokener.html, so you can avoid having to allocate a string just to parse the JSON.

> But even having said that, it seems like there might be something else going on to get that much memory consumed. Are these tests being run in parallel, or are you running this test by itself? If the issue exists even when executing the test by itself, I wonder how much memory is being consumed by the act of adding the data. If Android offers a way to dump the current memory usage of a process and/or force a gc, it would be interesting to see how it changes over the course of that test.

The test should have been better documented: it was explicitly designed as a stress test. 50 * 2 (UTF-16) * 5000 * 2^10 = ~500MB
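That arithmetic checked in code (interpreting the figures as 50 characters at 2 bytes each in UTF-16, times 5000, times 2^10):

```java
public class StressTestSize {
    public static void main(String[] args) {
        // 50 chars * 2 bytes (UTF-16) * 5000 * 2^10
        long bytes = 50L * 2 * 5000 * (1L << 10);
        System.out.println(bytes); // prints 512000000, i.e. the ~500MB figure
    }
}
```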

> Limit+offset may tank performance in cases where queries can't be serviced from an index - if the query does a full table scan or needs to sort afterward (eg 'order by' in the search queries), that cost is going to be multiplied by the number of pages you need to serve, so you may need to apply this more selectively if you go down this route.

Agreed - would want

> One possible alternative would be to dump the JSON records one per line into a temp file if they exceed a certain count in the backend DB method. The frontend could then read the JSONL back in on demand as the cursor is iterated. No added overhead for smaller queries, and you'd avoid having to scan or order multiple times. It wouldn't solve the abort issue you got though. To reduce memory usage on the backend side, you could try reducing the default 40MB sqlite cache upper bound (https://github.com/david-allison-1/anki/blob/a148304fb3fac4f0042b01bd23cdbbefc3b1d151/rslib/src/storage/sqlite.rs#L45), and/or add a separate method that spools directly into a file instead of building a vec first.

Touching the disk here is probably more trouble than it's worth. I'd guesstimate that a large percentage of our issues (~20%) come from ENOSPC and Android deleting cache files, and it seems fair to assume that a low-RAM device is also likely to be constrained on disk space.

Sample size of 1: I don't have problems with AnkiDroid crashes/space, but I typically have less free disk space than RAM (~500MB-1.5GB space, 3GB RAM).

As mentioned in #6, we'll get a lot of breathing room (~3-5x less RAM) by moving to Protobuf for SQL; the JSON serialisation is likely what's getting in the way.


Going forward, this seems to be the confluence of two related, but separate issues:

Maybe we also want to get #6 into the V1 release?


We're probably not going to find out whether I'm worrying too much about this until we go to beta.

I'm leaning conservative with this, ideally going ahead and getting #6 in, along with streaming the protobufs. I'd rather not annoy users with this change if at all possible.