TileDB-Inc / TileDB-Java

Java JNI interface to the TileDB storage engine
MIT License

Concurrency issues after upgrade to 0.11.0 #304

Open chris-allan opened 1 year ago

chris-allan commented 1 year ago

Thanks to all for the quick turnaround resolving #301.

Unfortunately we've hit a much deeper snag performing an upgrade starting with 0.11.0. Nearly as soon as we re-open an Array in read mode that we had previously been writing to, we get native code errors or what appears to be a deadlock. This only happens after writing many, many overlapping chunks; it does not happen with 0.10.1.
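
In rough terms, the access pattern that triggers it looks something like the sketch below. This is purely illustrative: the URI, the 1D schema, and the attribute name "a" are made up (our real arrays are 5-dimensional), and the setSubarray/setBuffer calls shown are the older pre-Subarray-class style that later releases deprecate.

import io.tiledb.java.api.*;

public class ReopenAfterWrites {
  public static void main(String[] args) throws Exception {
    String uri = "reopen_test_array";  // hypothetical local array URI
    try (Context ctx = new Context()) {
      // Hypothetical 1D dense array with a single int32 attribute "a"
      try (Dimension<Integer> d = new Dimension<>(ctx, "d", Integer.class, new Pair<>(0, 1023), 256);
           Domain dom = new Domain(ctx);
           Attribute a = new Attribute(ctx, "a", Integer.class);
           ArraySchema schema = new ArraySchema(ctx, ArrayType.TILEDB_DENSE)) {
        dom.addDimension(d);
        schema.setDomain(dom);
        schema.addAttribute(a);
        Array.create(uri, schema);
      }

      // Many overlapping writes to the same region; each submit() lands as a fragment
      try (Array array = new Array(ctx, uri, QueryType.TILEDB_WRITE)) {
        for (int i = 0; i < 1000; i++) {
          try (Query query = new Query(array, QueryType.TILEDB_WRITE)) {
            query.setLayout(Layout.TILEDB_ROW_MAJOR);
            query.setSubarray(new NativeArray(ctx, new int[] {0, 1023}, Integer.class));
            query.setBuffer("a", new NativeArray(ctx, new int[1024], Integer.class));
            query.submit();
          }
        }
      }

      // Re-open the same array in read mode shortly afterwards
      try (Array array = new Array(ctx, uri, QueryType.TILEDB_READ);
           Query query = new Query(array, QueryType.TILEDB_READ)) {
        query.setLayout(Layout.TILEDB_ROW_MAJOR);
        query.setSubarray(new NativeArray(ctx, new int[] {0, 1023}, Integer.class));
        query.setBuffer("a", new NativeArray(ctx, 1024, Integer.class));
        query.submit();  // roughly where we see the errors or the hang on 0.11.0+
      }
    }
  }
}

Each submit() above produces its own fragment, which is how we end up with thousands of fragments before the re-open.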

With our production code this is the behaviour:

TileDB-Java 0.11.0 (TileDB 2.9.0)

[2023-07-25 17:49:14.391] [Process: 2565256] [error] [Global] [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
Caused by: io.tiledb.java.api.TileDBError: [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
    at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
    at io.tiledb.java.api.Context.handleError(Context.java:142)
    at io.tiledb.java.api.Query.submit(Query.java:130)
...

TileDB-Java 0.13.0 (TileDB 2.11.0)

[2023-07-25 13:57:01.358] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:03.753] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:08.940] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:08.942] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:08.943] [Process: 889580] [error] [Global] [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
[2023-07-25 13:57:09.289] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:11.144] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
    at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
    at io.tiledb.java.api.Context.handleError(Context.java:142)
    at io.tiledb.java.api.Query.submit(Query.java:130)
...

TileDB-Java 0.14.1 (TileDB 2.12)

[2023-07-25 13:43:49.710] [Process: 1772240] [error] [1690288813910863900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:43:49.712] [Process: 1772240] [error] [1690288813910863900-Global] Error: Internal TileDB uncaught exception; device or resource busy: device or resource busy
[2023-07-25 13:43:50.810] [Process: 1772240] [error] [1690288813910863900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:43:50.811] [Process: 1772240] [error] [1690288813910863900-Global] Error: Internal TileDB uncaught exception; device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: Error: Internal TileDB uncaught exception; device or resource busy: device or resource busy
    at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
    at io.tiledb.java.api.Context.handleError(Context.java:142)
    at io.tiledb.java.api.Query.submit(Query.java:130)
...

TileDB-Java 0.15.2 (TileDB 2.13.2)

Hang or deadlock. Worker stack traces (collected via jstack) are:

"pool-1-thread-1" #23 prio=5 os_prio=0 cpu=89546.88ms elapsed=340.37s tid=0x0000020972a92800 nid=0x3d32c runnable  [0x000000be392fe000]
   java.lang.Thread.State: RUNNABLE
        at io.tiledb.libtiledb.tiledbJNI.tiledb_query_submit(Native Method)
        at io.tiledb.libtiledb.tiledb.tiledb_query_submit(tiledb.java:2853)
        at io.tiledb.java.api.Query.submit(Query.java:130)
...

TileDB-Java 0.16.1 (TileDB 2.14.1)

Works for a while, dies later.

[2023-07-25 14:20:19.628] [Process: 1313796] [error] [1690290969279989200-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 14:20:19.632] [Process: 1313796] [error] [1690290969279989200-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 14:20:20.317] [Process: 1313796] [error] [1690290969279989200-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 14:20:20.317] [Process: 1313796] [error] [1690290969279989200-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
    at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
    at io.tiledb.java.api.Context.handleError(Context.java:144)
    at io.tiledb.java.api.Query.submit(Query.java:130)
...

TileDB-Java 0.17.8 (TileDB 2.15.4)

[2023-07-25 16:45:09.726] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:09.729] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:09.921] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:09.922] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:10.401] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:10.402] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:10.617] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:10.617] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:10.762] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:10.763] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:11.033] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:11.034] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:11.776] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:11.777] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
    at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
    at io.tiledb.java.api.Context.handleError(Context.java:144)
    at io.tiledb.java.api.Query.submit(Query.java:130)
...

I've put together a limited example which reproduces this:

It fails like this:

...
Inserting rectangle: [11078, 21057]
Inserting rectangle: [12093, 21090]
Not consolidating H:\code\tiledb-java-torture\tiledb_14210934035379110573\0
Creating TileDB array: H:\code\tiledb-java-torture\tiledb_14210934035379110573\1
[2023-07-26 13:28:54.114] [Process: 2200528] [error] [1690373955299503400-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-26 13:28:54.114] [Process: 2200528] [error] [1690373955299503400-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
Exception during execution
java.util.concurrent.CompletionException: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
    at java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1423)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
    at com.glencoesoftware.tiledb.Main.lambda$2(Main.java:404)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
    at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
    at io.tiledb.java.api.Context.handleError(Context.java:144)
    at io.tiledb.java.api.Query.submit(Query.java:130)
    at com.glencoesoftware.tiledb.Main.processTile(Main.java:359)
    at com.glencoesoftware.tiledb.Main.lambda$2(Main.java:402)
    ... 3 more

The above output snippet is from Windows 10; Linux behaves similarly, but not identically.

The code reflects fairly well the pattern in our production code that relies on TileDB:

  1. Process a large number of tiles writing them in a non-adjacent fashion from multiple workers to a 5-dimensional TileDB Array
  2. Downsample from the Array [1] and write to new 5-dimensional Array for each new "resolution"

20 channels is about right to produce the errors (~11,000 fragments); with less data, things proceed as normal. The issue occurs with or without consolidation.
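
As a rough, illustrative sketch of step 1 (not the actual torture-test code: class and variable names are made up, a 1D array with an int32 attribute "a" is assumed to exist already instead of our 5D arrays, and the setSubarray/setBuffer calls are the older deprecated style):

import io.tiledb.java.api.*;
import java.util.concurrent.*;

public class ConcurrentWriteSketch {
  public static void main(String[] args) throws Exception {
    String uri = "tile_array";  // hypothetical array, assumed to exist with an int32 attribute "a"
    int nWorkers = 8;
    ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
    try (Context ctx = new Context();
         Array array = new Array(ctx, uri, QueryType.TILEDB_WRITE)) {
      // Step 1: multiple workers write non-adjacent regions of the same Array,
      // one write query (and therefore one fragment) per tile.
      CompletableFuture<?>[] futures = new CompletableFuture<?>[nWorkers];
      for (int w = 0; w < nWorkers; w++) {
        final int worker = w;
        futures[w] = CompletableFuture.runAsync(() -> {
          try (Query query = new Query(array, QueryType.TILEDB_WRITE)) {
            query.setLayout(Layout.TILEDB_ROW_MAJOR);
            // This worker's (non-adjacent) region of the domain
            query.setSubarray(new NativeArray(ctx,
                new int[] {worker * 1024, worker * 1024 + 1023}, Integer.class));
            query.setBuffer("a", new NativeArray(ctx, new int[1024], Integer.class));
            query.submit();
          } catch (TileDBError e) {
            throw new CompletionException(e);
          }
        }, pool);
      }
      CompletableFuture.allOf(futures).join();
      // Optional consolidation pass here, e.g. Array.consolidate(ctx, uri) --
      // the issue occurs with or without it.
    }
    pool.shutdown();
  }
}

Step 2 then re-opens the filled Array in read mode and writes the downsampled data to the next resolution's Array, which is where the errors and hangs above show up.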

ihnorton commented 1 year ago

Hi @chris-allan, thanks for opening the issue and providing a reproduction. We are investigating.

chris-allan commented 1 year ago

Thanks, @ihnorton.

Just an FYI for the team: I've been running the test case a few more times, trying to pick apart what might be going wrong, and so far about half the time things block on Query.submit() for a very long time. So, depending on how patient you are, you may or may not see those native code exceptions and a clean exit.

chris-allan commented 1 year ago

At the time of the original submission we were at chris-allan/tiledb-java-torture@d2e71cb. I've just pushed chris-allan/tiledb-java-torture@009b39c, which does more logging and explains the intent of each instance variable and configuration option. While making these improvements and doing a few runs I noticed, just once, what appears to be a non-fatal log line printed during execution:

...
Read rectangle: [0, 1000]; status: TILEDB_COMPLETED
Read rectangle: [1000, 1000]; status: TILEDB_COMPLETED
[2023-07-28 11:47:11.266] [Process: 370504] [error] [1690541212777298500-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-28 11:47:11.266] [Process: 370504] [error] [1690541212777298500-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
Read rectangle: [2000, 1000]; status: TILEDB_COMPLETED
Read rectangle: [4000, 1000]; status: TILEDB_COMPLETED
...

This occurred when processing a single channel.

ihnorton commented 1 year ago

Hi @chris-allan, updating that we are still actively working on this -- @DimitrisStaratzis has reproduced and is working through with @KiterLuc.

chris-allan commented 1 year ago

Thanks, @ihnorton. I'll be out of the office for a couple weeks starting Monday but will try to keep an eye on things and answer any questions that come up as quickly as I can.

KiterLuc commented 1 year ago

Hi @chris-allan, we've identified the root cause of your issue, and we've prepared a pull request to address it. We anticipate being able to merge the solution within the next couple of days. We'll keep you updated on our progress.

KiterLuc commented 1 year ago

Hi @chris-allan, this release includes the fix: https://github.com/TileDB-Inc/TileDB/releases/tag/2.16.3. Let us know if it works for you. Thanks, Luc.

DimitrisStaratzis commented 1 year ago

TileDB-Java 0.18.0 has been released to support the latest core version.

chris-allan commented 1 year ago

Thanks to everyone for all the work that went into TileDB-Inc/TileDB#4256! We've successfully upgraded our codebase to 0.18.0, but have noticed a ~10x performance degradation vs. 0.10.1 in our production codebase during the copy phase: a job that takes ~18 minutes with 0.10.1 takes over 3 hours with 0.18.0. The slowdown seems to be localized to invoking Query.close(), which in some cases can take longer than the read or write itself; this is particularly noticeable where operations have to complete serially.
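
For context on where the stopwatch numbers below come from: the close() calls are wrapped roughly like the hypothetical helper sketched here (TimedClose and submitAndClose are made-up names, not code from the torture example; LoggingStopWatch is the stock perf4j class).

import io.tiledb.java.api.Query;
import io.tiledb.java.api.TileDBError;
import org.perf4j.LoggingStopWatch;
import org.perf4j.StopWatch;

// Hypothetical helper illustrating how the "Query.close()" tag is measured.
public final class TimedClose {
  private TimedClose() {}

  public static void submitAndClose(Query query) throws TileDBError {
    query.submit();
    StopWatch watch = new LoggingStopWatch("Query.close()");  // tag matches the report below
    query.close();                                            // the native cleanup being measured
    watch.stop();                                             // logs elapsed ms for perf4j to aggregate
  }
}

The perf4j jar invocation below then aggregates those log lines into the per-tag statistics.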

To try to illustrate this for everyone, I have updated the torture example with logging and some stopwatches where we are seeing issues, as of chris-allan/tiledb-java-torture@4211931. You will need to switch debug to true if you want the timings. Here is a summary run on Linux against an NVMe SSD:

callan@behemoth:~/code/tiledb-java-torture$ java -jar build/install/*/lib/perf4j-0.9.16.jar debug.log
Performance Statistics   2023-08-23 11:50:00 - 2023-08-23 11:50:30
Tag                                                  Avg(ms)         Min         Max     Std Dev       Count
Query.close()                                            0.1           0           1         0.2        1373
Subarray.close()                                         0.0           0           0         0.0        1373
TileDB.writeBlock()                                     45.2          11         185        37.1        1374

Performance Statistics   2023-08-23 11:50:30 - 2023-08-23 11:51:00
Tag                                                  Avg(ms)         Min         Max     Std Dev       Count
Query.close()                                            1.3           0          89         3.6        3905
Subarray.close()                                         0.0           0           0         0.0        3904
TileDB.writeBlock()                                     24.3          11         112         4.7        3904

Performance Statistics   2023-08-23 11:51:00 - 2023-08-23 11:51:30
Tag                                                  Avg(ms)         Min         Max     Std Dev       Count
DestinationQuery.close()                                 2.3           0          28         1.9        1540
DestinationSubarray.close()                              0.0           0           1         0.0        1540
Query.close()                                            0.5           0           2         0.6         442
SourceQuery.close()                                      5.8           0          97         5.8        1540
SourceSubarray.close()                                   0.0           0           1         0.0        1540
Subarray.close()                                         0.0           0           0         0.0         443
TileDB.readBlock()                                      13.2           4          25         4.4        1540
TileDB.writeBlock()                                     21.6           9          62         5.9        1982

Comparing like with like is hard due to the subarray API differences between 0.10.1 and 0.18.0, but I can't recall a time with 0.10.1 where the native memory cleanup in Query.close() was ever measurable in milliseconds.

I can open another issue if that's preferred. Happy to provide any further information that's helpful.

KiterLuc commented 1 year ago

Thanks for the update @chris-allan. We will start our investigation and see where the time is being spent.

DimitrisStaratzis commented 1 year ago

Hi @chris-allan, could you please give us some more details about the system you are using?

ihnorton commented 1 year ago

(^ especially the number of CPUs)

chris-allan commented 1 year ago

Ubuntu 22.04. 64GB of DDR4. NVMe.

callan@behemoth:~$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 9 5950X 16-Core Processor
    CPU family:          25
    Model:               33
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           1
    Stepping:            2
    Frequency boost:     enabled
    CPU max MHz:         5083.3979
    CPU min MHz:         2200.0000
    BogoMIPS:            6787.90
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 ss
                         e4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb s
                         tibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save
                         tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   512 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    8 MiB (16 instances)
  L3:                    64 MiB (2 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling
  Srbds:                 Not affected
  Tsx async abort:       Not affected

DimitrisStaratzis commented 1 year ago

Hi @chris-allan ,

We've attempted to replicate the issue but so far, we haven't been able to reproduce such a slowdown across versions. We have been conducting tests using your code under various scenarios and with datasets of different sizes.

Here is one representative experiment. (These statistics are for the entire runtime, not broken into time ranges like yours.)

0.10.1: 17 MINS 26 SECS

Tag                                  Avg(ms)     Min     Max   Std Dev     Count
SourceQuery.close()                     29.8       9      83       7.5    13,200
TileDB.readBlock()                      88.4      48     139       9.9    13,200
DestinationQuery.close()                17.9       2      50       8.6    13,200
Query.close()                            5.1       0     363       9.6    51,480
TileDB.writeBlock()                     23.2      13     387       6.1    64,680

0.18.0: 18 MINS 42 SECS

Tag                                  Avg(ms)     Min     Max   Std Dev     Count
SourceSubarray.close()                   0.0       0      12       0.4    13,200
SourceQuery.close()                     32.1      16     217       6.9    13,200
TileDB.readBlock()                     109.8      59     451      12.7    13,200
DestinationSubarray.close()              0.0       0       8       0.3    13,200
DestinationQuery.close()                18.8       4      48       7.6    13,200
Query.close()                            3.8       0     279       7.9    51,480
TileDB.writeBlock()                     21.7      11     378       6.1    64,680
Subarray.close()                         0.0       0       0       0.0    51,480

Moreover, we've created a modified version of your code that seamlessly operates with TileDB-Java 0.10.1, and the runtime remains very similar. Our stopwatches also confirm that the Query.close() method consistently takes a few ms in both versions.

It's important to note that all of our experiments were conducted on an Ubuntu EC2 instance with the exact specifications you provided.

To further assist us in addressing this matter, could you share some additional information to help us reproduce the issue (e.g., how many workers you have in production)?

We are happy to get on a call to discuss in more detail.

Best, Dimitris