TileDB-Inc / TileDB-Java

Java JNI interface to the TileDB storage engine
MIT License

API ergonomics surrounding unsigned types #177

Open chris-allan opened 4 years ago

chris-allan commented 4 years ago

After working with TileDB in Java on a couple of biological imaging projects, I wanted to bring up the issue of developer/API ergonomics surrounding unsigned types on the JVM. Specifically, the arrays going in and out of NativeArray instances for unsigned 8 and 16 bit data are of type short[] and int[] respectively. This has also been mentioned previously (#76) in the context of unsigned 64 bit arrays and the corresponding array type consistency (long vs BigInteger).

This typecasting "up" of unsigned arrays has two primary issues in my opinion:

  1. At least a doubling of the in-memory size of the resultant Java array following interactions with TileDB
  2. Conversion overhead when integrating with basic Java I/O and NIO APIs, and other image processing and computer vision APIs from Java such as N5 [1], OpenCV [2], Fiji [3], etc.

Where (1) is concerned, we are regularly processing multi-terabyte arrays and often have several gigabytes of array data in memory during such processing. Doubling the memory requirements of every array we work with due to an API choice is at best unfortunate. Furthermore, due to (2) these arrays don't end up lasting in memory very long, adding to GC pressure. The basic Java I/O and NIO APIs primarily take byte[] or ByteBuffer. The image processing and computer vision APIs predominantly expect the raw array (byte[], short[], etc.) and the pixel type to be specified separately.

I certainly understand the positions "Java doesn't have unsigned types" and "Java doesn't support array typecasts like C/C++". However, in practice the standard library has added a number of features over time to cater to unsigned workflows and at least to typecasting "up" from byte[], starting with NIO's ByteBuffer in 1.4 and culminating with the toUnsigned*() method additions in Java 8.
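To make the trade-off concrete, here is a minimal sketch (plain JDK, no TileDB APIs) contrasting the two representations for uint8 data: the "upcast" copy into short[] that doubles memory, versus keeping the compact byte[] and reinterpreting values on access with the Java 8 Byte.toUnsignedInt() helper:

```java
// Sketch of two ways to handle uint8 data on the JVM.
public class UnsignedDemo {
    // The "upcast" representation: each uint8 value widened into a short,
    // doubling the in-memory footprint.
    static short[] widenToShorts(byte[] uint8Data) {
        short[] widened = new short[uint8Data.length];
        for (int i = 0; i < uint8Data.length; i++) {
            widened[i] = (short) Byte.toUnsignedInt(uint8Data[i]);
        }
        return widened;
    }

    // The compact representation: keep byte[] and reinterpret on access.
    static int unsignedAt(byte[] uint8Data, int i) {
        return Byte.toUnsignedInt(uint8Data[i]); // added in Java 8
    }

    public static void main(String[] args) {
        byte[] pixels = { (byte) 0xFF, 0x00, (byte) 0x80 };
        System.out.println(widenToShorts(pixels)[0]); // 255
        System.out.println(unsignedAt(pixels, 2));    // 128
    }
}
```

The compact form is what the caller-handles-unsignedness proposal amounts to: the bytes stay bytes, and the interpretation happens at the point of use.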

This dissonance has led us to not specify unsigned types explicitly at all when using TileDB in Java, but instead handle them implicitly using signed types. This in itself has two fairly negative consequences:

  1. Fill values are less than ideal
  2. Implicit typing has to be communicated to other TileDB language runtimes such as Python

(1) can probably be mitigated by implementing a resolution to TileDB-Inc/TileDB#518. For biological imaging use cases a non-configurable fill value can be problematic for a number of other reasons as well. However, the implicit type communication is problematic and definitely complicates cross-language use when we are astype()ing numpy arrays everywhere.

I don't know how wedded current users of TileDB in other domains are to the way NativeArray functions. At the risk of sounding drastic I'd like to first understand why we shouldn't (can't?) rely on the caller to handle the unsignedness of the array and use short[] for uint8, int[] for uint16, etc. when working with NativeArray.


  1. https://github.com/saalfeldlab/n5
  2. https://opencv.org/
  3. https://imagej.net/Fiji

Shelnutt2 commented 4 years ago

@chris-allan first, I apologize for the delay in getting back to you here. We appreciate your thorough and detailed issue. It's great to get some feedback on your use case of TileDB-Java.

We understand the problem of the upcasting and the increase in memory usage. Copying through the JNI also adds memory overhead and GC pressure, as you highlighted.

After some internal discussions I believe we can make several changes which might help here. I'd like your feedback on whether these changes would help your use case.

  1. We can add new overloads of setBuffer that accept an NIO ByteBuffer in addition to the existing NativeArray overloads.
  2. We can switch NativeArray to use a direct-allocated NIO ByteBuffer instead of the C++ arrays we use today.
  3. We can look at adding support for Apache Arrow vectors as an alternative to ByteBuffers for interoperability.

TileDB's C API deals with void*s and raw bytes, so it does not need a typed array. As such, I believe we can allow the user to pass their own ByteBuffer, whose underlying memory we can get through the JNI. This allows users to avoid our NativeArray class entirely if they wish to manage everything themselves.

We can also change NativeArray to be backed by a direct-allocated NIO ByteBuffer. We can keep the same user-facing API, so this should not be a breaking change. Switching to a ByteBuffer inside the class would allow us to expose it, so users can get the ByteBuffer and pass it along to other APIs that take ByteBuffers, instead of always having to convert to Java arrays.
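As a hypothetical sketch of that second proposal (the class and method names here are illustrative, not the actual TileDB-Java API), a NativeArray-like wrapper backed by a direct ByteBuffer could keep typed accessors while exposing the raw buffer for zero-copy handoff:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch of a NativeArray-like class backed by a direct
// ByteBuffer: same typed accessor shape, plus the raw buffer exposed.
public class DirectBackedArray {
    private final ByteBuffer buffer;

    public DirectBackedArray(int numUint16Elements) {
        // Direct allocation so the native side can reach the memory via
        // JNI's GetDirectBufferAddress, avoiding a copy.
        buffer = ByteBuffer.allocateDirect(numUint16Elements * 2)
                           .order(ByteOrder.nativeOrder());
    }

    // Typed accessor preserving a NativeArray-style user-facing API.
    public int getUint16(int index) {
        return Short.toUnsignedInt(buffer.getShort(index * 2));
    }

    public void setUint16(int index, int value) {
        buffer.putShort(index * 2, (short) value);
    }

    // Escape hatch: hand the raw buffer to I/O or image APIs directly.
    public ByteBuffer toByteBuffer() {
        return buffer.duplicate();
    }
}
```

The duplicate() in toByteBuffer() shares the underlying memory but gives the caller independent position/limit state, which is one way to expose the buffer safely.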

Lastly, we've been watching the Apache Arrow project and have considered adding support at some level (to be determined) for using Arrow vectors. Our interest stems from Spark's integration with Arrow and our goal to achieve zero copy where possible. Arrow does not map directly to TileDB semantics (i.e. offsets are handled differently), but we could have zero copy for the data buffer and only deal with conversions for offsets.

I believe the ByteBuffer addition and conversion can happen relatively quickly; Arrow would need a bit more time as we try to find ways to use it without any conversion or copying required. Would the switch to ByteBuffers be helpful in reducing the memory consumption and copying through the JNI?

chris-allan commented 4 years ago

No problem at all, @Shelnutt2. We're all busy.

I think the ability to use a direct allocated NIO ByteBuffer via Query.setBuffer(ByteBuffer byteBuffer) would be very welcome. I expect this was what was in mind when #83 was created. Sorry I didn't catch and mention that previously. You might consider adding Query.setBuffer(byte[] arr) as well if that's not too much trouble as that will cover the cases where API consumers have byte[] in hand and don't want to allocate and copy into a direct allocated NIO ByteBuffer.

On a related note, you might consider expanding the Java type support for Query.setSubarray() and Query.setCoordinates() as well, to give the API consumer less copying and less native memory to manage. As these values have to come from Java arrays anyway, having to then put them into a NativeArray feels counterintuitive. It would also reduce the number of elements in try-with-resources blocks, which can already get quite crazy with the way the current API handles native memory cleanup; we have some that are 5 or more objects deep (see also #84).

I have to admit that I've only taken a superficial look at Apache Arrow in the past, and only for IPC. There are, to my knowledge, a very limited number of projects using the Arrow format across the JNI boundary. Those that are, Gandiva for example, have a lot of hand-rolled JNI code to enable them to do what they need to do. We have a conceptually very simple, dense array use case, so to me that seems like an awful lot of work and maintenance burden for limited gain.

gsvic commented 4 years ago

Hi @chris-allan,

I hope you are doing well. We've been discussing these issues with @Shelnutt2. We are currently doing some research so we can make good design decisions. However, we are most likely going to start with implementing an overload of Query.setBuffer(ByteBuffer byteBuffer), as you mentioned, and start experimenting with it.

The goal is to create a setBuffer overload that takes an NIO ByteBuffer and completely bypasses the NativeArray abstraction.

We will keep you up to date on any progress. Let us know if you have any thoughts; we really appreciate the feedback.

gsvic commented 4 years ago

Hi @chris-allan,

We recently added two overloads of setBuffer in Query class. Both support NIO ByteBuffers.

A snapshot release is available at https://oss.sonatype.org/content/repositories/snapshots/io/tiledb/tiledb-java/0.3.1-SNAPSHOT

Any feedback would be much appreciated.

Best, Victor

chris-allan commented 4 years ago

Thanks for the update, @gsvic. I'll definitely take a look next week if not before.

Have a great weekend!

chris-allan commented 4 years ago

Overall the changes look good.

Just have a few comments:

  1. All the other setBuffer(...) methods are fluent. What was the thinking behind having a ByteBuffer return value?

  2. Is the reason for forced native endianness on the ByteBuffer because of consistency issues surrounding endianness? In the TileDB format specification (https://github.com/TileDB-Inc/TileDB/blob/dev/format_spec/FORMAT_SPEC.md) values are declared as little-endian, but does TileDB care about the endianness of the actual array values?

  3. For our use case of very large dense arrays we are not currently using offsets. How important do you think having ByteBuffer versions of these methods, such as setBuffer(String attr, NativeArray offsets, ByteBuffer buffer), is? Do you think it would be a good idea to expand the test cases to also cover the use of setSubarray() combined with use of ByteBuffers or would that just be overkill?

I also added a couple comments to #180 with respect to ByteBuffer conveniences that can be used. Might be worthwhile demonstrating them in the tests so people know they exist.

gsvic commented 4 years ago

Thanks for the detailed feedback @chris-allan and the comments in https://github.com/TileDB-Inc/TileDB-Java/pull/180. I am enumerating the answers below:

  1. The thinking was just usability. We thought that it would be useful in some cases to return the ByteBuffer back to the user. For example, this implementation of setBuffer takes as input only the number of elements, and the buffer is created internally. Thus, if we do not return the buffer, users might not be able to manipulate the data that the buffer holds.

  2. The reason for that was that we faced some conversion issues when we used methods like getInt(), getChar(), etc. For example, on my machine, when the ingested data was converted to the proper type using some of the aforementioned methods, the output was invalid due to a different endianness. As a result, unit tests were failing both in CI and on some local testing/dev machines. By using the native endianness we were able to resolve such issues.

  3. This is something that we are working on, and we will probably open a PR by the end of the day today. I can keep you up to date on that.

Let us know about your thoughts.

chris-allan commented 4 years ago

The thinking was just the usability. We thought that it would be useful in some cases to return the ByteBuffer back to the user. For example, this implementation of setBuffer takes as input the number of elements only and the buffer is created internally. Thus, if we do not return the buffer, users might not be able to manipulate the data that the buffer holds.

Understood. I think all the setBuffer(...) overloads that currently take a NativeArray expect the array to already be allocated beforehand. To me it feels like removing the setBuffer(String attr, long bufferElements) implementation and keeping the consistent, fluent API would be the preferred option. Allocating a direct ByteBuffer is a fairly expensive operation: its memory comes from a different pool, and the garbage collection semantics surrounding it are different. Consequently, I don't think it's a good idea to allocate them on users' behalf, with the possibility of such a method being used in tight loops.
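The allocation pattern being argued for here can be sketched in plain JDK code (no TileDB APIs involved): allocate one direct buffer up front and reuse it across iterations, rather than having a fresh direct buffer allocated on every call inside a loop:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of caller-managed direct buffer reuse across a batch loop.
public class BufferReuse {
    public static long sumBatches(int batches, int elementsPerBatch) {
        // One direct allocation, hoisted out of the loop; clear() resets
        // position/limit without touching the underlying native memory.
        ByteBuffer buffer = ByteBuffer.allocateDirect(elementsPerBatch * 4)
                                      .order(ByteOrder.nativeOrder());
        long total = 0;
        for (int b = 0; b < batches; b++) {
            buffer.clear();
            for (int i = 0; i < elementsPerBatch; i++) {
                buffer.putInt(i); // stand-in for filling from a query
            }
            buffer.flip();
            while (buffer.hasRemaining()) {
                total += buffer.getInt();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumBatches(2, 3)); // 0+1+2, twice
    }
}
```

An API that allocates the direct buffer internally takes this hoisting decision out of the caller's hands, which is the concern with setBuffer(String attr, long bufferElements).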

The reason for that was that we faced some conversion issues when we used methods like getInt(), getChar, etc. For example, in my machine, when the data ingested was converted to the proper type using some of the aforementioned methods, the output was invalid due to a different endianness. As a result, unit tests were failing both in CI and some local testing/dev machine. By using the native endianness we were able to resolve such issues.

Understood. In reading more, it seems like this has been a long-standing issue with direct allocated ByteBuffer instances:

Considering this I think it would be appropriate to throw an exception (either TileDBError or ApiUsageException) rather than silently changing the byte order.
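A minimal illustration of the byte-order hazard discussed above, using only the JDK: the same four bytes decode to different int values depending on the order set on the ByteBuffer, which is exactly the kind of silent mismatch that makes an explicit exception preferable to quietly changing the order:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// The same raw bytes read back differently under different byte orders.
public class EndianDemo {
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] bytes = { 0x01, 0x00, 0x00, 0x00 };
        System.out.println(decode(bytes, ByteOrder.LITTLE_ENDIAN)); // 1
        System.out.println(decode(bytes, ByteOrder.BIG_ENDIAN));    // 16777216
    }
}
```

Note that a freshly wrapped or allocated ByteBuffer defaults to big-endian regardless of the platform, which is why a caller who forgets to set ByteOrder.nativeOrder() can see corrupted-looking values.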

gsvic commented 4 years ago

Hi @chris-allan,

Thank you for the feedback. We recently merged two PRs.

In general, there is an effort to transition to ByteBuffers, or at least to provide exactly the same functionality as we do with NativeArrays.

Next, we are planning to add some helper functions in the Query class that will provide a more user-friendly API for data manipulation. These will be datatype-specific methods that return the proper typed buffer view (e.g. IntBuffer, LongBuffer, etc.) according to the attribute datatype.

Let us know your thoughts/recommendations regarding your use case.

Best, Victor

chris-allan commented 4 years ago

Happy to help. PRs look good. I will hopefully have some time to test drive them more thoroughly next week.

I think the only piece of API feedback I have on what has already been done is that the use of ByteBuffer for methods such as setSubarray(), or for the offsets of the setBuffer() overload, feels less than ideal due to the native byte order requirement of a direct byte buffer. This effectively forces the use of individual put* methods for every insertion. Furthermore, allocating direct byte buffers for methods that may be called in tight loops is unlikely to be what a caller wants, for the reasons I gave previously. These arrays are going to be small; is there a reason why they cannot be standard Java arrays whose values are copied at the JNI level? setSubarray(int[]) and setBuffer(String, int[], ByteBuffer), for example?

For our use case I cannot see us using datatype-specific functions. We always know our type, and it is very easy to just call asIntBuffer() or asLongBuffer(); sadly the reverse is not possible. Consequently, I'd actually see additional datatype-specific methods as less user-friendly, not more.

Thanks for all the work so far.

Have a great weekend!

chris-allan commented 4 years ago

We now have a public implementation of our work with TileDB-Java exposed in this PR on our bioformats2raw repository:

The new ByteBuffer API and multi-architecture native JAR are very welcome additions!

During this work we came across a few things:

I'm aware that configurable fill values will be coming soon with TileDB but here is how things currently stand for 16-bit integers:

https://github.com/TileDB-Inc/TileDB/blob/db4137630512c8dd302776e4fd3cf7f3f10a890f/tiledb/sm/misc/constants.cc#L134-L138

Not sure exactly why these types were not all updated in TileDB-Inc/TileDB@b88b34bc4b24ccffe3441c09100732123eaea5be but the inconsistent defaults definitely seem incorrect. I will mention this on TileDB-Inc/TileDB#518.

joe-maley commented 4 years ago

@chris-allan We agree that the default empty values should be consistent. I have opened a PR to change char and signed integers to fill with their maximum value to match the behavior of unsigned integers: https://github.com/TileDB-Inc/TileDB/pull/1807

Thanks!