Move from ByteBuffer to Memory

leventov commented 7 years ago

The goal of this issue is to agree on basic design principles and expectations from this refactoring.

Goals

Speed up critical places in query processing by skipping bound checks, which ByteBuffer always performs when making any read or write. Bound checks could be hoisted outside the hot loops manually or enabled via assertions/static final boolean flag, that is effectively free with some precautions.
Get rid or 2 GB limitation, use long indexes. See #3743.
Don't make a lot of Object copies when a Memory object has to be "sliced", "converted to read-only", etc. See https://github.com/druid-io/druid/pull/3716#issuecomment-274658045.
+ Eliminate current Endian craziness in Druid by standardizing on native-endian. https://github.com/druid-io/druid/issues/3892#issuecomment-276235357

Design

Based on DataSketches's Memory.
~~It is moved to Druid source tree, not used as a dependency.~~ See https://github.com/druid-io/druid/issues/3892#issuecomment-276589924
Memory object is immutable. "position and limit", if needed, are passed along with Memory as two primitive longs.
Upon creation a mutable Memory object has a cached immutable "view" object (which implements read methods and throws exception from write methods), this object is always returned when "as read only" is needed.
Explicit close() or free() on Memory is possible, but not strictly required, there is a sun.misc.Cleaner safety net, as well as in DirectByteBuffer.
While transition from ByteBuffer to Memory is in progress (~~not expected to be done all at once, but subsystem by subsystem, class by class~~ see https://github.com/druid-io/druid/issues/3892#issuecomment-276548114), and also when we need to interop with external libraries which require ByteBuffer, conversion from Memory to ByteBuffer and vice versa is possible. Likely it requires to
- Use DirectByteBuffer-compatible format of Cleaner inside Memory
- Be able to access private DirectByteBuffer constructors via MagicAccessorImpl.
+ "Global" (immutable) Memory's bounds are checked optionally via assertions/guarded by static final boolean flag, "local" position and limit are check explicitly manually, with helper methods or versions of read and write methods of Memory, which accept read/write position and "local" limits. https://github.com/druid-io/druid/issues/3892#issuecomment-276185704

Objections, additions, corrections, questions are welcome. @leerho @cheddar @weijietong @akashdw @himanshug @fjy @niketh

leventov commented 7 years ago

Unresolved problems

Big-endian format that is currently used in Druid and ByteBuffers?

leventov commented 7 years ago

+ Design

"Global" (immutable)Memory's bounds are checked optionally via assertions/guarded by static final boolean flag, "local" position and limit are check explicitly manually, with helper methods or versions of read and write methods of Memory, which accept read/write position and "local" limits.

cheddar commented 7 years ago

@leventov there's a problem with moving Memory to the Druid source tree, which is that it would then make the current implementations of Sketches not work. They would have to either be re-implemented depending on Druid (which is a huge dependency for them to take) or we would have to have some sort of compatibility layer between the two.

I do think that it is likely for Druid to have some classes that make it easier to use Memory for various operations, which would definitely live in Druid. The "read-only" version seems like something we should get the Memory library to support.

The non-native-endian problem basically means that our initial version of the integration will have to do non-native-endian reads (i.e. convert after reading non-native-endian value). Then, in a subsequent version, we will have to create a new serialization version that is only native endian.

The close/free stuff is already done.

The MagicAccessImpl thing looks cool. Is that basically a thing where we can put any arbitrary address and length in and it gives back a DirectByteBuffer that uses that? If so, that could be useful, yeah.

leventov commented 7 years ago

@cheddar I think sooner or later the proposed Memory should anyway become a part of Druid source, because it is a critical dependency and it's not good to depend on PR review/release cycle done by Yahoo exclusively, not Druid community. This is the same story as with java-util/bytebuffer-collections/extendedset and Metamarkets. Sooner, i. e. from the very beginning, is better.

Another solution is to move DataSketches lib to github.com/druid-io.

We could also add DataSketches's Memory <-> Druid's Memory conversion. Since Druid's Memory is supposed to be based on the DataSketches's one, it should be easy.

What do people from Imply (@gianm, @fjy) think on this question?

The MagicAccessImpl thing looks cool. Is that basically a thing where we can put any arbitrary address and length in and it gives back a DirectByteBuffer that uses that? If so, that could be useful, yeah.

I think it is needed for being able to make a refactoring gradually, not all-or-nothing. But it is super-ugly, likely won't work on any JDK except OpenJDK/Oracle JDK, likely won't work on Java 9 at all. So better it to be a temporary hack. If there are huge libraries which Druid depends on and they require ByteBuffer, it could also be a problem, because it means that the hack couldn't be removed.

leerho commented 7 years ago

@leventov @cheddar ,

+ Goal

Eliminate current Endian craziness in Druid by standardizing on native-endian.

cheddar commented 7 years ago

Fwiw, I think it is better to try to do an all-or-nothing swap of the libraries and just make sure that we support the same persisted data. It's a much larger PR and will be a significant QA/code review risk, but this is also something that I think we either want to take whole-hog or not at all.

I propose we bring this up at the dev sync tomorrow, I'll make sure to mention it.

leerho commented 7 years ago

@leventov @cheddar Interesting discussion. I just have a few comments

"Don't make a lot of Object copies when a Memory object has to be "sliced", ...". The MemoryRegion(R) is a view of its parent as "slice" is for ByteBuffer; no data is copied.

If the concern is the over-creation of these shell objects, with Memory you could create a read-only view of the parent once, and then just pass the offset and permitted length to the downstream users of some sub-chunk; no object creation at all. This is a little messy for the downstream users but safe. With BB this is not possible, because the position, limit, etc are always writable, which could screw up the parent.

MagicAccessorImpl. I've been looking at this and it concerns me.

It brings in yet another hierarchy of dependencies (org.objectweb.asm), which could make maintenance & debugging rather challenging.
Creating a DirectByteBuffer requires calling the internal Bits.reserveMemory() and unreserveMemory(), which call the sun.misc.SharedSecrets() that keeps track of all created instances of DirectByteBuffer so that they can be properly freed. I don't see any mechanisms in the MagicAccessorImpl that does this. Of course, I may not fully grok how it works :).

Using the JVM Cleaner. Although this might be possible, the Java gurus that I have spoken with, who work on the internals of the JVM, advise me to never use Cleaner as it can be quite dangerous (and hazardous to your mental health :) ). This and the above point are why I don't recommend trying to create a DirectByteBuffer from a Memory. Creating a HeapByteBuffer from Memory is trivial and can be done with the current code.
Memory already has read-only implementations of NativeMemory and MemoryRegion.
Memory has no dependencies outside Java, and you can think of it as a wrapper around Unsafe. This being the case, it would be possible to create classes in Memory that leverage the same UnsafeUtil and include position, limit, rewind, etc. But, is this really necessary? Being able to read old data should not be a problem, but using Memory is a different programming paradigm that new downstream applications would need to get used to.
Memory was designed as a fast mechanism for random access of different primitive types sharing the same backing memory, similar to a C struct and union combined. ByteBuffer was designed for fast stream-based IO buffer operations, for which it does a splendid job. Reading a TCP/IP packet stream cannot be read with a random access paradigm so there is no choice. Reading data from RAM can be read either in a random access paradigm or sequentially, which depends on the application. But it seems to me to be overly restrictive for Druid to require that downstream processes must use a sequential paradigm in order to process their data out of RAM. Processing sketch data structures is a good example that requires random access, but I'm sure there will be many others.

Converting Druid away from ByteBuffers will be a lot of work no matter what path is chosen.

gianm commented 7 years ago

@leventov On the question of whether Memory should be part of Druid or not, my thought is that it is weird for Druid to depend on sketches-core, but if Memory stuff was broken out into its own library that both Druid and sketches-core used, then that could be fine. To me the decision of incorporating anything into the Druid tree or using it as a dependency should be made based on:

Do we think we'll want to make changes to the library?
Does the library have other users/contributors that we want to share development efforts with?

IMO, if the answer to those questions are "yes" and "no", respectively, for a given library then we should incorporate it into Druid. If the answers are anything else then it's better to keep it separate. I think for extendedset, bytebuffer-collections, and large portions of java-util, Druid was the only user so it made sense to merge them in. Memory has at least one other active user (datasketches) so to me that suggests including it as a dependency.

If that causes problems for Druid development we could always revisit.

gianm commented 7 years ago

@leerho Could you elaborate on what's wrong with Cleaner? It's been added recently to a few places in the Druid code base and if it's scary I'd like to understand why.

gianm commented 7 years ago

@cheddar

Fwiw, I think it is better to try to do an all-or-nothing swap of the libraries and just make sure that we support the same persisted data. It's a much larger PR and will be a significant QA/code review risk, but this is also something that I think we either want to take whole-hog or not at all.

Would you also propose changing factorizeBuffered to take Memory on the aggregators?

cheddar commented 7 years ago

@gianm yes, I would propose that the BufferedAggregators essentially become MemoryAggregators. A good amount of the speed gains that we can expect from the change will come from the intermediate processing layer, which means that the aggregators would have to change.

This will have wide-reaching affects on anyone that has done their own aggregators as extensions, but I think that it is the only way to actually make the change (and that's part of the risk of the change and why I think it will take a while to get it actually accepted)

gianm commented 7 years ago

Sounds like it's time to make the 0.11.0 milestone in github then!

weijietong commented 7 years ago

Does that mean a mmap call will return the Memory , aggregators will base on the Memory too ?

cheddar commented 7 years ago

@weijietong yes, that is the proposed change. We discussed the process and we are currently exploring two phases

convert the "persist and read" paths to be based on Memory. This might affect ComplexMetricSerde but should not affect the aggregators
convert the query processing, this will affect aggregators

We will try to complete phase 1 entirely before phase 2. The Memory interface should then also provide the interface required to integrate with other systems (like a distributed FS) if that is a requirement. We can try to guess at the right "factory" interface to enable slotting in other Memory implementations as an extension point.

leerho commented 7 years ago

@gianm

"Could you elaborate on what's wrong with Cleaner? It's been added recently to a few places in the Druid code base and if it's scary I'd like to understand why." I just found out that our corp email was screwing up and has not been sending my replies :( !!I have a hunch that a number of my replies have been going into a bit-bucket!So here is the reply I sent at noon today.

Creates another dependency on a non-public internal set of classes of the JVM.- It is not clear to me that Cleaner is essential, or at least it hasn't been essential so far.

More: There is no way to achieve from Java the performance attained using Unsafe and this is why it is so widely used. Cleaner, on the other hand, is a convenience in that it allows programmers not to have to worry about freeing resources that they have allocated.

Overusing Cleaner can cause contention, and block GC operations. See https://srvaroa.github.io/java/jvm/unsafe/2016/08/28/sun-misc-cleaner.html.

There are other attempts to create public Cleaners, but from what I have read, with varied success. DirectByteBuffers (DBBs) are created with a messy process inside the JVM using VM calls, Bits.reserveMemory(), which calls an internal SharedSecrets() that tracks all instances of DBB. and of course Cleaner.

I do not recommend attempting to create your own DirectByteBuffers outside the JVM. And I don't recommend attempting to create a DBB from a Memory object. Create DBBs sparingly, and reuse them if you can.

On Tuesday, January 31, 2017 9:38 AM, Gian Merlino <notifications@github.com> wrote:

@leerho Could you elaborate on what's wrong with Cleaner? It's been added recently to a few places in the Druid code base and if it's scary I'd like to understand why.— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

weijietong commented 7 years ago

Very glad to see this process, I think this will be a great huge refactor job to Druid，but it's valuable.

Btw , the coding convention of Druid ,with too much anonymous inner class 、functional programming, makes codes verbose 、unreadable. Guava has officially announced the Caveats . Functional programming is the language like scala which is good at . If you accept this opinion, maybe we can refactor the indexing and querying part of the codes by the way at this milestone.

@leerho I think Memory may implement a reference count memory recycle algorithm like what Netty
does , not to depend on the cleaner of JVM.

leventov commented 7 years ago

@leerho @weijietong

On `Cleaner`

I don't see real options.

Mandatory manual close() / free(). I want to write database in Java, not C++98. Memory leaks, use-after-free guaranteed. It works (only with very experienced teams and if a lot of resources invested into code quality) in C++ projects because C++ developers are disciplined and have a lot of experience dealing with pitfalls of manual memory management. We are not.
Mandatory manual reference counting = shared_ptr. I don't want to write database in C++11 either. Seriously, I had experience developing "better ByteBuffer frameworks" 3-4 times, the last one is https://github.com/OpenHFT/Chronicle-Bytes. Reference counting is implemented there. My experience from this attempt: reference counting doesn't work in Java. Fortunately Chronicle Bytes also use Cleaners as safety net, so manual reference counting is not mandatory with Chronicle Bytes.
It might work in Netty, but keep in mind that Netty is a more narrowly scoped framework, there are a few patterns how buffers are used there. I'm not sure they have to share buffers between concurrent threads, we have(?) in groupBy. Nevertheless I see messages in @normanmaurer's Twitter about finding another memory leak in Netty almost every single weak.
finalize() - a slower alternative to Cleaner, which causes GC problems.
Do you know other options?

IMO DirectByteBuffer's approach with using Cleaner and not requiring manual memory management is just right for Java. The two mistakes that they made are

No "public" way to voluntary "free" the buffer before it is reclaimed by GC, however everybody use ((DirectByteBuffer) bb).clean().
It doesn't implement AutoCloseable, so couldn't be used in try-with-resources statement. This is why we have to add wrappers like MappedByteBufferHandler

But those mistakes could be easily fixed in Memory.

Creates another dependency on a non-public internal set of classes of the JVM.- It is not clear to me that Cleaner is essential, or at least it hasn't been essential so far.

It is already a dependency.

Overusing Cleaner can cause contention, and block GC operations. See https://srvaroa.github.io/java/jvm/unsafe/2016/08/28/sun-misc-cleaner.html.

At least it won't be worse than now, because every DirectByteBuffer uses Cleaner internally. There is no reason why we should use more "native" Memory objects, than DirectByteBuffer objects now. Likely we can use less Memory objects, because there is no 2 GB limitation in Memory and some distinct ByteBuffers used now probably could be coalesced.

Another advantages of using Cleaners:

Ability to interop (zero-copy conversion) Memory <-> ByteBuffer, that could appear to be needed even if we take all-or-nothing refactoring approach
This allows to use Memory just as we currently use ByteBuffers without regression. It will greatly reduce the risk and size of the refactoring.

leventov commented 7 years ago

@cheddar @gianm

On `Memory` in Druid source

Do we think we'll want to make changes to the library?

Yes during the initial refactoring and experimentation, when the refactoring is complete, maybe very rarely.

So I think the approach with breaking Memory into a small library works, if Yahoo guys change it during the refactoring so that there is complete agreement on it's design and implementation.

I think for extendedset, bytebuffer-collections, and large portions of java-util, Druid was the only user so it made sense to merge them in.

Most important reason is that we want to make changes in those libs, which affect Druid code as well, i. e. "cross-lib refactoring". This is less likely for Memory, again, if it is reviewed thoroughly during the refactoring.

leventov commented 7 years ago

@leerho

Closer look at `Memory` design

`NativeMemory`

I'm concerned about offset -> address conversion with using two instance fields: nativeRawStartAddress_ + objectBaseOffset_ + offsetBytes There should be a way to avoid this cost in critical places. Hotspot JIT itself is not very good and reliable at moving field reads outside loops, even if the fields are not changed, even if the fields are `final`. I suggest to make those offsets part of `offsetBytes` value, passed to every method. It makes `Memory` effectively stateless, if assertions (or a special boolean flag, controlling bound checks) are off. Other approaches: - Use `getAddress()` and raw `Unsafe` bypassing `Memory`. Downsides: - Losing assertion-enabled bound checks - Explicit dependencies on `Unsafe` in more parts of the code. - Add a set of `getXxxByAddress()` and `putXxxByAddress()` methods to `Memory`, i. e. wrap bound check and `Unsafe` call. Downside: more methods in the interface, which is easy to confuse unintentionally.
Mixing Unsafe get/put with some object (memArray_ is not null) and with "raw" address (memArray_ is null) at the same site may have negative performance implications. To be tested.

`MemoryRegion`

Things I'm concerned about

Mutability via assign().
volatile fields.
Delegation to "abstracted" base mem_.

I suggested instead of MemoryRegion, where it is used, to create another copy of NativeMemory or NativeMemoryR with adjusted bounds.

leerho commented 7 years ago

@leventov Thank you for your comments!

"Mandatory manual close() / free(). I want to write database in Java, not C++98. ... C++11". I don't want you to have to write in C++ either.

But if you read some of the visceral comments coming from the Oracle-dominated JCP and even some of the authors of the JEPs submitted to the JCP, we (you and I) are not writing in "Java" either as we have broken the rules by writing "... libraries and applications that, knowingly or not, make use of these non-standard, unstable and unsupported internal APIs." and "If you want to perform dangerous manual memory allocations, there are other languages for that." This is scary language by folks that have major influence on the future direction of Java. (There may be some glimmers of hope in the latest JEP 260 proposal, but we will have to wait and see.)

Clearly, we are writing Java code in a grey area where we are trying to get as close to the metal as possible and still produce robust and maintainable systems. I am open to morphing Memory, within reason, respecting current backward compatibility issues, into what would also be acceptable to the Druid designers.

finalize(): The Memory use of finalize() is pretty benign and narrow (but I'm open to other solutions). Currently, any direct memory allocation via AllocMemory(), does not actually rely on finalize() directly to free memory. Instead, if the JVM calls finalize() an exception is thrown informing the developer that he/she did not properly freeMemory(). The only instances of Memory that even expose the finalize() method are those that actually used AllocMemory() or MemoryMappedFile() to create native, off-heap memory. This was done intentionally to reduce the number of finalize objects that the JVM has to track. My understanding from @cheddar is that these are required only in a few places in Druid, and the actual number of these allocations is not huge.

It is an assumption, on my part, that the number of developers who actually have to write the code that creates and frees native memory in Druid is a small number of skilled Java developers who could quickly learn the patterns doing these allocations correctly. The number of developers that receive views of Memory for reading /writing of data can be much larger. These folks will not have the burden of worrying about freeing memory and even if they did attempt to free memory it results in a no-op.

I have more comments on some of your excellent following posts to the one above. But I have to break away for a few hours.

leventov commented 7 years ago

@leerho

Two recent PRs (which @gianm mentioned): StupidPool fixes and it's follow-up specifically move away replace finalize() to using Cleaners. Applying this change in Metamarkets's production led to 50% less GC CPU, less pauses and pauses shorter:

0.9.2-mmx-19 is the version where "StupidPool fixes" PR was backported.
This improvement happened in two pre-Christmas weeks, which are as busy as previous weeks. Extra decrease in the Christmas week is because of reduced load.

In your previous messages you raise concern about Cleaner's scalability. I agree it's not ideal and the cost is not trivial, but finalize() is even less scalable and slower.

Because of the mistake in the finalize() design that allows to "resurrect" objects during the finalize(), Objects with finalize() overridden couldn't often been reclaimed during the specific GC pause/phase even if unreachable, that often leads to OutOfMemoryErrors. After this "StupidPool fixes" backport we see much less OOMs in production: before there were several OOMs each week, after - a few (maybe less than 5) OOMs during a month and a half.

But if you read some of the visceral comments coming from the Oracle-dominated JCP and even some of the authors of the JEPs submitted to the JCP, we (you and I) are not writing in "Java" either as we have broken the rules by writing "... libraries and applications that, knowingly or not, make use of these non-standard, unstable and unsupported internal APIs." and "If you want to perform dangerous manual memory allocations, there are other languages for that." This is scary language by folks that have major influence on the future direction of Java. (There may be some glimmers of hope in the latest JEP 260 proposal, but we will have to wait and see.)

Cleaner is "less" an internal and unsupported API than say MagicAccessorImpl and SharedSecrets: sun.misc.Cleaner is standardized in Java 9 as java.lang.ref.Cleaner as a replacement for finalization.

I don't agree with the argument that "Cleaner is so hard to use right that only Java gurus and JDK authors should be allowed to do that."

We already use Cleaner successfully
In Java 9 it is exposed to the public, so considered by JDK authors to be comprehensible and safe enough to be used by all Java developers.

leerho commented 7 years ago

@leventov Ok, given that Cleaner is getting promoted, I'll concede. And if we need to migrate Memory to Cleaner, we will do it.

I want to return to your "Closer look at Memory". Thank you for your critique, your suggestions are very welcome!

First the 3 MemoryRegion issues.

The volatile modifiers I had already decided to remove.
Making MemoryRegion immutable is also possible, but there are some use cases, for example, where having a mutable version eliminates the need for the JVM to create millions of objects.
Removing the "delegation" aspect of the MR methods could also be fixed, but in my own testing I have found that JIT does a pretty good job of eliminating method call layers.
NativeMemory, offset -> address conversion issues

nativeRawStartAddress_ and objectBaseOffset_ play an important role in communicating what the backing memory actually is and how to properly set up the correct addresses for Unsafe. There is a little state table at the top of NativeMemory that illustrates the different values of these variables and the states of Memory.

The most dramatic speedup is achieved when the method called is static final. For the cases where speed is absolutely critical I create the static methods required where I pass what I call the memObj, and memAdd which are derived from a Memory instance. And in the most critical cases, where I need to access several such methods, rather than relying on class variables, I will bring them into the the top of the method to increase the probability that they end up on the stack. As an example, see lines 135-150. My point here is not to recommend that everyone do this, but that it is possible to do this from a Memory instance when required.

leventov commented 7 years ago

@leerho thanks.

Making MemoryRegion immutable is also possible, but there are some use cases, for example, where having a mutable version eliminates the need for the JVM to create millions of objects.

Sounds reasonable, let's keep it mutable.

Removing the "delegation" aspect of the MR methods could also be fixed, but in my own testing I have found that JIT does a pretty good job of eliminating method call layers.

In my experience it JIT makes inlines methods well but keeps artifact checks in the generated assembly, i. e. class checks. Making MemoryRegion delegation-free helps with it.
Also it should be pretty easy to implement without need to change the code that uses MemoryRegion.
Also consider implicit mutability problem, if MemoryRegion wraps another MemoryRegion, and the base is changed, so the derived MR is also implicitly changed. IMO it's safer and easier if we explicitly state that created MR borrows bounds and backing storage of the base MR at the moment of creation and future changes to the base MR doesn't affect the created MR.

nativeRawStartAddress_ and objectBaseOffset_ play an important role in communicating what the backing memory actually is and how to properly set up the correct addresses for Unsafe. There is a little state table at the top of NativeMemory that illustrates the different values of these variables and the states of Memory.

I understand this. My proposal is to incorporate those offsets into the primitive long offsetBytes, passed around everywhere. I. e. in each Memory, 0 is not a valid offsetBytes, bytes offsets between Memory.start() and Memory.end() (example names) are valid. start() and end() are determined by each Memory depending on the backing storage. I. e. for off-heap allocated Memory, .start() returns the allocated address. For byte[]-backed Memory, .start() is ARRAY_BYTE_BASE_OFFSET. Then, in all getXxx() and putXxx() no conversion is needed, only assertion-enabled bound checks and call to Unsafe with the given offsetBytes.

Or in other words, make what is currently returned from getCumulativeOffset(offsetBytes) the offsetBytes.

We took this approach in Chronicle Bytes' BytesStore and it worked out well.

> The most dramatic speedup is achieved when the method called is static final. For the cases where speed is absolutely critical I create the static methods required where I pass what I call the memObj, and memAdd which are derived from a Memory instance. And in the most critical cases, where I need to access several such methods, rather than relying on class variables, I will bring them into the the top of the method to increase the probability that they end up on the stack. As an example, see [lines 135-150](https://github.com/DataSketches/sketches-core/blob/master/sketches/src/main/java/com/yahoo/sketches/theta/DirectQuickSelectSketch.java). My point here is not to recommend that everyone do this, but that it is possible to do this from a Memory instance when required. This is what I described as an option [above](https://github.com/druid-io/druid/issues/3892#issuecomment-276595444): - Use getAddress() and raw Unsafe bypassing Memory. Downsides: - Losing assertion-enabled bound checks - Explicit dependencies on Unsafe in more parts of the code. And I can add to this list the concern of mixing access with null and non-null as `memObj`, that I mentioned above. It may turn or not to be a performance problem, but if it will, you should add x2 more boilerplate to the code like you referenced. Or it could be encapsulated in `Memory` by making two different impls: `OffHeapMemory` and `HeapMemory`. `ByteBuffer` also does this by splitting `DirectByteBuffer` and `HeapByteBuffer`, that could also be a sign that expressing them with a single class is suboptimal.

leerho commented 7 years ago

@leventov This is all very substantive feedback, thank you!

Also consider implicit mutability problem, if MemoryRegion wraps another MemoryRegion, and the base is changed, so the derived MR is also implicitly changed. IMO it's safer and easier if we explicitly state that created MR borrows bounds and backing storage of the base MR at the moment of creation and future changes to the base MR doesn't affect the created MR.

Good point! Fortunately hierarchical MR are not that frequent. But you are right that once created, the leaf node MR should directly reference itself back to the backing store and not its MR parent.

Use getAddress() and raw Unsafe bypassing Memory. Downsides: Losing assertion-enabled bound checks Explicit dependencies on Unsafe in more parts of the code.

Yep! But I see no way around these for truly critical speed sections. We just have to try to clearly identify and document these sections and limit the number of them.

And I can add to this list the concern of mixing access with null and non-null as memObj, that I mentioned above. It may turn or not to be a performance problem, but if it will, you should add x2 more boilerplate to the code like you referenced. Or it could be encapsulated in Memory by making two different impls: OffHeapMemory and HeapMemory.

ByteBuffer also does this by splitting DirectByteBuffer and HeapByteBuffer, that could also be a sign that expressing them with a single class is suboptimal.

Just because it was done in BB, doesn't by definition make it optimal. I'm not very impressed with the ByteBuffer architecture. The ByteBuffer hierarchy under the covers explodes into nearly 100 classes which had to be machine generated and vary hugely in their performance and all sharing inadequate and clumsy APIs. (e.g., HeapBB R/W of longs are about 8X slower than a Heap LongBuffer R/W of longs! No excuse!) Which may have been ok for reading and writing packet streams, but clumsy for anything else.

Let's discuss strategy. We have discussed a number of things:

The implicit (hierarchical) mutability issue needs to be fixed and should not be that difficult.
We agree to keep MR mutable.
The two class vars nativeRawStartAddress_ and objectBaseOffset_: If I understand your concern, it performance and that it requires two accesses of class vars instead of one. I'd have to creates some tests of this, but I have a hunch that the speed improvement would be minor compared to the penalty of the methods not being static.

I have to run now, but lets try to make a list :)

leventov commented 7 years ago

@leerho

Yep! But I see no way around these for truly critical speed sections. We just have to try to clearly identify and document these sections and limit the number of them.

I suggested a way that is likely as fast as static method approach, if Memory is a local variable in a loop, JIT will likely pull it's class check out of the loop: making "CumulativeOffset" the offsetBytes. It's explained in details in my previous comment.

It's important to make methods on Memory abstraction as fast as possible, not just provide workarounds. Consider BufferAggregator.aggregate(). We don't want all buffer aggregators to use address and Unsafe directly. The method signature should be changed to aggregate(Memory mem, long position). So we want trivial operations with Memory in BufferAggregator.aggregate() impls to be fast. Extracting Memory.getCumulativeOffset(), Memory.array() from Memory in a trivial aggregate() which makes just two operations with Memory: one get and one put (e. g. see DoubleMaxBufferAggregator) makes little sense.

AshwinJay commented 7 years ago

Flink does something similar (you may already know):

I'm surprised that you'd not want to use Netty's allocator. Are the leaks that bad or that widespread? There are some goodies in addition to the allocator:

Presto:

(Pardon the intrusion, but I'm trying to understand why almost every big open source Java project ends up creating its own Unsafe based memory manager)

niketh commented 7 years ago

@AshwinJay We aren't creating yet another unsafe based memory manager, instead we are borrowing these from DataSketches Memory lib.

@leventov On the discussion regarding cleaner vs using free, here is my take:

Memory is supposed to be allocated in large chunks so we allocate very few times
We have clear entry exit points for where Memory is being allocated (1-2 places in the whole code base) and this is the correct programming model for Memory.
We already have a finalize in Memory incase someone forgets to free explicitly. Finalize will be slower, but it is very infrequent.

Do you see value in having us use cleaner for Memory? Would love to know your thoughts.

leventov commented 7 years ago

@niketh

ByteBuffer.allocate() is used 152 times, allocateDirect() - 6 times, Files.map() - 9 times.

We already have a finalize in Memory incase someone forgets to free explicitly. Finalize will be slower, but it is very infrequent.

The problem is OOMs because of inability to reclaim finalizable objects on a given GC phase. Allocating larger objects only makes it worse.
Currently StupidPool of ByteBuffers allocates as many ByteBuffers as ResourceHolders, the effect of changing that resource holders from using finalize() to Cleaner is big. It was N ByteBuffer (Cleaner) + N ResourceHolders (finalize()). It is changed to N ByteBuffers (Cleaner) + N ResourceHolders (Cleaner). If you will change to N Memory (finalize()) + N ResourceHolders (Cleaner), I expect the effect will be just opposite to the shown above.

leventov commented 7 years ago

@niketh already having finalize() is not an argument, IMHO, because changing from finalize() to Cleaner and back is a 100 line change with zero implications on the rest of the codebase.

niketh commented 7 years ago

@leventov Makes sense, We will make Memory use cleaners instead of finalize.

leerho commented 7 years ago

@leventov

The "mutable MemoryRegion problem" has been fixed. I decided that a mutable MemoryRegion was not worth this risk, so I made it immutable. In looking at where we had used the mutability it was basically an end-around to not having a relative position and it was easy to convert these cases to use an immutable MR. @cheddar and I discussed this and we think that what would be more useful would be a true "positional" parallel interface. It is open for discussion how this would be implemented.
NativeMemory, offset -> address conversion issues. I am still looking at this.
"Removing the "delegation" aspect of the MR methods" Holding off on this until later.
Meanwhile @niketh is busy implementing Cleaner.

leventov commented 7 years ago

@leerho thanks.

I made some benchmarks, which are similar to Druid's TopN double aggregations: https://github.com/leventov/zero-cost-memory

Benchmark                                             (alloc)  Mode  Cnt   Score   Error  Units
ZeroCostMemoryBenchmark.processByteBuffer              direct  avgt    5   4.648 ± 0.106  ns/op
ZeroCostMemoryBenchmark.processByteBuffer                heap  avgt    5  15.783 ± 0.504  ns/op
ZeroCostMemoryBenchmark.processNoOffsetMemory          direct  avgt    5   3.004 ± 0.017  ns/op
ZeroCostMemoryBenchmark.processNoOffsetMemory            heap  avgt    5   2.988 ± 0.092  ns/op
ZeroCostMemoryBenchmark.processNoOffsetTwoImplMemory   direct  avgt    5   2.991 ± 0.088  ns/op
ZeroCostMemoryBenchmark.processNoOffsetTwoImplMemory     heap  avgt    5   3.009 ± 0.138  ns/op
ZeroCostMemoryBenchmark.processTwoFinalOffsetMemory    direct  avgt    5   3.675 ± 0.103  ns/op
ZeroCostMemoryBenchmark.processTwoFinalOffsetMemory      heap  avgt    5   3.700 ± 0.164  ns/op
ZeroCostMemoryBenchmark.processTwoOffsetMemory         direct  avgt    5   3.696 ± 0.116  ns/op
ZeroCostMemoryBenchmark.processTwoOffsetMemory           heap  avgt    5   3.704 ± 0.130  ns/op

Quick takeaways:

Having no Memory-internal offsets makes difference - it's faster.
Making fields final doesn't make any difference.
Separating "direct" and "heap" impl doesn't make difference.

Will investigate assembly and verify results later.

leventov commented 7 years ago

@leerho

one-offset memory is as good as no-offset:

ZeroCostMemoryBenchmark.processOneOffsetMemory   direct  avgt    5  3.008 ± 0.087  ns/op
ZeroCostMemoryBenchmark.processOneOffsetMemory     heap  avgt    5  3.017 ± 0.165  ns/op

Given this, and that moving from 0-indexed ByteBuffers to no-offset Memory (start()-indexed) is risky because requires to change carefully a lot of code, while moving to 0-indexed Memory is very straightforward (almost find-replace), I suggest to implement Memory with a single internal offset field.

leerho commented 7 years ago

Interesting. I'm ready to turn in for the night, but I did a lot of thinking about this this weekend. Also I have some benchmarks I have created as well. I need to do some updates and will share them with u tomorrow.

leerho commented 7 years ago

@leventov

I (think) I am done with the first phase of changes to Memory.

Cleaner implemented
NativeMemory can now be constructed with any of the 8 valid primitive arrays.
NativeMemory computes a cumBaseOffset = nativeBaseOffset + objectBaseOffset. This means only one instance variable is accessed for the get/put methods.
Rewrote a most of the key related javadocs (assuming I could find them all)
We found a "bug" in the original NativeMemory(byteBuffer) constructor in that if the BB was RO the resulting NM was not! The only way around this was to create a factory method that detects this and then constructs either a NM or a NMR.

The next thing I want to look at is simplifying the MR so that it extends NM rather than Memory. This may eliminate the need for MR or most of its code anyway. I need to figure out a way make this change backward compatible. Not sure how yet. Thoughts?

The other more radical thought I had would be to convert Memory into an abstract class that NM extends rather than implements. This would be a huge change that would definitely break Backward Compatibility. It also reduces future flexibility as it becomes a single inheritance tree. I really don't like this idea. Thoughts?

leventov commented 7 years ago

@leerho let's focus on the interface first, internals could be changed later.

1.

The other more radical thought I had would be to convert Memory into an abstract class that NM extends rather than implements. This would be a huge change that would definitely break Backward Compatibility. It also reduces future flexibility as it becomes a single inheritance tree. I really don't like this idea. Thoughts?

I actually like the idea and suggest just to do this.

It allows to add static factory methods right to Memory, instead of a separate Memories.
It allows to share some logic in the most natural way. E. g. I'm sure we will need to implement putUtf8(String) in Memory. Can do that right in Memory instead of separate MemoryUtils and calling it from each Memory implementation.

So there is an abstract class Memory with static factory methods for creating different kinds of Memory all other classes in the com.yahoo.memory package are package-private.

About breaking compat: IMO better to break compat early by hiding all constructors behind static factory methods that allows great flexibility later (add/remove implementations). Than to realise that we want to break compat later, after Druid is refactored to use Memory.

Make Memory Closeable or AutoCloseable to allow try-with-resources.
Move PositionalMemoryRegion suggested by @niketh from #3915 to the same package.

About implementations

The next thing I want to look at is simplifying the MR so that it extends NM rather than Memory. This may eliminate the need for MR or most of its code anyway. I need to figure out a way make this change backward compatible. Not sure how yet. Thoughts?

Since MemoryRegion is unmodifiable now, and there will be PositionalMemoryRegion, I'm losing the point of MemoryRegion.

Actually I think there are too much Memory implementations. I think everything could be implemented using just one: NativeMemory. There could be a ReadOnlyMemory super-interface of Memory, and to obtain a "read-only view" of Memory, we just cast to ReadOnlyMemory.

leerho commented 7 years ago

The memory architecture kind of grew organically from something quite simple, all the while trying to add functionality while maintaining backward compatibility. It is admittedly kind of unwieldy now and needs a complete refresh. There are some other urgent reasons I need to publish a new release of the library and the complete reworking of the memory module will take some time. So I will likely do a new release, then we can work together on what the new Memory architecture and interfaces need to look like. Let me put some thoughts into a new architecture and I'll pass it by you.

I have thought about adding Strings, but I have resisted because it is the start of a slippery slope. Strings could be done now by converting the string to a byte[] or char[] first using the encoding of your choice.

leventov commented 7 years ago

@leerho

The memory architecture kind of grew organically from something quite simple, all the while trying to add functionality while maintaining backward compatibility. It is admittedly kind of unwieldy now and needs a complete refresh. There are some other urgent reasons I need to publish a new release of the library and the complete reworking of the memory module will take some time. So I will likely do a new release, then we can work together on what the new Memory architecture and interfaces need to look like. Let me put some thoughts into a new architecture and I'll pass it by you.

Sure, there is no hurry for releasing the new API.

I have thought about adding Strings, but I have resisted because it is the start of a slippery slope. Strings could be done now by converting the string to a byte[] or char[] first using the encoding of your choice.

Yes, it is possible, but my heart bleeds when I see how inefficient it is. 3-4 unnecessary data copies, creating a lot of garbage objects, ThreadLocal access, comparison of unrelated Strings. And it is done not only during segment construction, it could be done thousands of times in the Druid historical query path, if the query uses DimensionSelector.lookupName() to build a BitSet.

Here is my API proposition:

abstract class Memory implements AutoCloseable {
  static WritableMemory allocate(long size);
  static WritableMemory allocateDirect(long size);
  static Memory wrap(ByteBuffer);
  static WritableMemory wrapForWrite(ByteBuffer);
  static Memory map(File);
  static Memory map(File, long pos, long len);
  static WritableMemory mapForWrite(File);
  static WritableMemory mapForWrite(File, long pos, long len);

  PositionedMemoryRegion region(long start, long limit);

  ByteBuffer toReadOnlyByteBuffer();

  .. get methods

  void close();
}

abstract class WritableMemory extends Memory {
  WritablePositionedMemoryRegion regionForWrite(long start, long limit);

  .. put methods
}

interface PositionedMemoryRegion extends AutoCloseable {
  Memory memory();
  long position();
  PositionedMemoryRegion position(long position);

  .. get methods without offset parameter

  void close();
}

interface WritablePositionedMemoryRegion extends PositionedMemoryRegion {
  WritableMemory memory();

  .. put methods without offset parameter
}

leerho commented 7 years ago

@leventov

This is very similar to what I was thinking. I am working out the low level details of what offset & address values to keep and where in the hierarchy.

Some subtle differences: You seem to prefer having two primary classes: a "read-only" Memory and WritableMemory. But the Memory wrapForWrite(ByteBuffer) seems inconsistent and also would throw an exception if the given BB is read-only. I would prefer having one public Memory from which one could create the read-only and writable versions. In the case of BB, we could have:

static Memory wrap(ByteBuffer);  //results in a RO `Memory` if ByteBuffer is RO.
static Memory wrapForReadOnly(ByteBuffer) //results in a RO `Memory` regardless of the state of ByteBuffer.

We would have to do something similar for mapped files.

Question about AutoCloseable: Other than being recognized by try-with-resources construct, are there any other JVM costs associated with this? Cleaner keeps a linked-list of all Cleaners, for example. More specifically, are there any hidden costs associated with having the heap-associated parts of the Memory hierarchy AutoCloseable, when they clearly don't need it?

Question about "positional" capabilities: Do we need all the capabilities now present in Buffer? Or a subset? I think we might regret doing a subset.

leventov commented 7 years ago

@leerho

Separation between "readable" and "writable" interfaces has proven to be a good move. Most recently designed (buffer) APIs have it, e. g. in Chronicle Bytes: ReadCommon and WriteCommon, in Argona: DirectBuffer and MutableDirectBuffer. Also a bit unrelated, but Kotlin has separate read-only List, Set, Map, Collection and MutableList, MutableSet, MutableMap and MutableCollection interfaces.

This is like const modifier in C++. Without separation, contracts of methods that take buffers as parameters are always unclear: could this method modify the buffer or not? If it's not, how do we protect from unintentional modification? This spawns a lot of unnecessary .asReadOnlyBuffer() in Druid source code now, out of "fear" or for "safety". If e. g. ObjectStrategy.fromByteBuffer() accepts read-only Memory, it makes the contract clear, and no "asReadOnly" conversion is needed on the boundaries: java compiler and type system check everything for us. We should avoid just two things: explicit cast to WritableMemory in our code, and using Memory as generic type argument. Both could be enforced using Checkstyle rules.

I took the idea of giving to the read-only interfaces and static factory methods short names, and to writable versions - longer names from Kotlin as well. It favors using read-only when possible.

Question about AutoCloseable: Other than being recognized by try-with-resources construct, are there any other JVM costs associated with this? Cleaner keeps a linked-list of all Cleaners, for example. More specifically, are there any hidden costs associated with having the heap-associated parts of the Memory hierarchy AutoCloseable, when they clearly don't need it?

No, there is no penalty. AutoCloseable allows special syntax construction: try-with-resources, similarly to how Iterable allows for-each loop.

Question about "positional" capabilities: Do we need all the capabilities now present in Buffer? Or a subset? I think we might regret doing a subset.

I think we don't know the answer to this question until we actually try to migrate Druid's code. @niketh needed right away in the first PR: #3915

leerho commented 7 years ago

@leventov

Some ideas:

All classes package private except where noted. Thus, only two public classes.

The classes noted as "Thought abstractions" may not be actual classes, as they are very small with only a constructor and 2-3 instance variables. They also might appear as simple interfaces.
For example, the only difference between DirectR and DirectW is the parent class that they extend. Indentation indicates "extends".

BaseMemory // Thought abstraction

    public Memory //has "absolute" Read-Only methods and launches the rest using factory methods
        DirectR // Thought abstraction. 
            MemoryDR   //Requires AutoClosable and Cleaner
            ByteBufDR
            MapDR      //Requires AutoClosable and Cleaner
        HeapR // Thought abstraction
            MemoryHR
            ByteBufHR

        MemoryW //has the "absolute" W methods
            DirectW // Thought abstraction. 
                MemoryDW   //Requires AutoClosable and Cleaner
                ByteBufDW
                MapDW      //Requires AutoClosable and Cleaner
            HeapW // Thought abstraction
                MemoryW
                ByteBufW

    public PositionalMemory //has positional "Buffer" logic & variables, positional RO methods, and launches the rest
        DirectPR // Thought abstraction. 
            MemoryPDR.    //Requires AutoClosable and Cleaner
            ByteBufPDR
            MapPDR.       //Requires AutoClosable and Cleaner
        HeapPPR // Thought abstraction
            MemoryPHR
            ByteBufPHR

        MemoryPW //positional W methods
            DirectPW // Thought abstraction. 
                MemoryPDW.   //Requires AutoClosable and Cleaner
                ByteBufPDW
                MapPDW.      //Requires AutoClosable and Cleaner
            HeapPW // Thought abstraction
                MemoryPW
                ByteBufPW

leerho commented 7 years ago

@leventov

I think you misunderstood my previous comment. I have no problem with "separation". My concern was

But the Memory wrapForWrite(ByteBuffer) seems inconsistent and also would throw an exception if the given BB is read-only.

leerho commented 7 years ago

@leventov

Go to experimental to look at early "sketch" of what I'm thinking at the moment. The hierarchical outline is in package-info.

leventov commented 7 years ago

@leerho

But the Memory wrapForWrite(ByteBuffer) seems inconsistent and also would throw an exception if the given BB is read-only.

Inconsistent with what? Yes, it should throw exception if the given BB is read-only. Why this is bad?

Generally I think wrap(BB) might even not be needed at all. It could be needed only on the boundaries with third-party libraries, which use ByteBuffer. Even if it is needed, it is going to be used in a very few places.

leerho commented 7 years ago

@leventov

DataSketches will need wrap(BB) as there are many other users outside of Druid who have BB.

My comment on inconsistent was a misinterpretation of what you were trying to do. Never mind :)

In experimental I have the first skeleton round-trip working: reading and writing longs to direct memory. Take a look.

leventov commented 7 years ago

@leerho

I don't see your arguments against the idea of separating read-only and writable on the type system level, by making two separate interfaces, rather than on runtime level, by having impls which throw UnsupportedOperationException.

My arguments pro this idea are:

Clearer for readers, safer interfaces of methods like ObjectStrategy.read()
Don't need asReadOnly() conversion on boundaries, there is automatic upcast to the read-only interface.
Can have twice less classes in the memory package, R/W separation not needed.

leerho commented 7 years ago

@leventov

OK. You are right. I thought I would be able to save some code duplication of the R/W methods, but there will the same amount of code duplication either way. The current scheme requires 1 read method and 2 write methods (one throws). Full splitting at the public level will require 2 reads and 1 write.

I don't see the 2X reduction in number of classes. Most of the classes I have at the leaf level (MemoryDR, etc.) can be consolidated as the sister leafs only differ by the factory method. I split them up just for my own clarity while laying it out.

I will try it your way and see what happens :)

leerho commented 7 years ago

@leventov

Checked in the totally split skeleton into experimental. Again I don't see a 2X difference in number of classes just due to this change.

Why did you do the following:

  static Memory wrapForWrite(ByteBuffer);

I would think you would want the following, which is simpler:

  static WritableMemory wrap(ByteBuffer);

The same applies to map.

leventov commented 7 years ago

@leerho thank you!

Why did you do the following:

static Memory wrapForWrite(ByteBuffer); I would think you would want the following, which is simpler:

It's a mistake in my comment (now fixed), yes definitely it should return the writable interface.

Mostly LGTM, a couple of comments:

Memory allocate() doesn't make sense, only reading from just allocated memory.
We completely forgot about #3716 and the idea to use Memory instead of it's DataInput and DataOutput. For this, it seems to me that Memory should be stateless (no fields) and have a public constructor. Essentially, it should be like Java 8's interface with static factory methods and default implementations of some methods. We use abstract class just in order to overcome artificial limitations of interfaces in Java 7. Also would really like to hear @weijietong's feedback on the interfaces that we are designing.

leventov commented 7 years ago

@leerho regarding x2 less classes, I meant that the separation between read-only and writable implementations is not needed. There are only writable implementations. The only thing that e. g. map() does is delegation to mapWritable(), and return the read-only interface. It is safe on the type-system level with very little precautions. Your approach with runtime separation provides extra safety, if you want to keep this safety - yes, you either still have x2 classes, or e. g. have a boolean flag "writable" and check it guarded by assertion, e. g. in checkOffset() method, along with the offset check.

apache / druid