Closed leventov closed 1 year ago
Unresolved problems

+ Design

Memory's bounds are checked optionally, via assertions or guarded by a static final boolean flag; "local" position and limit are checked explicitly and manually, with helper methods or with versions of the Memory read and write methods which accept a read/write position and "local" limits.

@leventov there's a problem with moving Memory to the Druid source tree, which is that it would then make the current implementations of Sketches not work. They would have to either be re-implemented depending on Druid (which is a huge dependency for them to take) or we would have to have some sort of compatibility layer between the two.
I do think that it is likely for Druid to have some classes that make it easier to use Memory for various operations, which would definitely live in Druid. The "read-only" version seems like something we should get the Memory library to support.
The non-native-endian problem basically means that our initial version of the integration will have to do non-native-endian reads (i.e. convert after reading non-native-endian value). Then, in a subsequent version, we will have to create a new serialization version that is only native endian.
The close/free stuff is already done.
The MagicAccessorImpl thing looks cool. Is that basically a thing where we can put any arbitrary address and length in and it gives back a DirectByteBuffer that uses that? If so, that could be useful, yeah.
@cheddar I think sooner or later the proposed Memory should become a part of the Druid source anyway, because it is a critical dependency, and it's not good to depend on a PR review/release cycle done exclusively by Yahoo rather than by the Druid community. This is the same story as with java-util/bytebuffer-collections/extendedset and Metamarkets. Sooner, i.e. from the very beginning, is better.
Another solution is to move DataSketches lib to github.com/druid-io.
We could also add DataSketches's Memory <-> Druid's Memory conversion. Since Druid's Memory is supposed to be based on the DataSketches one, it should be easy.

What do people from Imply (@gianm, @fjy) think about this question?
> The MagicAccessorImpl thing looks cool. Is that basically a thing where we can put any arbitrary address and length in and it gives back a DirectByteBuffer that uses that? If so, that could be useful, yeah.

I think it is needed to be able to make the refactoring gradual, not all-or-nothing. But it is super-ugly, likely won't work on any JDK except OpenJDK/Oracle JDK, and likely won't work on Java 9 at all. So it had better be a temporary hack. If there are huge libraries which Druid depends on that require ByteBuffer, that could also be a problem, because it would mean the hack couldn't be removed.
@leventov @cheddar ,
+ Goal
Fwiw, I think it is better to try to do an all-or-nothing swap of the libraries and just make sure that we support the same persisted data. It's a much larger PR and will be a significant QA/code review risk, but this is also something that I think we either want to take whole-hog or not at all.
I propose we bring this up at the dev sync tomorrow, I'll make sure to mention it.
@leventov @cheddar Interesting discussion. I just have a few comments
> Memory object has to be "sliced", ...

The MemoryRegion(R) is a view of its parent, as "slice" is for ByteBuffer; no data is copied. If the concern is the over-creation of these shell objects: with Memory you could create a read-only view of the parent once, and then just pass the offset and permitted length to the downstream users of some sub-chunk; no object creation at all. This is a little messy for the downstream users but safe. With BB this is not possible, because the position, limit, etc. are always writable, which could screw up the parent.
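The pattern described here (one shared read-only view plus explicit offset/length passing) can be sketched with plain JDK ByteBuffer. This is a hypothetical illustration: `readLongAt` and the layout are invented names, and absolute gets stand in for Memory-style access.

```java
import java.nio.ByteBuffer;

public class Main {
    // Hypothetical downstream reader: instead of allocating a view object per
    // sub-chunk, callers receive the shared read-only parent plus an explicit
    // offset and permitted length -- no per-chunk object creation.
    static long readLongAt(ByteBuffer readOnlyParent, int offset, int permittedLength) {
        if (permittedLength < Long.BYTES) {
            throw new IllegalArgumentException("sub-chunk too small for a long");
        }
        // Absolute get: it never reads or writes position/limit, so downstream
        // users cannot screw up the parent's state.
        return readOnlyParent.getLong(offset);
    }

    public static void main(String[] args) {
        ByteBuffer parent = ByteBuffer.allocate(64);
        parent.putLong(16, 42L);                   // absolute put into the parent
        ByteBuffer ro = parent.asReadOnlyBuffer(); // created once, shared by all readers
        System.out.println(readLongAt(ro, 16, 8)); // prints 42
    }
}
```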
MagicAccessorImpl: I've been looking at this and it concerns me. DirectByteBuffer requires calling the internal Bits.reserveMemory() and unreserveMemory(), which call sun.misc.SharedSecrets, which keeps track of all created instances of DirectByteBuffer so that they can be properly freed. I don't see any mechanism in MagicAccessorImpl that does this. Of course, I may not fully grok how it works :).

Using the JVM Cleaner: although this might be possible, the Java gurus that I have spoken with, who work on the internals of the JVM, advise me to never use Cleaner, as it can be quite dangerous (and hazardous to your mental health :) ). This and the above point are why I don't recommend trying to create a DirectByteBuffer from a Memory. Creating a HeapByteBuffer from a Memory is trivial and can be done with the current code.
Memory already has read-only implementations of NativeMemory and MemoryRegion.
Memory has no dependencies outside Java, and you can think of it as a wrapper around Unsafe. This being the case, it would be possible to create classes in Memory that leverage the same UnsafeUtil and include position, limit, rewind, etc. But is this really necessary? Being able to read old data should not be a problem, but using Memory is a different programming paradigm that new downstream applications would need to get used to.
Memory was designed as a fast mechanism for random access of different primitive types sharing the same backing memory, similar to a C struct and union combined. ByteBuffer was designed for fast stream-based IO buffer operations, for which it does a splendid job. A TCP/IP packet stream cannot be read with a random-access paradigm, so there is no choice there. Data in RAM can be read either randomly or sequentially, depending on the application. But it seems to me overly restrictive for Druid to require that downstream processes use a sequential paradigm in order to process their data out of RAM. Processing sketch data structures is a good example that requires random access, but I'm sure there will be many others.
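The struct/union analogy above can be made concrete with absolute-indexed accessors over a single backing buffer. The layout and field names below are invented purely for illustration:

```java
import java.nio.ByteBuffer;

public class Main {
    // Illustrative struct-like layout over one backing memory:
    //   bytes [0..7]   long   count
    //   bytes [8..15]  double sum
    //   bytes [16..19] int    flags
    public static void main(String[] args) {
        ByteBuffer mem = ByteBuffer.allocate(20);
        // Random-access writes and reads of different primitive types, all
        // sharing the same backing bytes, in any order -- no positional state.
        mem.putInt(16, 0x1);
        mem.putDouble(8, 3.5);
        mem.putLong(0, 7L);
        System.out.println(mem.getLong(0));   // prints 7
        System.out.println(mem.getDouble(8)); // prints 3.5
        System.out.println(mem.getInt(16));   // prints 1
    }
}
```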
Converting Druid away from ByteBuffers will be a lot of work no matter what path is chosen.
@leventov On the question of whether Memory should be part of Druid or not, my thought is that it is weird for Druid to depend on sketches-core, but if the Memory stuff was broken out into its own library that both Druid and sketches-core used, then that could be fine. To me the decision of incorporating anything into the Druid tree or using it as a dependency should be made based on:

1. Do we think we'll want to make changes to the library?
2. Does it have active users other than Druid?
IMO, if the answer to those questions are "yes" and "no", respectively, for a given library then we should incorporate it into Druid. If the answers are anything else then it's better to keep it separate. I think for extendedset, bytebuffer-collections, and large portions of java-util, Druid was the only user so it made sense to merge them in. Memory has at least one other active user (datasketches) so to me that suggests including it as a dependency.
If that causes problems for Druid development we could always revisit.
@leerho Could you elaborate on what's wrong with Cleaner? It's been added recently to a few places in the Druid code base and if it's scary I'd like to understand why.
@cheddar
> Fwiw, I think it is better to try to do an all-or-nothing swap of the libraries and just make sure that we support the same persisted data. It's a much larger PR and will be a significant QA/code review risk, but this is also something that I think we either want to take whole-hog or not at all.
Would you also propose changing factorizeBuffered to take Memory on the aggregators?
@gianm yes, I would propose that the BufferedAggregators essentially become MemoryAggregators. A good amount of the speed gains that we can expect from the change will come from the intermediate processing layer, which means that the aggregators would have to change.
This will have wide-reaching effects on anyone that has done their own aggregators as extensions, but I think that it is the only way to actually make the change (and that's part of the risk of the change and why I think it will take a while to get it actually accepted).
Sounds like it's time to make the 0.11.0 milestone in github then!
Does that mean an mmap call will return a Memory, and the aggregators will be based on Memory too?
@weijietong yes, that is the proposed change. We discussed the process and we are currently exploring two phases; the first involves ComplexMetricSerde but should not affect the aggregators. We will try to complete phase 1 entirely before phase 2. The Memory interface should then also provide the interface required to integrate with other systems (like a distributed FS) if that is a requirement. We can try to guess at the right "factory" interface to enable slotting in other Memory implementations as an extension point.
@gianm
"Could you elaborate on what's wrong with Cleaner? It's been added recently to a few places in the Druid code base and if it's scary I'd like to understand why." I just found out that our corp email was screwing up and has not been sending my replies :(!! I have a hunch that a number of my replies have been going into a bit-bucket! So here is the reply I sent at noon today.
More: There is no way to achieve from Java the performance attained using Unsafe and this is why it is so widely used. Cleaner, on the other hand, is a convenience in that it allows programmers not to have to worry about freeing resources that they have allocated.
Overusing Cleaner can cause contention, and block GC operations. See https://srvaroa.github.io/java/jvm/unsafe/2016/08/28/sun-misc-cleaner.html.
There are other attempts to create public Cleaners, but from what I have read, with varied success. DirectByteBuffers (DBBs) are created with a messy process inside the JVM using VM calls, Bits.reserveMemory(), which calls an internal SharedSecrets() that tracks all instances of DBB, and of course Cleaner.
I do not recommend attempting to create your own DirectByteBuffers outside the JVM. And I don't recommend attempting to create a DBB from a Memory object. Create DBBs sparingly, and reuse them if you can.
On Tuesday, January 31, 2017 9:38 AM, Gian Merlino <notifications@github.com> wrote:

> @leerho Could you elaborate on what's wrong with Cleaner? It's been added recently to a few places in the Druid code base and if it's scary I'd like to understand why.
Very glad to see this process. I think this will be a great, huge refactoring job for Druid, but it's valuable.

Btw, the coding convention of Druid, with so many anonymous inner classes and functional-style constructs, makes the code verbose and hard to read. Guava has officially announced the caveats. Functional programming is what languages like Scala are good at. If you accept this opinion, maybe we can refactor the indexing and querying parts of the code along the way at this milestone.
@leerho I think Memory could implement a reference-counting memory recycling algorithm like Netty does, rather than depending on the JVM's Cleaner.
@leerho @weijietong Besides Cleaner, I don't see real options:

- Mandatory manual close() / free(). I want to write a database in Java, not C++98. Memory leaks and use-after-free are guaranteed. It works in C++ projects (only with very experienced teams, and with a lot of resources invested into code quality) because C++ developers are disciplined and have a lot of experience dealing with the pitfalls of manual memory management. We are not.

- Mandatory manual reference counting, i.e. shared_ptr. I don't want to write a database in C++11 either. Seriously, I have had experience developing "better ByteBuffer frameworks" 3-4 times; the last one is https://github.com/OpenHFT/Chronicle-Bytes. Reference counting is implemented there. My experience from that attempt: reference counting doesn't work in Java. Fortunately Chronicle Bytes also uses Cleaners as a safety net, so manual reference counting is not mandatory with Chronicle Bytes.

  It might work in Netty, but keep in mind that Netty is a more narrowly scoped framework; there are only a few patterns in which buffers are used there. I'm not sure they have to share buffers between concurrent threads, as we (I think) do in groupBy. Nevertheless I see messages in @normanmaurer's Twitter about finding another memory leak in Netty almost every single week.

- finalize(): a slower alternative to Cleaner, which causes GC problems.

Do you know other options?
IMO DirectByteBuffer's approach of using Cleaner and not requiring manual memory management is just right for Java. The two mistakes that they made are:

- There is no public API for eager deallocation, hence hacks like ((DirectByteBuffer) bb).clean().
- It doesn't implement AutoCloseable, so it couldn't be used in a try-with-resources statement. This is why we have to add wrappers like MappedByteBufferHandler.

But those mistakes could be easily fixed in Memory.
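A minimal sketch of how both fixes could look, assuming java.lang.ref.Cleaner (the Java 9 standardization of sun.misc.Cleaner) as the safety net. ManagedMemory and Deallocator are invented names, and the actual freeing call is stubbed out:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: deterministic close() usable in try-with-resources,
// with a Cleaner as the safety net if close() is forgotten.
class ManagedMemory implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action must not capture the ManagedMemory instance itself,
    // or the instance could never become phantom-reachable.
    static final class Deallocator implements Runnable {
        final AtomicBoolean freed = new AtomicBoolean(false);
        @Override public void run() {
            if (freed.compareAndSet(false, true)) {
                // A real implementation would call Unsafe.freeMemory(address) here.
            }
        }
    }

    private final Deallocator deallocator = new Deallocator();
    private final Cleaner.Cleanable cleanable = CLEANER.register(this, deallocator);

    @Override public void close() {
        cleanable.clean(); // runs the Deallocator at most once, eagerly
    }

    boolean isFreed() { return deallocator.freed.get(); }
}

public class Main {
    public static void main(String[] args) {
        try (ManagedMemory mem = new ManagedMemory()) {
            // use mem ...
        } // close() frees deterministically; the Cleaner then becomes a no-op
    }
}
```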
> Creates another dependency on a non-public internal set of classes of the JVM. It is not clear to me that Cleaner is essential, or at least it hasn't been essential so far.

It is already a dependency.
> Overusing Cleaner can cause contention, and block GC operations. See https://srvaroa.github.io/java/jvm/unsafe/2016/08/28/sun-misc-cleaner.html.
At least it won't be worse than now, because every DirectByteBuffer uses Cleaner internally. There is no reason why we should use more "native" Memory objects than DirectByteBuffer objects now. Likely we can use fewer Memory objects, because there is no 2 GB limitation in Memory, and some distinct ByteBuffers used now could probably be coalesced.
Other advantages of using Cleaners:

- Memory <-> ByteBuffer conversion, which could turn out to be needed even if we take the all-or-nothing refactoring approach.
- We could use Memory just as we currently use ByteBuffers, without regression. It will greatly reduce the risk and size of the refactoring.

@cheddar @gianm
On Memory in the Druid source:

> Do we think we'll want to make changes to the library?

Yes during the initial refactoring and experimentation; when the refactoring is complete, maybe very rarely.
So I think the approach of breaking Memory out into a small library works, if the Yahoo guys change it during the refactoring so that there is complete agreement on its design and implementation.

> I think for extendedset, bytebuffer-collections, and large portions of java-util, Druid was the only user so it made sense to merge them in.

The most important reason is that we want to make changes in those libs which affect Druid code as well, i.e. "cross-lib refactoring". This is less likely for Memory, again, if it is reviewed thoroughly during the refactoring.
@leerho

On the Memory design: NativeMemory resolves every access through nativeRawStartAddress_ + objectBaseOffset_ + offsetBytes. There should be a way to avoid this cost in critical places. The HotSpot JIT itself is not very good or reliable at moving field reads outside loops, even if the fields are not changed, even if the fields are `final`.

I suggest making those offsets part of the `offsetBytes` value passed to every method. That makes `Memory` effectively stateless, if assertions (or a special boolean flag, controlling bound checks) are off.
Other approaches:
- Use `getAddress()` and raw `Unsafe` bypassing `Memory`. Downsides:
- Losing assertion-enabled bound checks
- Explicit dependencies on `Unsafe` in more parts of the code.
- Add a set of `getXxxByAddress()` and `putXxxByAddress()` methods to `Memory`, i.e. wrap the bound check and the `Unsafe` call. Downside: more methods in the interface, which are easy to confuse unintentionally.
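For reference, the assertion-enabled bound check discussed here (and the static-final-flag variant from the problem statement at the top) might look roughly like this. BoundedMemory and the system property name are invented for illustration:

```java
// Hypothetical sketch of optional bound checking. CHECK_BOUNDS is a static
// final boolean: the JIT treats it as a compile-time constant, so when it is
// false the guarded check compiles away entirely. The assert variant is
// active only when the JVM runs with -ea.
class BoundedMemory {
    private static final boolean CHECK_BOUNDS = Boolean.getBoolean("memory.checkBounds");

    private final long capacityBytes;

    BoundedMemory(long capacityBytes) { this.capacityBytes = capacityBytes; }

    // Validates the access and passes the offset through unchanged.
    long checkedOffset(long offsetBytes, long lengthBytes) {
        if (CHECK_BOUNDS) {
            if (offsetBytes < 0 || lengthBytes < 0
                    || offsetBytes + lengthBytes > capacityBytes) {
                throw new IndexOutOfBoundsException("offset=" + offsetBytes
                        + " length=" + lengthBytes + " capacity=" + capacityBytes);
            }
        }
        // Assertion-enabled variant: zero cost unless -ea is passed.
        assert offsetBytes >= 0 && lengthBytes >= 0
                && offsetBytes + lengthBytes <= capacityBytes;
        return offsetBytes;
    }
}

public class Main {
    public static void main(String[] args) {
        System.out.println(new BoundedMemory(64).checkedOffset(16, 8)); // prints 16
    }
}
```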
Mixing access with an on-heap array (memArray_ is not null) and with a "raw" address (memArray_ is null) at the same call site may have negative performance implications. To be tested.
On MemoryRegion, things I'm concerned about:

- assign(), i.e. mutability.
- volatile fields.
- Delegation of every call to mem_.

I suggested, instead of MemoryRegion, where it is used, to create another copy of NativeMemory or NativeMemoryR with adjusted bounds.
@leventov Thank you for your comments!
But if you read some of the visceral comments coming from the Oracle-dominated JCP and even some of the authors of the JEPs submitted to the JCP, we (you and I) are not writing in "Java" either as we have broken the rules by writing "... libraries and applications that, knowingly or not, make use of these non-standard, unstable and unsupported internal APIs." and "If you want to perform dangerous manual memory allocations, there are other languages for that." This is scary language by folks that have major influence on the future direction of Java. (There may be some glimmers of hope in the latest JEP 260 proposal, but we will have to wait and see.)
Clearly, we are writing Java code in a grey area where we are trying to get as close to the metal as possible and still produce robust and maintainable systems. I am open to morphing Memory, within reason, respecting current backward compatibility issues, into what would also be acceptable to the Druid designers.
finalize(): The Memory use of finalize() is pretty benign and narrow (but I'm open to other solutions). Currently, any direct memory allocation via AllocMemory() does not actually rely on finalize() directly to free memory. Instead, if the JVM calls finalize(), an exception is thrown informing the developer that he/she did not properly freeMemory(). The only instances of Memory that even expose the finalize() method are those that actually used AllocMemory() or MemoryMappedFile() to create native, off-heap memory. This was done intentionally to reduce the number of finalizable objects that the JVM has to track. My understanding from @cheddar is that these are required only in a few places in Druid, and the actual number of these allocations is not huge. It is an assumption, on my part, that the number of developers who actually have to write the code that creates and frees native memory in Druid is a small number of skilled Java developers who could quickly learn the patterns for doing these allocations correctly. The number of developers that receive views of Memory for reading/writing of data can be much larger. These folks will not have the burden of worrying about freeing memory, and even if they did attempt to free memory it would result in a no-op.
I have more comments on some of your excellent following posts to the one above. But I have to break away for a few hours.
@leerho

Two recent PRs (which @gianm mentioned), the StupidPool fixes and its follow-up, specifically replace finalize() with Cleaners. Applying this change in Metamarkets's production led to 50% less GC CPU, fewer pauses, and shorter pauses:
In your previous messages you raised a concern about Cleaner's scalability. I agree it's not ideal and the cost is not trivial, but finalize() is even less scalable and slower.
Because of the mistake in the finalize() design that allows objects to be "resurrected" during finalize(), objects with finalize() overridden often can't be reclaimed during a specific GC pause/phase even if unreachable, which often leads to OutOfMemoryErrors. After this "StupidPool fixes" backport we see far fewer OOMs in production: before, there were several OOMs each week; after, a few (maybe fewer than 5) OOMs during a month and a half.
> But if you read some of the visceral comments coming from the Oracle-dominated JCP and even some of the authors of the JEPs submitted to the JCP, we (you and I) are not writing in "Java" either as we have broken the rules by writing "... libraries and applications that, knowingly or not, make use of these non-standard, unstable and unsupported internal APIs." and "If you want to perform dangerous manual memory allocations, there are other languages for that." This is scary language by folks that have major influence on the future direction of Java. (There may be some glimmers of hope in the latest JEP 260 proposal, but we will have to wait and see.)

Cleaner is "less" of an internal and unsupported API than, say, MagicAccessorImpl and SharedSecrets: sun.misc.Cleaner is standardized in Java 9 as java.lang.ref.Cleaner, as a replacement for finalization.
I don't agree with the argument that "Cleaner is so hard to use right that only Java gurus and JDK authors should be allowed to do that." Cleaner can be used successfully.

@leventov Ok, given that Cleaner is getting promoted, I'll concede. And if we need to migrate Memory to Cleaner, we will do it.
I want to return to your "Closer look at Memory". Thank you for your critique, your suggestions are very welcome!
First, the three MemoryRegion issues:

- The volatile modifiers I had already decided to remove.
- Making MemoryRegion immutable is also possible, but there are some use cases, for example, where having a mutable version eliminates the need for the JVM to create millions of objects.
- Removing the "delegation" aspect of the MR methods could also be fixed, but in my own testing I have found that the JIT does a pretty good job of eliminating method call layers.
NativeMemory, offset -> address conversion issues: nativeRawStartAddress_ and objectBaseOffset_ play an important role in communicating what the backing memory actually is and how to properly set up the correct addresses for Unsafe. There is a little state table at the top of NativeMemory that illustrates the different values of these variables and the states of Memory.

The most dramatic speedup is achieved when the method called is static final. For the cases where speed is absolutely critical I create the static methods required, where I pass what I call the memObj and memAdd, which are derived from a Memory instance. And in the most critical cases, where I need to access several such methods, rather than relying on class variables, I will bring them into the top of the method to increase the probability that they end up on the stack. As an example, see lines 135-150. My point here is not to recommend that everyone do this, but that it is possible to do this from a Memory instance when required.
@leerho thanks.
> Making MemoryRegion immutable is also possible, but there are some use cases, for example, where having a mutable version eliminates the need for the JVM to create millions of objects.

Sounds reasonable, let's keep it mutable.
> Removing the "delegation" aspect of the MR methods could also be fixed, but in my own testing I have found that JIT does a pretty good job of eliminating method call layers.

Making MemoryRegion delegation-free helps with it.

Also consider the implicit mutability problem: if a MemoryRegion wraps another MemoryRegion and the base is changed, the derived MR is also implicitly changed. IMO it's safer and easier if we explicitly state that the created MR borrows the bounds and backing storage of the base MR at the moment of creation, and future changes to the base MR don't affect the created MR.
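For comparison, ByteBuffer.slice() already has these borrow-at-creation semantics: the region's bounds are captured when the slice is made, and later changes to the base's position and limit do not move the region. A small runnable illustration:

```java
import java.nio.ByteBuffer;

public class Main {
    public static void main(String[] args) {
        ByteBuffer base = ByteBuffer.allocate(32);
        base.position(8).limit(24);
        ByteBuffer region = base.slice(); // bounds borrowed at creation: 16 bytes
        base.position(0).limit(32);       // later changes to the base...
        System.out.println(region.capacity()); // prints 16: ...don't move the region
    }
}
```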
> nativeRawStartAddress_ and objectBaseOffset_ play an important role in communicating what the backing memory actually is and how to properly set up the correct addresses for Unsafe. There is a little state table at the top of NativeMemory that illustrates the different values of these variables and the states of Memory.
I understand this. My proposal is to incorporate those offsets into the primitive long offsetBytes, passed around everywhere. I.e. in each Memory, 0 is not a valid offsetBytes; byte offsets between Memory.start() and Memory.end() (example names) are valid. start() and end() are determined by each Memory depending on the backing storage. I.e. for off-heap allocated Memory, start() returns the allocated address. For byte[]-backed Memory, start() is ARRAY_BYTE_BASE_OFFSET. Then, in all getXxx() and putXxx(), no conversion is needed, only assertion-enabled bound checks and the call to Unsafe with the given offsetBytes.

Or in other words: make what is currently returned from getCumulativeOffset(offsetBytes) the offsetBytes.
We took this approach in Chronicle Bytes' BytesStore and it worked out well.
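A runnable toy illustration of the two addressing schemes, using plain longs in place of real Unsafe addresses. All names are invented, and ARRAY_BYTE_BASE_OFFSET here is just a stand-in constant:

```java
// In the current scheme the base offset is added on every access; in the
// proposed scheme it is folded into offsetBytes once (e.g. at creation or
// slicing time), so getXxx()/putXxx() pass offsetBytes to Unsafe unchanged.
public class Main {
    static final long ARRAY_BYTE_BASE_OFFSET = 16; // illustrative stand-in

    // Current NativeMemory style: per-access addition of stateful fields.
    static long resolvePerAccess(long objectBaseOffset, long offsetBytes) {
        return objectBaseOffset + offsetBytes;
    }

    // Proposed style: offsetBytes is already cumulative; resolution is identity.
    static long resolveCumulative(long cumulativeOffsetBytes) {
        return cumulativeOffsetBytes;
    }

    public static void main(String[] args) {
        long start = ARRAY_BYTE_BASE_OFFSET; // Memory.start() for byte[] backing
        long logical = 24;                   // 0-indexed offset, current scheme
        long cumulative = start + logical;   // computed once, not per access
        // Both schemes resolve to the same address; only the proposed one is
        // addition-free on the hot path.
        System.out.println(resolvePerAccess(start, logical)
                == resolveCumulative(cumulative)); // prints true
    }
}
```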
> The most dramatic speedup is achieved when the method called is static final. For the cases where speed is absolutely critical I create the static methods required where I pass what I call the memObj, and memAdd which are derived from a Memory instance. And in the most critical cases, where I need to access several such methods, rather than relying on class variables, I will bring them into the top of the method to increase the probability that they end up on the stack. As an example, see [lines 135-150](https://github.com/DataSketches/sketches-core/blob/master/sketches/src/main/java/com/yahoo/sketches/theta/DirectQuickSelectSketch.java). My point here is not to recommend that everyone do this, but that it is possible to do this from a Memory instance when required.

This is what I described as an option [above](https://github.com/druid-io/druid/issues/3892#issuecomment-276595444):

- Use getAddress() and raw Unsafe bypassing Memory. Downsides:
  - Losing assertion-enabled bound checks
  - Explicit dependencies on Unsafe in more parts of the code.

And I can add to this list the concern of mixing access with null and non-null as `memObj`, which I mentioned above. It may or may not turn out to be a performance problem, but if it does, you would have to add 2x more boilerplate to code like you referenced. Or it could be encapsulated in `Memory` by making two different impls: `OffHeapMemory` and `HeapMemory`. `ByteBuffer` also does this by splitting `DirectByteBuffer` and `HeapByteBuffer`, which could also be a sign that expressing them with a single class is suboptimal.
@leventov This is all very substantive feedback, thank you!
> Also consider the implicit mutability problem: if a MemoryRegion wraps another MemoryRegion and the base is changed, the derived MR is also implicitly changed. IMO it's safer and easier if we explicitly state that the created MR borrows the bounds and backing storage of the base MR at the moment of creation, and future changes to the base MR don't affect the created MR.
Good point! Fortunately hierarchical MRs are not that frequent. But you are right that once created, the leaf-node MR should reference directly back to the backing store and not to its MR parent.
> Use getAddress() and raw Unsafe bypassing Memory. Downsides: losing assertion-enabled bound checks; explicit dependencies on Unsafe in more parts of the code.
Yep! But I see no way around these for truly critical speed sections. We just have to try to clearly identify and document these sections and limit the number of them.
> And I can add to this list the concern of mixing access with null and non-null as memObj, which I mentioned above. It may or may not turn out to be a performance problem, but if it does, you would have to add 2x more boilerplate to code like you referenced. Or it could be encapsulated in Memory by making two different impls: OffHeapMemory and HeapMemory. ByteBuffer also does this by splitting DirectByteBuffer and HeapByteBuffer, which could also be a sign that expressing them with a single class is suboptimal.
Just because it was done in BB doesn't by definition make it optimal. I'm not very impressed with the ByteBuffer architecture. Under the covers, the ByteBuffer hierarchy explodes into nearly 100 classes, which had to be machine-generated, vary hugely in their performance, and all share inadequate and clumsy APIs (e.g., HeapBB R/W of longs is about 8X slower than Heap LongBuffer R/W of longs! No excuse!). That may have been OK for reading and writing packet streams, but it is clumsy for anything else.
Let's discuss strategy. We have discussed a number of things:

nativeRawStartAddress_ and objectBaseOffset_: If I understand your concern, it is performance, and that it requires two accesses of class vars instead of one. I'd have to create some tests of this, but I have a hunch that the speed improvement would be minor compared to the penalty of the methods not being static. I have to run now, but let's try to make a list :)
@leerho
> Yep! But I see no way around these for truly critical speed sections. We just have to try to clearly identify and document these sections and limit the number of them.
I suggested a way that is likely as fast as the static-method approach: making the "cumulative offset" the offsetBytes. If Memory is a local variable in a loop, the JIT will likely pull its class check out of the loop. It's explained in detail in my previous comment.
It's important to make methods on the Memory abstraction as fast as possible, not just to provide workarounds. Consider BufferAggregator.aggregate(). We don't want all buffer aggregators to use an address and Unsafe directly. The method signature should be changed to aggregate(Memory mem, long position). So we want trivial operations with Memory in BufferAggregator.aggregate() impls to be fast. Extracting Memory.getCumulativeOffset() and Memory.array() from Memory in a trivial aggregate() that makes just two operations with Memory, one get and one put (e.g. see DoubleMaxBufferAggregator), makes little sense.
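A hypothetical sketch of the proposed signature. The Memory interface here is reduced to two methods and backed by a double[] so the sketch runs without Unsafe, and the explicit value parameter stands in for the value an aggregator would normally obtain from Druid's column selector:

```java
// Minimal stand-ins, invented for illustration.
interface Memory {
    double getDouble(long offsetBytes);
    void putDouble(long offsetBytes, double value);
}

interface BufferAggregator {
    // Proposed: aggregate(Memory mem, long position) rather than
    // aggregate(ByteBuffer buf, int position).
    void aggregate(Memory mem, long position, double value);
}

// A DoubleMax-style aggregator: exactly one get and one put per row, which is
// why the Memory calls themselves must be cheap.
class DoubleMaxMemoryAggregator implements BufferAggregator {
    @Override public void aggregate(Memory mem, long position, double value) {
        mem.putDouble(position, Math.max(mem.getDouble(position), value));
    }
}

// double[]-backed Memory so the sketch runs anywhere.
class ArrayBackedMemory implements Memory {
    private final double[] slots;
    ArrayBackedMemory(int capacityDoubles) { slots = new double[capacityDoubles]; }
    @Override public double getDouble(long offsetBytes) {
        return slots[(int) (offsetBytes >>> 3)]; // byte offset -> slot index
    }
    @Override public void putDouble(long offsetBytes, double value) {
        slots[(int) (offsetBytes >>> 3)] = value;
    }
}

public class Main {
    public static void main(String[] args) {
        Memory mem = new ArrayBackedMemory(4);
        BufferAggregator agg = new DoubleMaxMemoryAggregator();
        for (double v : new double[] {3.0, 5.0, 2.0}) {
            agg.aggregate(mem, 0L, v); // aggregate at byte position 0
        }
        System.out.println(mem.getDouble(0L)); // prints 5.0
    }
}
```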
Flink does something similar (you may already know):
I'm surprised that you'd not want to use Netty's allocator. Are the leaks that bad or that widespread? There are some goodies in addition to the allocator:
Presto:
(Pardon the intrusion, but I'm trying to understand why almost every big open source Java project ends up creating its own Unsafe based memory manager)
@AshwinJay We aren't creating yet another Unsafe-based memory manager; instead, we are borrowing it from the DataSketches Memory lib.
@leventov On the discussion regarding cleaner vs using free, here is my take:
Do you see value in having us use cleaner for Memory? Would love to know your thoughts.
@niketh

ByteBuffer.allocate() is used 152 times, allocateDirect() 6 times, Files.map() 9 times.
> We already have a finalize in Memory in case someone forgets to free explicitly. Finalize will be slower, but it is very infrequent.

The effect of moving from finalize() to Cleaner is big. It was N ByteBuffers (Cleaner) + N ResourceHolders (finalize()). It was changed to N ByteBuffers (Cleaner) + N ResourceHolders (Cleaner). If you change to N Memory (finalize()) + N ResourceHolders (Cleaner), I expect the effect will be just the opposite of that shown above.

@niketh already having finalize() is not an argument, IMHO, because changing from finalize() to Cleaner and back is a 100-line change with zero implications for the rest of the codebase.
@leventov Makes sense, We will make Memory use cleaners instead of finalize.
@leventov

- The mutability of MemoryRegion was not worth this risk, so I made it immutable. In looking at where we had used the mutability, it was basically an end-around to not having a relative position, and it was easy to convert these cases to use an immutable MR. @cheddar and I discussed this, and we think that what would be more useful would be a true "positional" parallel interface. It is open for discussion how this would be implemented.
- NativeMemory, offset -> address conversion issues: I am still looking at this.
- Cleaner.

@leerho thanks.
I made some benchmarks, which are similar to Druid's TopN double aggregations: https://github.com/leventov/zero-cost-memory
```
Benchmark                                             (alloc)  Mode  Cnt   Score   Error  Units
ZeroCostMemoryBenchmark.processByteBuffer              direct  avgt    5   4.648 ± 0.106  ns/op
ZeroCostMemoryBenchmark.processByteBuffer                heap  avgt    5  15.783 ± 0.504  ns/op
ZeroCostMemoryBenchmark.processNoOffsetMemory          direct  avgt    5   3.004 ± 0.017  ns/op
ZeroCostMemoryBenchmark.processNoOffsetMemory            heap  avgt    5   2.988 ± 0.092  ns/op
ZeroCostMemoryBenchmark.processNoOffsetTwoImplMemory   direct  avgt    5   2.991 ± 0.088  ns/op
ZeroCostMemoryBenchmark.processNoOffsetTwoImplMemory     heap  avgt    5   3.009 ± 0.138  ns/op
ZeroCostMemoryBenchmark.processTwoFinalOffsetMemory    direct  avgt    5   3.675 ± 0.103  ns/op
ZeroCostMemoryBenchmark.processTwoFinalOffsetMemory      heap  avgt    5   3.700 ± 0.164  ns/op
ZeroCostMemoryBenchmark.processTwoOffsetMemory         direct  avgt    5   3.696 ± 0.116  ns/op
ZeroCostMemoryBenchmark.processTwoOffsetMemory           heap  avgt    5   3.704 ± 0.130  ns/op
```
Quick takeaways:
Will investigate assembly and verify results later.
@leerho
One-offset memory is as good as no-offset:

```
ZeroCostMemoryBenchmark.processOneOffsetMemory         direct  avgt    5   3.008 ± 0.087  ns/op
ZeroCostMemoryBenchmark.processOneOffsetMemory           heap  avgt    5   3.017 ± 0.165  ns/op
```
Given this, and that moving from 0-indexed ByteBuffers to no-offset (start()-indexed) Memory is risky because it requires carefully changing a lot of code, while moving to 0-indexed Memory is very straightforward (almost find-and-replace), I suggest implementing Memory with a single internal offset field.
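To make the suggestion concrete, here is a minimal sketch (all names hypothetical, heap ByteBuffer standing in for the real backing storage) of a 0-indexed Memory with a single internal offset field. The caller always uses 0-based indexes, and the constant offset is added once per access, which the benchmark above shows costs nothing extra over a no-offset implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: a 0-indexed Memory with one internal offset field.
final class OneOffsetMemory {
    private final ByteBuffer buf;  // backing storage (heap or direct)
    private final long offset;     // the single internal offset, in bytes

    OneOffsetMemory(ByteBuffer buf, long offset) {
        this.buf = buf.order(ByteOrder.nativeOrder());
        this.offset = offset;
    }

    long getLong(long index) {
        // absolute read; the JIT can fold the constant offset into addressing
        return buf.getLong((int) (offset + index));
    }

    void putLong(long index, long value) {
        buf.putLong((int) (offset + index), value);
    }

    // A region is just another view with a shifted offset; no position/limit
    // state is carried, matching the immutable-Memory design goal.
    OneOffsetMemory region(long start) {
        return new OneOffsetMemory(buf, offset + start);
    }
}
```

Because a region is only a new view with a shifted offset, slicing allocates one small object and no copies.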
Interesting. I'm ready to turn in for the night, but I did a lot of thinking about this over the weekend. I also have some benchmarks I have created. I need to do some updates and will share them with you tomorrow.
@leventov
I (think) I am done with the first phase of changes to Memory.
The next thing I want to look at is simplifying the MR so that it extends NM rather than Memory. This may eliminate the need for MR or most of its code anyway. I need to figure out a way to make this change backward compatible. Not sure how yet. Thoughts?
The other more radical thought I had would be to convert Memory into an abstract class that NM extends rather than implements. This would be a huge change that would definitely break Backward Compatibility. It also reduces future flexibility as it becomes a single inheritance tree. I really don't like this idea. Thoughts?
@leerho let's focus on the interface first, internals could be changed later.
1.
The other more radical thought I had would be to convert Memory into an abstract class that NM extends rather than implements. This would be a huge change that would definitely break Backward Compatibility. It also reduces future flexibility as it becomes a single inheritance tree. I really don't like this idea. Thoughts?
I actually like the idea and suggest just doing this. The static factory methods can live in Memory itself, instead of a separate Memories class.
2. putUtf8(String) in Memory: can do that right in Memory, instead of a separate MemoryUtils and calling it from each Memory implementation.
So there is an abstract class Memory with static factory methods for creating different kinds of Memory; all other classes in the com.yahoo.memory package are package-private.
About breaking compat: IMO it's better to break compat early, by hiding all constructors behind static factory methods, which allows great flexibility later (adding/removing implementations), than to realize that we want to break compat later, after Druid is refactored to use Memory.
Make Memory Closeable or AutoCloseable to allow try-with-resources.
Move PositionalMemoryRegion, suggested by @niketh in #3915, to the same package.
About implementations
The next thing I want to look at is simplifying the MR so that it extends NM rather than Memory. This may eliminate the need for MR or most of its code anyway. I need to figure out a way make this change backward compatible. Not sure how yet. Thoughts?
Since MemoryRegion is unmodifiable now, and there will be PositionalMemoryRegion, I don't see the point of MemoryRegion anymore.
Actually I think there are too many Memory implementations. I think everything could be implemented using just one: NativeMemory. There could be a ReadOnlyMemory super-interface of Memory, and to obtain a "read-only view" of a Memory, we just cast it to ReadOnlyMemory.
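The "read-only view by cast" idea can be sketched as follows (names are illustrative, and a heap array stands in for native memory): ReadOnlyMemory declares only getters, Memory adds setters, and the read-only view is a zero-cost upcast with no wrapper object allocated.

```java
// Sketch: read-only view obtained by upcast rather than by a wrapper object.
interface ReadOnlyMemory {
    long getLong(long offsetBytes);
}

interface Memory extends ReadOnlyMemory {
    void putLong(long offsetBytes, long value);
}

// A single writable implementation serves both interfaces.
final class NativeMemory implements Memory {
    private final long[] data;  // heap-backed stand-in for native memory

    NativeMemory(int capacityLongs) { data = new long[capacityLongs]; }

    public long getLong(long offsetBytes) {
        return data[(int) (offsetBytes >>> 3)];
    }

    public void putLong(long offsetBytes, long value) {
        data[(int) (offsetBytes >>> 3)] = value;
    }
}
```

Passing a NativeMemory to a method declared to take ReadOnlyMemory upcasts it automatically; the callee cannot write without an explicit (and Checkstyle-detectable) downcast.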
The memory architecture kind of grew organically from something quite simple, all the while trying to add functionality while maintaining backward compatibility. It is admittedly kind of unwieldy now and needs a complete refresh. There are some other urgent reasons I need to publish a new release of the library and the complete reworking of the memory module will take some time. So I will likely do a new release, then we can work together on what the new Memory architecture and interfaces need to look like. Let me put some thoughts into a new architecture and I'll pass it by you.
I have thought about adding Strings, but I have resisted because it is the start of a slippery slope. Strings could be done now by converting the string to a byte[] or char[] first using the encoding of your choice.
@leerho
The memory architecture kind of grew organically from something quite simple, all the while trying to add functionality while maintaining backward compatibility. It is admittedly kind of unwieldy now and needs a complete refresh. There are some other urgent reasons I need to publish a new release of the library and the complete reworking of the memory module will take some time. So I will likely do a new release, then we can work together on what the new Memory architecture and interfaces need to look like. Let me put some thoughts into a new architecture and I'll pass it by you.
Sure, there is no hurry to release the new API.
I have thought about adding Strings, but I have resisted because it is the start of a slippery slope. Strings could be done now by converting the string to a byte[] or char[] first using the encoding of your choice.
Yes, it is possible, but my heart bleeds when I see how inefficient it is: 3-4 unnecessary data copies, creating a lot of garbage objects, ThreadLocal access, comparison of unrelated Strings. And it is done not only during segment construction; it could be done thousands of times in the Druid historical query path, if the query uses DimensionSelector.lookupName() to build a BitSet.
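For comparison, a putUtf8-style method could encode characters straight into the destination with no intermediate byte[] or char[] and no garbage. A minimal sketch (hypothetical helper, heap byte[] standing in for Memory; surrogate pairs above U+FFFF are not handled here):

```java
// Sketch: encode a String's chars directly into a byte[] as UTF-8.
final class Utf8Util {
    static int putUtf8(byte[] mem, int offset, String s) {
        int pos = offset;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {            // 1-byte ASCII
                mem[pos++] = (byte) c;
            } else if (c < 0x800) {    // 2-byte sequence
                mem[pos++] = (byte) (0xC0 | (c >>> 6));
                mem[pos++] = (byte) (0x80 | (c & 0x3F));
            } else {                   // 3-byte sequence (BMP, non-surrogate)
                mem[pos++] = (byte) (0xE0 | (c >>> 12));
                mem[pos++] = (byte) (0x80 | ((c >>> 6) & 0x3F));
                mem[pos++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return pos - offset; // number of bytes written
    }
}
```

The only per-call work is the encoding loop itself; there is no String.getBytes() allocation and no ThreadLocal encoder lookup.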
Here is my API proposition:
abstract class Memory implements AutoCloseable {
static WritableMemory allocate(long size);
static WritableMemory allocateDirect(long size);
static Memory wrap(ByteBuffer);
static WritableMemory wrapForWrite(ByteBuffer);
static Memory map(File);
static Memory map(File, long pos, long len);
static WritableMemory mapForWrite(File);
static WritableMemory mapForWrite(File, long pos, long len);
PositionedMemoryRegion region(long start, long limit);
ByteBuffer toReadOnlyByteBuffer();
.. get methods
void close();
}
abstract class WritableMemory extends Memory {
WritablePositionedMemoryRegion regionForWrite(long start, long limit);
.. put methods
}
interface PositionedMemoryRegion extends AutoCloseable {
Memory memory();
long position();
PositionedMemoryRegion position(long position);
.. get methods without offset parameter
void close();
}
interface WritablePositionedMemoryRegion extends PositionedMemoryRegion {
WritableMemory memory();
.. put methods without offset parameter
}
@leventov
This is very similar to what I was thinking. I am working out the low level details of what offset & address values to keep and where in the hierarchy.
Some subtle differences: you seem to prefer having two primary classes, a "read-only" Memory and a WritableMemory. But Memory wrapForWrite(ByteBuffer) seems inconsistent, and also would throw an exception if the given BB is read-only. I would prefer having one public Memory from which one could create the read-only and writable versions. In the case of BB, we could have:
static Memory wrap(ByteBuffer); //results in a RO `Memory` if ByteBuffer is RO.
static Memory wrapForReadOnly(ByteBuffer); //results in a RO `Memory` regardless of the state of the ByteBuffer.
We would have to do something similar for mapped files.
Question about AutoCloseable: Other than being recognized by try-with-resources construct, are there any other JVM costs associated with this? Cleaner keeps a linked-list of all Cleaners, for example. More specifically, are there any hidden costs associated with having the heap-associated parts of the Memory hierarchy AutoCloseable, when they clearly don't need it?
Question about "positional" capabilities: do we need all the capabilities now present in Buffer? Or a subset? I think we might regret doing a subset.
@leerho
Separation between "readable" and "writable" interfaces has proven to be a good move. Most recently designed (buffer) APIs have it: e. g. Chronicle Bytes has ReadCommon and WriteCommon, and Agrona has DirectBuffer and MutableDirectBuffer. Also a bit unrelated, but Kotlin has separate read-only List, Set, Map, Collection and MutableList, MutableSet, MutableMap, MutableCollection interfaces.
This is like the const modifier in C++. Without separation, the contracts of methods that take buffers as parameters are always unclear: could this method modify the buffer or not? If not, how do we protect against unintentional modification? This spawns a lot of unnecessary .asReadOnlyBuffer() calls in the Druid source code now, out of "fear" or for "safety". If e. g. ObjectStrategy.fromByteBuffer() accepts a read-only Memory, it makes the contract clear, and no "asReadOnly" conversion is needed on the boundaries: the Java compiler and type system check everything for us. We should avoid just two things: explicit casts to WritableMemory in our code, and using Memory as a generic type argument. Both could be enforced using Checkstyle rules.
I also took from Kotlin the idea of giving short names to the read-only interfaces and static factory methods, and longer names to the writable versions. It favors using read-only when possible.
Question about AutoCloseable: Other than being recognized by try-with-resources construct, are there any other JVM costs associated with this? Cleaner keeps a linked-list of all Cleaners, for example. More specifically, are there any hidden costs associated with having the heap-associated parts of the Memory hierarchy AutoCloseable, when they clearly don't need it?
No, there is no penalty. AutoCloseable merely allows the special try-with-resources syntax, similarly to how Iterable allows the for-each loop.
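To illustrate the point: implementing AutoCloseable adds no per-object tracking by the JVM (unlike Cleaner, which keeps registered cleaners in a linked list); it only makes the class usable in try-with-resources. A toy sketch with a hypothetical resource class:

```java
// Toy resource: AutoCloseable is just an interface with close(); the JVM
// does not register or track instances anywhere.
final class ScratchMemory implements AutoCloseable {
    boolean closed = false;

    long read() {
        if (closed) throw new IllegalStateException("already closed");
        return 42L;
    }

    @Override
    public void close() { closed = true; }  // deterministic release
}
```

Usage is then `try (ScratchMemory m = new ScratchMemory()) { ... }`, with close() invoked automatically at the end of the block even if an exception is thrown.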
Question about "positional" capabilities: Do we need all the capabilities now present in Buffer? Or a subset? I think we might regret doing a subset.
I think we don't know the answer to this question until we actually try to migrate Druid's code. @niketh needed it right away in the first PR: #3915
@leventov
Some ideas:
All classes are package-private except where noted. Thus, only two public classes.
The classes noted as "Thought abstractions" may not be actual classes, as they are very small, with only a constructor and 2-3 instance variables. They also might appear as simple interfaces. For example, the only difference between DirectR and DirectW is the parent class that they extend. Indentation indicates "extends".
BaseMemory // Thought abstraction
  public Memory //has "absolute" Read-Only methods and launches the rest using factory methods
    DirectR // Thought abstraction
      MemoryDR //Requires AutoCloseable and Cleaner
      ByteBufDR
      MapDR //Requires AutoCloseable and Cleaner
    HeapR // Thought abstraction
      MemoryHR
      ByteBufHR
    MemoryW //has the "absolute" W methods
      DirectW // Thought abstraction
        MemoryDW //Requires AutoCloseable and Cleaner
        ByteBufDW
        MapDW //Requires AutoCloseable and Cleaner
      HeapW // Thought abstraction
        MemoryW
        ByteBufW
  public PositionalMemory //has positional "Buffer" logic & variables, positional RO methods, and launches the rest
    DirectPR // Thought abstraction
      MemoryPDR //Requires AutoCloseable and Cleaner
      ByteBufPDR
      MapPDR //Requires AutoCloseable and Cleaner
    HeapPR // Thought abstraction
      MemoryPHR
      ByteBufPHR
    MemoryPW //positional W methods
      DirectPW // Thought abstraction
        MemoryPDW //Requires AutoCloseable and Cleaner
        ByteBufPDW
        MapPDW //Requires AutoCloseable and Cleaner
      HeapPW // Thought abstraction
        MemoryPW
        ByteBufPW
@leventov
I think you misunderstood my previous comment. I have no problem with "separation". My concern was
But the Memory wrapForWrite(ByteBuffer) seems inconsistent and also would throw an exception if the given BB is read-only.
@leventov
Go to experimental to look at an early "sketch" of what I'm thinking at the moment. The hierarchical outline is in package-info.
@leerho
But the Memory wrapForWrite(ByteBuffer) seems inconsistent and also would throw an exception if the given BB is read-only.
Inconsistent with what? Yes, it should throw an exception if the given BB is read-only. Why is this bad?
Generally I think wrap(BB) might not even be needed at all. It would only be needed on the boundaries with third-party libraries which use ByteBuffer. Even if it is needed, it is going to be used in very few places.
@leventov
DataSketches will need wrap(BB) as there are many other users outside of Druid who have BB.
My comment on inconsistent was a misinterpretation of what you were trying to do. Never mind :)
In experimental I have the first skeleton round-trip working: reading and writing longs to direct memory. Take a look.
@leerho
I don't see your arguments against the idea of separating read-only and writable at the type-system level, by making two separate interfaces, rather than at the runtime level, by having implementations which throw UnsupportedOperationException.
My arguments for this idea are:
- Clear contracts for methods such as ObjectStrategy.read().
- No explicit asReadOnly() conversion on the boundaries; there is an automatic upcast to the read-only interface.
@leventov
OK. You are right. I thought I would be able to save some code duplication of the R/W methods, but there will be the same amount of code duplication either way. The current scheme requires 1 read method and 2 write methods (one throws). Full splitting at the public level will require 2 reads and 1 write.
I don't see the 2X reduction in the number of classes. Most of the classes I have at the leaf level (MemoryDR, etc.) can be consolidated, as the sister leaves only differ by the factory method. I split them up just for my own clarity while laying it out.
I will try it your way and see what happens :)
@leventov
Checked in the totally split skeleton into experimental. Again, I don't see a 2X difference in the number of classes just due to this change.
Why did you do the following:
static Memory wrapForWrite(ByteBuffer);
I would think you would want the following, which is simpler:
static WritableMemory wrap(ByteBuffer);
The same applies to map.
@leerho thank you!
Why did you do the following:
static Memory wrapForWrite(ByteBuffer);
I would think you would want the following, which is simpler:
static WritableMemory wrap(ByteBuffer);
It's a mistake in my comment (now fixed); yes, definitely it should return the writable interface.
Mostly LGTM, a couple of comments:
- Memory allocate() doesn't make sense: one could only read from just-allocated memory, so allocation should return WritableMemory.
- Using Memory instead of DataInput and DataOutput: for this, it seems to me that Memory should be stateless (no fields) and have a public constructor. Essentially, it should be like a Java 8 interface with static factory methods and default implementations of some methods. We use an abstract class just to overcome the artificial limitations of interfaces in Java 7.
- Also, I would really like to hear @weijietong's feedback on the interfaces that we are designing.
@leerho regarding 2X fewer classes: I meant that the separation between read-only and writable implementations is not needed. There are only writable implementations. The only thing that e. g. map() does is delegate to mapWritable() and return the read-only interface. It is safe at the type-system level with very few precautions. Your approach with runtime separation provides extra safety; if you want to keep this safety, you either still have 2X classes, or e. g. have a boolean "writable" flag and check it guarded by an assertion, e. g. in the checkOffset() method, along with the offset check.
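That last scheme can be sketched as follows (all names hypothetical): a single writable implementation, a read-only factory that delegates to the writable one and upcasts, and a writable flag checked only under assertions, alongside the offset check, so it costs nothing when assertions are disabled.

```java
// Read-only interface: getters only.
interface ROMemory {
    long getLong(long offsetBytes);
}

// The single (writable) implementation; read-only is a flag, checked
// only when assertions are enabled.
final class WMemory implements ROMemory {
    private final long[] data;       // heap stand-in for native memory
    private final boolean writable;  // checked only under assertions

    private WMemory(long[] data, boolean writable) {
        this.data = data;
        this.writable = writable;
    }

    static WMemory allocateWritable(int longs) {
        return new WMemory(new long[longs], true);
    }

    // Read-only factory: delegates to the same implementation, upcasts.
    static ROMemory wrapReadOnly(long[] data) {
        return new WMemory(data, false);
    }

    // Returns true so it can be wrapped in `assert` and stripped with -da.
    private boolean checkOffset(long offsetBytes, boolean forWrite) {
        assert offsetBytes >= 0 && (offsetBytes >>> 3) < data.length : "out of bounds";
        assert !forWrite || writable : "write to read-only memory";
        return true;
    }

    public long getLong(long offsetBytes) {
        assert checkOffset(offsetBytes, false);
        return data[(int) (offsetBytes >>> 3)];
    }

    public void putLong(long offsetBytes, long value) {
        assert checkOffset(offsetBytes, true);
        data[(int) (offsetBytes >>> 3)] = value;
    }
}
```

With assertions off, both checks vanish entirely; with -ea, a write through an accidentally downcast read-only instance fails fast.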
The goal of this issue is to agree on basic design principles and expectations from this refactoring.
Goals
- No mandatory bound checks on every access, which ByteBuffer always performs when making any read or write. Bound checks could be hoisted outside the hot loops manually, or enabled via assertions/a static final boolean flag, which is effectively free with some precautions.
- Support for long indexes. See #3743.
- No garbage created when a Memory object has to be "sliced", "converted to read-only", etc. See https://github.com/druid-io/druid/pull/3716#issuecomment-274658045.
Design
- It is moved to the Druid source tree, not used as a dependency. See https://github.com/druid-io/druid/issues/3892#issuecomment-276589924
- A Memory object is immutable. "Position and limit", if needed, are passed along with Memory as two primitive longs.
- A Memory object has a cached immutable "view" object (which implements read methods and throws exceptions from write methods); this object is always returned when "as read only" is needed.
- Calling close() or free() on Memory is possible, but not strictly required; there is a sun.misc.Cleaner safety net, as in DirectByteBuffer.
- While conversion from ByteBuffer to Memory is in progress (not expected to be done all at once, but subsystem by subsystem, class by class; see https://github.com/druid-io/druid/issues/3892#issuecomment-276548114), and also when we need to interop with external libraries which require ByteBuffer, conversion from Memory to ByteBuffer and vice versa is possible. Likely this requires a DirectByteBuffer-compatible format of the Cleaner inside Memory, and calling DirectByteBuffer constructors via MagicAccessorImpl.
- Memory's bounds are checked optionally via assertions/guarded by a static final boolean flag; "local" position and limit are checked explicitly and manually, with helper methods or versions of the read and write methods of Memory which accept read/write position and "local" limits. https://github.com/druid-io/druid/issues/3892#issuecomment-276185704
Objections, additions, corrections, questions are welcome. @leerho @cheddar @weijietong @akashdw @himanshug @fjy @niketh