Closed: yagagagaga closed this 1 month ago

Quality gate: 3 new issues, 0 accepted issues, 0 security hotspots, no coverage data, 1.0% duplication on new code.
@yagagagaga Thanks a lot, this is a good contribution. This part of the codebase really deserves more checks and tests.
Regarding the HashMap, this is what I used initially. However, when dealing with very large datasets (several billion records), the cost of computing the hash code when accessing the HashMap is noticeable compared to the direct memory access of an array. The size of the array can easily be mitigated by increasing the size of segments. Unfortunately, I had to remove the JMH benchmarks due to license compatibility requirements (JMH is released under the GPL). I would keep the array for now and reintroduce some form of benchmark if we need to use another data structure.
To format the code, you need to execute ./mvnw spotless:apply. Let me know if you need further assistance.
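To illustrate the trade-off described above, here is a minimal, self-contained sketch (hypothetical names, not the actual project API): a segmented array addressed by plain arithmetic versus a HashMap that hashes every key. Both return the same values; the array variant avoids computing hashCode and chasing hash buckets on each lookup, and memory footprint is controlled by the segment size.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Toy sketch of the trade-off discussed above (illustrative names, not the real API):
// a segmented array is addressed by plain arithmetic, a HashMap hashes every key.
public class SegmentedArrayVsHashMap {
    static final int SEGMENT_SIZE = 1 << 10; // 1024 longs per segment

    final List<long[]> segments = new ArrayList<>();

    void put(long key, long value) {
        int segmentIndex = (int) (key / SEGMENT_SIZE);
        int segmentOffset = (int) (key % SEGMENT_SIZE);
        while (segments.size() <= segmentIndex) {
            segments.add(new long[SEGMENT_SIZE]); // grow lazily, one segment at a time
        }
        segments.get(segmentIndex)[segmentOffset] = value;
    }

    long get(long key) {
        // Direct memory access: two divisions and an array load, no hashing.
        return segments.get((int) (key / SEGMENT_SIZE))[(int) (key % SEGMENT_SIZE)];
    }

    public static void main(String[] args) {
        SegmentedArrayVsHashMap arrayMap = new SegmentedArrayVsHashMap();
        HashMap<Long, Long> hashMap = new HashMap<>();
        for (long key = 0; key < 10_000; key += 7) {
            arrayMap.put(key, key * 2);
            hashMap.put(key, key * 2);
        }
        // Both structures agree; the array variant skips hashCode/equals per lookup.
        for (long key = 0; key < 10_000; key += 7) {
            if (arrayMap.get(key) != hashMap.get(key)) {
                throw new AssertionError("mismatch at " + key);
            }
        }
        System.out.println("ok");
    }
}
```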
Thanks for your reply. I have reformatted the code and rolled back Memory.java.
Thank you. Could you elaborate a bit more on the computation of the upperBoundary? I'm probably missing something, but I would naively implement it as follows, which does not give exactly the same result as your implementation.
this.upperBoundary = ((long) memory.segmentSize()) * ((long) Integer.MAX_VALUE) / (long) dataType.size();
As we all know, Long is a 64-bit data type: 63 bits represent the numerical value and 1 bit represents the sign. The key in MemoryAlignedDataMap needs to consider the following limitations:

- The value's data size. If the value occupies 4 bytes, 2 bits (of the 63) are needed to represent it.
- The segmentIndex depends on the segmentSize. If the segmentSize is 1024, 10 bits (of the remaining 61) are needed to represent the offset within a segment, and the remaining 51 bits represent the number of segments. Since segmentIndex is an Integer field, precision will be lost when the segmentIndex is wider than 32 bits. There are two cases to discuss:

The segmentIndex will not lose precision if the segmentSize is wider than 32 bits.
value data size = 2-bits segmentSize > 32-bits
/\ /~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~\
0111111111111111111111111111111111111111111111111111111111111111
^ \__________________________/
| v
| segmentIndex <= 30-bits
| \___________________________________________________________/
| v
-/+ key's range
The segmentIndex will lose precision if the segmentSize is narrower than 32 bits. The range of the segmentIndex must then be limited so that it can be represented as an integer.
segmentIndex = 31-bits segmentSize = 10-bits
/~~~~~~~~~~~~~~~^~~~~~~~~~~~~~\/~~~~^~~~\
0111111111111111111111111111111111111111111111111111111111111111
^ \/
| value data size = 2-bits
| \_____________________________________/
| v
-/+ key's range
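The bit budget above can be made concrete with a small sketch (constants taken from the diagrams: 4-byte values, 1024-byte segments; names are illustrative, not the actual implementation):

```java
// Sketch of the key bit budget described above (illustrative names, not the real API).
public class KeyBudgetDemo {

    // Largest key whose segmentIndex still fits in a signed 32-bit int.
    static long maxKey(long segmentSize, long valueSize) {
        long valuesPerSegment = segmentSize / valueSize; // 256 values per 1024-byte segment
        return valuesPerSegment * (1L << 31) - 1;
    }

    static long segmentIndex(long key, long segmentSize, long valueSize) {
        return key * valueSize / segmentSize; // byte position divided by segment size
    }

    static long segmentOffset(long key, long segmentSize, long valueSize) {
        return key * valueSize % segmentSize; // byte offset within the segment
    }

    public static void main(String[] args) {
        long key = maxKey(1024, 4);                      // 549755813887
        System.out.println(key);
        System.out.println(segmentIndex(key, 1024, 4));  // 2147483647 == Integer.MAX_VALUE
        System.out.println(segmentOffset(key, 1024, 4)); // 1020
    }
}
```

Note that 549755813887 is exactly the key used in the error example later in this thread: its segmentIndex lands precisely on Integer.MAX_VALUE, so it is the last addressable key under this layout.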
If we follow your implementation, some errors will happen:
// this.upperBoundary = ((long) Integer.MAX_VALUE) / (long) dataType.size() * ((long) memory.segmentSize());
var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1024));
map.put(549755813887L, 1); // an error will happen here, even though segmentIndex and segmentOffset can in fact still be calculated for this key
var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1L << 34)); // if this were possible, the upperBoundary would overflow numerically
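Both failure modes can be checked with plain arithmetic; here is a sketch using the constants from the examples above (helper names are illustrative):

```java
import java.math.BigInteger;

// Sketch checking the two failure modes discussed above (illustrative names).
public class UpperBoundaryDemo {

    // Dividing before multiplying truncates early.
    static long divideFirst(long segmentSize, long typeSize) {
        return (long) Integer.MAX_VALUE / typeSize * segmentSize;
    }

    // Dividing last truncates less, but still undershoots.
    static long divideLast(long segmentSize, long typeSize) {
        return segmentSize * (long) Integer.MAX_VALUE / typeSize;
    }

    public static void main(String[] args) {
        // This key is still addressable (its segmentIndex is exactly Integer.MAX_VALUE),
        // yet both formulas below would reject it:
        long validKey = 549755813887L;
        System.out.println(divideFirst(1024, 4)); // 549755812864 < validKey
        System.out.println(divideLast(1024, 4));  // 549755813632 < validKey

        // With a hypothetical 2^34-byte segment, the intermediate product
        // segmentSize * Integer.MAX_VALUE needs 65 bits and no longer fits in a long:
        BigInteger product = BigInteger.valueOf(1L << 34)
            .multiply(BigInteger.valueOf(Integer.MAX_VALUE));
        System.out.println(product.bitLength()); // 65
    }
}
```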
Thank you for the detailed explanation and the fix. As I said, this is exactly the kind of contribution we need to improve the robustness of the project. Do not hesitate to tell us more about your use case and reach out if you have questions, it would be a pleasure to collaborate!
Hi @bchapuis We can definitely revert this change (re-add the benchmarking code module). I believe we were just using its API without copying its source code into our project, and we won't be distributing the JMH-related jars in the binary version. https://www.apache.org/legal/resolved.html#optional https://www.apache.org/legal/resolved.html#prohibited
I looked into it, and sure enough, there's some discussion on this: https://issues.apache.org/jira/browse/LEGAL-399
@CalvinKirs Thanks a lot for this pointer. It's good to know that JMH can be used (I acted as a lumberjack in the previous release to ensure that we are compliant with the license ;)