apache / incubator-baremaps

Create custom vector tiles from OpenStreetMap and other data sources with PostGIS and Java.
baremaps.apache.org
Apache License 2.0

Bugfix: key out of boundary for MemoryAlignedDataMap #880

Closed yagagagaga closed 1 month ago

yagagagaga commented 1 month ago
sonarcloud[bot] commented 1 month ago

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
1.0% Duplication on New Code

See analysis details on SonarCloud

bchapuis commented 1 month ago

@yagagagaga Thanks a lot, this is a good contribution. This part of the codebase really deserves more checks and tests.

Regarding the HashMap, this is what I used initially. However, when dealing with very large datasets (several billion records), the cost of computing the hash code when accessing the HashMap is noticeable compared to the direct memory access of an array. The size of the array can easily be mitigated by increasing the size of the segments. Unfortunately, I had to remove the JMH benchmarks due to license compatibility requirements (JMH is released under the GPL). I would keep the array for now and reintroduce some sort of benchmark if we need to use another data structure.
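To illustrate the access-cost argument (this is only a rough standalone sketch with made-up names, not a substitute for proper JMH benchmarks, which need warm-up, forking, and dead-code-elimination safeguards):

```java
import java.util.HashMap;

// Rough illustration of direct array indexing vs. HashMap lookup cost.
// Class and method names are invented for this sketch.
public class AccessCost {
    static long sumArray(long[] array) {
        long sum = 0;
        for (int i = 0; i < array.length; i++) sum += array[i]; // direct indexed access
        return sum;
    }

    static long sumMap(HashMap<Long, Long> map, int n) {
        long sum = 0;
        for (long i = 0; i < n; i++) sum += map.get(i); // hashCode + bucket probe + unboxing
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long[] array = new long[n];
        HashMap<Long, Long> map = new HashMap<>(2 * n);
        for (int i = 0; i < n; i++) { array[i] = i; map.put((long) i, (long) i); }

        long t0 = System.nanoTime();
        long s1 = sumArray(array);
        long t1 = System.nanoTime();
        long s2 = sumMap(map, n);
        long t2 = System.nanoTime();
        System.out.printf("array: %d ns, hashmap: %d ns (sums %d / %d)%n",
            t1 - t0, t2 - t1, s1, s2);
    }
}
```

On a hot path touching billions of keys, the boxing and hashing overhead of `HashMap<Long, Long>` adds up, which is the motivation for the segment-backed array.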

To format the code, you need to execute ./mvnw spotless:apply. Let me know if you need further assistance.

yagagagaga commented 1 month ago

Thanks for your reply. I have reformatted the code and rolled back Memory.java.

bchapuis commented 1 month ago

Thank you, could you elaborate a bit more on the computation of the upperBoundary? I'm probably missing something, but I would naively implement it as follows, which does not give exactly the same result as your implementation.

this.upperBoundary = ((long) memory.segmentSize()) * ((long) Integer.MAX_VALUE) / (long) dataType.size();
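For reference, the two orderings of this formula already disagree because of integer truncation. A standalone arithmetic check, with a segment size of 1024 bytes and 4-byte values hard-coded (the sizes used in the examples in this thread):

```java
// Multiply-first vs. divide-first evaluation of the same upperBoundary formula.
public class BoundaryOrder {
    public static void main(String[] args) {
        long segmentSize = 1024;
        long valueSize = 4; // e.g. an int value
        long a = segmentSize * (long) Integer.MAX_VALUE / valueSize; // multiply first
        long b = (long) Integer.MAX_VALUE / valueSize * segmentSize; // divide first
        // Integer.MAX_VALUE / 4 truncates the remainder, so b < a.
        System.out.println(a); // 549755813632
        System.out.println(b); // 549755812864
    }
}
```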
yagagagaga commented 1 month ago

As we all know, long is a 64-bit data type: 63 bits represent the numerical value and 1 bit represents the sign. The key in MemoryAlignedDataMap needs to account for the following limitations:

  1. The value's data size. If a value occupies 4 bytes, 2 bits (of the 63) are needed to represent it.
  2. The calculation of segmentIndex depends on segmentSize. If segmentSize is 1024, 10 bits (of the remaining 61) are needed to represent the offset within a segment, and the remaining 51 bits represent the number of segments. Since segmentIndex is an Integer field, precision is lost when segmentIndex requires more than 32 bits. There are two cases to consider:
    1. segmentIndex will not lose precision if segmentSize is greater than 32 bits.
      value data size = 2-bits                   segmentSize > 32-bits          
       /\                            /~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~\
      0111111111111111111111111111111111111111111111111111111111111111
      ^  \__________________________/
      |               v
      |       segmentIndex <= 30-bits
      |  \___________________________________________________________/
      |                            v
      -/+                       key's range
    2. segmentIndex will lose precision if segmentSize is less than 32 bits. The range of segmentIndex must be limited so that it can be represented as an integer.
                                 segmentIndex = 31-bits    segmentSize = 10-bits         
                             /~~~~~~~~~~~~~~~^~~~~~~~~~~~~~\/~~~~^~~~\
      0111111111111111111111111111111111111111111111111111111111111111
      ^                      \/
      |             value data size = 2-bits  
      |                        \_____________________________________/
      |                                         v
      -/+                                   key's range
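The bit budget above can be checked with a standalone sketch of the key decomposition, assuming a segment size of 1024 bytes and 4-byte values (hypothetical names, not the project's classes):

```java
// Decompose a key into segmentIndex and segmentOffset, per the bit layout above.
public class KeyLayout {
    public static void main(String[] args) {
        long segmentSize = 1024;
        long valueSize = 4;
        long valuesPerSegment = segmentSize / valueSize; // 256 values per segment
        long key = 549755813887L; // 2^39 - 1, the example key used in this thread
        long segmentIndex = key / valuesPerSegment;  // 2147483647 == Integer.MAX_VALUE, still fits an int
        long segmentOffset = key % valuesPerSegment * valueSize; // 1020, byte offset within the segment
        System.out.println(segmentIndex + " " + segmentOffset);
    }
}
```

So with these sizes, keys up to 2^39 - 1 map to a segmentIndex that is still representable as an int, even though the divide-first formula would reject them.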

Following your implementation, errors would occur:

// this.upperBoundary = ((long) Integer.MAX_VALUE) / (long) dataType.size() * ((long) memory.segmentSize());
var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1024));
map.put(549755813887L, 1); // an error occurs, even though segmentIndex and segmentOffset can in fact be computed for this key

var map2 = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1L << 34)); // if this were allowed, the upperBoundary would overflow numerically
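One possible way to handle both failure cases is sketched below. This is only an illustration under the assumptions stated in the comments, not necessarily the PR's actual fix; the names and structure are invented:

```java
// Hypothetical sketch: an exclusive upper bound on keys that avoids both
// the int overflow of segmentIndex and the long overflow of the product.
public class UpperBoundary {
    // Assumes segmentSize is a multiple of valueSize, and that segmentIndex
    // must fit in a non-negative int (2^31 addressable segments).
    static long upperBoundary(long segmentSize, long valueSize) {
        long valuesPerSegment = segmentSize / valueSize;
        long maxSegments = Integer.MAX_VALUE + 1L; // 2^31
        if (valuesPerSegment > Long.MAX_VALUE / maxSegments) {
            return Long.MAX_VALUE; // segments so large that the long key itself is the limit
        }
        return maxSegments * valuesPerSegment;
    }

    public static void main(String[] args) {
        System.out.println(upperBoundary(1024, 4));     // 549755813888: key 549755813887L is accepted
        System.out.println(upperBoundary(1L << 34, 4)); // capped at Long.MAX_VALUE instead of overflowing
    }
}
```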
bchapuis commented 1 month ago

Thank you for the detailed explanation and the fix. As said, this is typically the kind of contribution we need to improve the robustness of the project. Do not hesitate to tell us more about your use case and reach out if you have questions, it would be a pleasure to collaborate!

CalvinKirs commented 1 month ago

@yagagagaga Thanks a lot, this is a good contribution. This part of the codebase really deserves more checks and tests.

Regarding the HashMap, this is what I used initially. However, when dealing with very large datasets (several billion records), the cost of computing the hash code when accessing the HashMap is noticeable compared to the direct memory access of an array. The size of the array can easily be mitigated by increasing the size of segments. Unfortunately, I had to remove the JMH benchmarks due to license compatibility requirements (JMH is released under the GPL). I would keep the array for now and reintroduce some sort of benchmarks if we need to use another data structure

Hi @bchapuis We can definitely revert this change (re-add the benchmarking code module). I believe we were just using its API without copying its source code into our project, and we won't be distributing the JMH-related jars in the binary version. https://www.apache.org/legal/resolved.html#optional https://www.apache.org/legal/resolved.html#prohibited

I looked into it, and sure enough, there's some discussion on this: https://issues.apache.org/jira/browse/LEGAL-399

bchapuis commented 1 month ago

@CalvinKirs Thanks a lot for this pointer. It's good to know that JMH can be used (I acted as a lumberjack in the previous release to ensure that we are compliant with the license ;))