[Java] optimize map performance for reference tracking and serializer dispatch

chaokunyang commented 7 months ago

Is your feature request related to a problem? Please describe.

This issue is to continue the performance optimization discussion of map in https://github.com/apache/incubator-fury/issues/1274#issuecomment-1997259888

Additional context

Tha hash lookup is a time-consuming operations. Especially after F applied codegen speedup, the cost of hash look for class dispatch or reference tracking is considerable, and sometimes it become the bottleneck of the serialization.

chaokunyang commented 7 months ago

Hi @tommyettinger , let's continue the discussion in https://github.com/apache/incubator-fury/issues/1274#issuecomment-1997259888 .

We used 0.25 as the default loader actor for class serializer dispatch in https://github.com/apache/incubator-fury/blob/main/java/fury-core/src/main/java/org/apache/fury/resolver/ClassResolver.java#L201 . But for reference tracking, we used 0.51f, since the object graph may be huge, a smaller loader factor will use more memory, and then map is big, there may have some L1 cache miss. But I thinks we can change it to 0.5f.

For class serializer dispatch, some times some class are more hot than others, I was also thinking whether is it possible to adjust the map hierarchy to make it faster for hot keys. But not sure whether it will bring big difference.

chaokunyang commented 7 months ago

Just found that fury has already removed the hash multiplication:

  protected int place(K item) {
    return System.identityHashCode(item) & mask;
  }

tommyettinger commented 7 months ago

RE: removed multiplication, good, that was one of the optimizations I wasn't sure had made it into Fury.

I've got a JMH benchmark that can test several different Map implementations at once, but I can't run it at the moment because all 12 logical cores on this laptop are currently devoted to an unrelated statistical test on a random number generator, and that test will run for at least several more hours. I've got the JMH benchmark set up to test HashMap, ObjectObjectMap from jdkgdxds, the corresponding Map type in FastUtil, and the hash-based Map in Koloboke 1.0.0 . I'll see if I can add FuryObjectMap, ~~it just needs to implement Map~~ OK, it doesn't implement Map, so it would need to be run separately, which isn't quite a fair comparison. However, ObjectObjectMap from jdkgdxds is very close. It's possible Koloboke will do well here, which it appeared to do 8 years ago when it last updated, but a lot has changed in the current JDK versions. I'll see if I can test on a large number of Class instances, to have similar conditions to the large object graph scenario.

chaokunyang commented 7 months ago

Looking forward to your benchmark results

tommyettinger commented 7 months ago

I was able to run the benchmark, and I may have an avenue that could improve contains() and get() performance.

Contains

This was testing just contains() with Object keys, some of which were Class instances from java.** packages, and some of which were simple Dummy instances, where Dummy is a class that stores its name and that's it, using identity comparison like Class. The one case where anything performed noticeably better was when hashing the .toString() result of the Object (but .getName() on Class would also work), and I believe that only did well because the String and its hashCode are cached. That didn't do better every time, and some other operations took longer.

The full benchmark data on various operations is available here as a raw .txt: https://github.com/tommyettinger/assorted-benchmarks/blob/529bd6a8e48ceb3fbbb7f0021f990394e1fd3cf6/jmh/Identity_Map_21.txt and here as a nicely-formatted spreadsheet (.fods format from LibreOffice and I think OpenOffice): (direct link) https://github.com/tommyettinger/assorted-benchmarks/raw/2538fcf164e27b3defc368436f8e2799225f75e9/jmh/Identity_Map_Performance_21.fods

The confusing thing is that there seems to be some deoptimization going on in the JVM, since the benchmarks each start about twice as fast as when they actually start recording data. It could be my laptop heating up quickly, but that seems unlikely.

You can run the benchmarks yourself by loading the jmh project in the repo I linked and running the shadowJar task. Then you can use a command like...

java -jar benchmarks.jar "JDKIdentityMapBench" -p impl=JDK_O2O_HASH,JDK_O2O_IDENTITY,JDKGDXDS_HASH,JDKGDXDS_TOSTR,JDKGDXDS_IDENTITY -p size=100,1000,10000 -wi 4 -i 5 -f 1 -w 5 -r 5 -p payloadType=OBJECT_UNIFORM

... which will run the Class-and-Dummy-Object test with the implementations for HashMap, IdentityHashMap, ObjectObjectMap, the new ObjectMapToString, and IdentityObjectMap, on sizes 100, 1000, and 10000, with 4 warmup iterations, 5 actual iterations, and each iteration taking 5 seconds. My apologies for how messy the benchmarking project is... It's mostly a testbed to see if something is worth investigating further.

The catch to the ObjectMapToString is that it needs getName() to always be unique for a Class, and I don't know if that can always be guaranteed. If it is unique and cached (not calculated at runtime), then this could be faster than getting identityHashCode()-s.

tommyettinger commented 6 months ago

OK, I've been thinking about this for a while, and tinkering a lot with different options. I was surprised to find that at least one version of cuckoo hashing really can perform quite well (though not as dramatically better on containsKey() performance), as long as it never encounters two different keys with the same exact hashCode(). I modified an existing Apache v2-licensed cuckoo hash map, reducing allocations as much as possible, and the performance gains are considerable on some operations:

POPULATE
IdentityCuckooMap  : 198676.832 ns/op
JDK IdentityHashMap: 508930.787 ns/op
CONTAINS
IdentityCuckooMap  :   1023.728 ns/op
JDK IdentityHashMap:   1418.562 ns/op
COPY
IdentityCuckooMap  :  28171.380 ns/op
JDK IdentityHashMap: 302847.095 ns/op
ITERATE
IdentityCuckooMap  :  27954.503 ns/op
JDK IdentityHashMap: 149939.553 ns/op

The catch here is that if a cuckoo hash "fails," it can enter an infinite loop or allocate memory forever, etc. Very bad worst-case properties. However, the internals of this particular cuckoo-hashed map and the existing FuryObjectMap are quite similar, and I'm working on an approach that will "flip" from using cuckoo hashing as long as it is viable, to using linear probing as FuryObjectMap currently does, if fully-colliding keys are present (or some other situation would force a rehash). I'm not sure what kind of speed penalty the flipping will impose; it is possible branch prediction will figure out that fully-colliding keys are extremely rare, and that would mean the cost would be very low relative to the above numbers. But, the code is larger for each method, which might be a problem for inlining. I'll have to benchmark and see... I know the JDK HashMap can also change its implementation to use a TreeMap-like Red-Black Tree if it encounters excessive collisions on Comparable keys, so this isn't an unprecedented model.

I'm testing the FlipMap code in my jdkgdxds repo for correctness, and I'll be benchmarking it using what should be similar code in my assorted-benchmarks repo.

chaokunyang commented 6 months ago

Great, looks pretty promising! The caller is org.apache.fury.resolver.MapRefResolver#writeRefOrNull, would be nice if it can be inlined into this method.

I took a look at https://github.com/tommyettinger/assorted-benchmarks/blob/master/jmh/src/main/java/de/heidelberg/pvs/container_bench/FlipMap.java#L207. Would it be possible to seperate the branch less hit into a separate method, so the method body is smaller for inline. JVM has a max 325 bytecode size for inline be default. And in fury, we merged the put and get together into putOrGet to reduce two hash look up into one.

tommyettinger commented 6 months ago

I've done the steps you described, and it has gone from not inlining (code too large) to inlining when hot! I keep finding various small things to change, but they're at least getting smaller. The last major change wouldn't have even affected a map with Class keys (I had accidentally left in some comparisons as == from when this was purely an identity-comparing map; they use equals() now). I'll look into putOrGet() next.

I have a nagging worry that some major part of this might not behave correctly. I have some tests from a very old version of Apache Harmony that I've been using for some data structures... If you happen to know of a better suite of tests for compatibility with Set and Map, that would be great! I know implementing the interfaces isn't critical for Fury since what's there now doesn't have to implement Map, but being able to run a standard set of tests makes me more confident these will work how they should.

tommyettinger commented 6 months ago

I'm making progress on the rest of the map types now; IdentityMap seems done and ObjectIntMap might be done or close to it. Hybridizing the two shouldn't be hard at all. Is there a particular benchmark that you think would best test the performance of these map types? It looks like running all benchmarks would take well over a day.

chaokunyang commented 6 months ago

Hi @tommyettinger , thanks for this hard work.

Fury has a src/main/java/org/apache/fury/benchmark/MapSuite.java which did some benchmarks for IdentityMap.

For end-to-end tests, src/main/java/org/apache/fury/benchmark/UserTypeSerializeSuite.java can be used. You can use following config:

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      String commandLine =
          "org.apache.fury.*UserTypeSerializeSuite.fury* -f 3 -wi 10 -i 10 -t 1 -w 2s -r 2s -rf csv";
      System.out.println(commandLine);
      args = commandLine.split(" ");
    }
    Main.main(args);
  }

org.apache.fury.benchmark.state.BenchmarkState#references should be set to true only, otherwise referecne tracking won't be enabled:

  @Param({"true"})
  public boolean references;

tommyettinger commented 6 months ago

Well, the FlipMap turned out to be significantly slower than all other maps I tried, due to an error I made when writing put() and related code. With put() fixed so it actually puts a key and value in place, FlipMap is quite slow... I've been thinking about alternatives, though.

I've never used ClassValue, myself, but could it maybe be possible to avoid any Class-based hashtable if a ClassValue was registered per-Fury-instance that stores (somehow) a registered ID for a Class, in the Class itself? I also looked into hacky ways that might allow sun.misc.Unsafe to store an id directly in a Class, but even if that could work, it doesn't allow multiple Fury instances to have different registered classes. ClassValue might be able to do it, but I really have no idea what I am doing with that code.

chaokunyang commented 6 months ago

What do you mean use class value? You mean use it to cache a I'd for class? This will have a hash look up cost too

apache / fury

[Java] optimize map performance for reference tracking and serializer dispatch #1409

Is your feature request related to a problem? Please describe.

Additional context