[Scala] Allow customizing reference tracking

ghostdogpr commented 11 months ago

Is your feature request related to a problem? Please describe.

Disabling reference tracking gives much better performance, but it is a bit dangerous with some types.

What we currently do with Kryo is write a custom ReferenceResolver that implements its own public boolean useReferences(Class type) method. That way, we dynamically disable reference tracking for all our "known" safe types, but still use it for "unknown" types that may be circular (for example, a Throwable can be circular through its cause).
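
To make the idea concrete, here is a minimal standalone sketch of such a per-class policy. It deliberately does not depend on Kryo or Fury: `SafeTypeRefPolicy` and its allowlist are hypothetical names used only for illustration, and only the shape of `useReferences(Class)` follows the hook described above.

```java
import java.util.Set;

// Hypothetical helper, not part of Kryo or Fury: decides per class whether
// reference tracking is needed, based on an allowlist of known-safe types.
final class SafeTypeRefPolicy {
    private final Set<Class<?>> safeTypes;

    SafeTypeRefPolicy(Set<Class<?>> safeTypes) {
        this.safeTypes = Set.copyOf(safeTypes);
    }

    // false = skip tracking for this type; true = track it, because the type
    // is unknown and may be shared or circular (e.g. a Throwable and its cause).
    boolean useReferences(Class<?> type) {
        if (type.isPrimitive() || type.isEnum()) {
            return false;
        }
        return !safeTypes.contains(type);
    }
}

final class SafeTypeRefPolicyDemo {
    public static void main(String[] args) {
        SafeTypeRefPolicy policy =
            new SafeTypeRefPolicy(Set.of(String.class, Integer.class));
        System.out.println(policy.useReferences(String.class));           // false
        System.out.println(policy.useReferences(RuntimeException.class)); // true
    }
}
```

In a real setup this predicate would be called from the serializer's reference-resolution hook; anything not on the allowlist falls back to tracking, which is the safe default.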

Describe the solution you'd like

A way to customize reference tracking.

Additional context

For reference, here are our current benchmarks:

[info] Benchmark                         Mode  Cnt   Score    Error  Units
[info] KryoBenchmarks.baseKryo          thrpt    5   2.567 ±  0.181  ops/s
[info] KryoBenchmarks.customKryo        thrpt    5  11.826 ±  0.348  ops/s
[info] KryoBenchmarks.noTrackingKryo    thrpt    5  13.822 ±  0.376  ops/s
[info] KryoBenchmarks.baseFury          thrpt    5   5.254 ±  0.414  ops/s
[info] KryoBenchmarks.noTrackingFury    thrpt    5  21.651 ± 10.453  ops/s

Fury without tracking is about 1.6x faster than Kryo without tracking (21.651 vs. 13.822 ops/s), although the error margin on that Fury run is large. So we are hopeful that with customized tracking Fury would outperform our customized Kryo setup.

chaokunyang commented 11 months ago

Good point! I have wanted to add such a feature before but haven't had the time for it.

We have a writeRef check method in io.fury.resolver.ClassResolver:

  public boolean needToWriteRef(Class<?> cls) {
    if (fury.trackingRef()) {
      ClassInfo classInfo = getClassInfo(cls, false);
      if (classInfo == null || classInfo.serializer == null) {
        // TODO group related logic together for extendability and consistency.
        return !cls.isEnum();
      } else {
        return classInfo.serializer.needToWriteRef();
      }
    }
    return false;
  }

It's used mainly in Collection/Map element serialization and during Fury codegen. Here is the design consideration we made earlier: do not invoke this method every time a new object is serialized, since it incurs a hash map lookup whose cost is similar to reference tracking itself when the object graph is small. It only gives better performance for big object graphs. In such graphs, the map of object classes is much smaller than the map of referenced objects, so querying whether to track a ref is much cheaper than tracking the ref.
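
The size difference between the two maps can be sketched as follows. This is an illustration only, not Fury's implementation: `RefCheckCostSketch` is a hypothetical name, and the two maps stand in for a per-class decision cache and an identity-based reference table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class RefCheckCostSketch {
    // Returns {number of distinct classes, number of distinct object identities}.
    static int[] mapSizes(List<Object> graph) {
        // One entry per class: "does this class need ref tracking?"
        Map<Class<?>, Boolean> perClassDecision = new HashMap<>();
        // One entry per object identity: what a reference tracker must store.
        Map<Object, Integer> refIds = new IdentityHashMap<>();
        for (Object o : graph) {
            perClassDecision.putIfAbsent(o.getClass(), Boolean.TRUE);
            refIds.putIfAbsent(o, refIds.size());
        }
        return new int[] {perClassDecision.size(), refIds.size()};
    }

    public static void main(String[] args) {
        List<Object> graph = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            graph.add(new StringBuilder("s" + i));
            graph.add(new int[] {i});
        }
        int[] sizes = mapSizes(graph);
        // prints "classes=2, references=20000"
        System.out.println("classes=" + sizes[0] + ", references=" + sizes[1]);
    }
}
```

For a small graph both maps are tiny and the per-object class lookup costs about as much as tracking would, which is the reason given above for not performing the check on every object.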

So we made a tradeoff: if a class is registered for no-ref tracking, its subclasses are mostly treated as no-ref tracking too. This way, we can skip the reference tracking check in the generated code for polymorphic types to minimize the cost of such checks.
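
A minimal sketch of that inheritance rule, assuming a simple list of registered root classes. `InheritedNoRefRegistry` and `registerNoRef` are hypothetical names, not Fury API, and the sketch applies the rule to every subclass, whereas the text above says only "mostly".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical registry: a class registered as no-ref also switches off
// reference tracking for all of its subclasses and implementations.
final class InheritedNoRefRegistry {
    private final List<Class<?>> noRefRoots = new ArrayList<>();

    void registerNoRef(Class<?> root) {
        noRefRoots.add(root);
    }

    boolean needToWriteRef(Class<?> type) {
        for (Class<?> root : noRefRoots) {
            // True when type is root itself or any subtype of root.
            if (root.isAssignableFrom(type)) {
                return false;
            }
        }
        return true;
    }
}

final class InheritedNoRefRegistryDemo {
    public static void main(String[] args) {
        InheritedNoRefRegistry registry = new InheritedNoRefRegistry();
        registry.registerNoRef(Number.class);
        System.out.println(registry.needToWriteRef(Integer.class)); // false
        System.out.println(registry.needToWriteRef(String.class));  // true
    }
}
```

Because the decision for a field declared as `Number` already covers every concrete subclass, code generated for that field would not need to repeat the check per runtime type.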

I hope this information gives you some idea of how ref tracking works in Fury and helps you write your own RefResolver. Fury can provide a method to let you set a RefResolver factory when configuring FuryBuilder. We keep the created RefResolver as a final field of Fury to reduce field access cost, so you should pass a factory and let Fury create your RefResolver.
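
A rough sketch of what the factory approach could look like. Every name here (`SketchRefResolver`, `SketchSerializer`, `SketchBuilder`, `withRefResolverFactory`) is hypothetical and only models the proposal; none of it is existing Fury API.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a resolver that decides whether to track a type.
interface SketchRefResolver {
    boolean needToWriteRef(Class<?> type);
}

final class SketchSerializer {
    // Created once from the factory and kept in a final field, as proposed.
    private final SketchRefResolver refResolver;

    SketchSerializer(Supplier<SketchRefResolver> factory) {
        this.refResolver = factory.get();
    }

    boolean tracks(Class<?> type) {
        return refResolver.needToWriteRef(type);
    }
}

final class SketchBuilder {
    // Default factory: track references for every type.
    private Supplier<SketchRefResolver> factory = () -> type -> true;

    SketchBuilder withRefResolverFactory(Supplier<SketchRefResolver> factory) {
        this.factory = factory;
        return this;
    }

    SketchSerializer build() {
        return new SketchSerializer(factory);
    }
}

final class SketchBuilderDemo {
    public static void main(String[] args) {
        // Track only Throwable and its subclasses, which may be circular.
        SketchSerializer serializer = new SketchBuilder()
            .withRefResolverFactory(() -> type -> Throwable.class.isAssignableFrom(type))
            .build();
        System.out.println(serializer.tracks(String.class));                // false
        System.out.println(serializer.tracks(IllegalStateException.class)); // true
    }
}
```

Passing a factory rather than an instance lets the serializer own the resolver's lifecycle, so each serializer instance gets its own resolver state.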

chaokunyang commented 11 months ago

Another option is for Fury to provide a method such as trackingRef(Class, bool), which you can invoke to control which classes are serialized by ref.

Alternatively, Fury could provide an annotation to let you mark your classes or fields with trackingRef; this is mentioned in #1148 too.

All of these approaches make sense to me, and I believe Fury will support them all in the long run.
