dapr / java-sdk

Dapr SDK for Java
Apache License 2.0
262 stars 207 forks source link

differentiate between pass-through and byte[] for DaprObjectSerializer #500

Open xiazuojie opened 3 years ago

xiazuojie commented 3 years ago

Expected Behavior

byte[] can means 2 things.

Actual Behavior

Current DaprObjectSerializer does not differentiate between pass-through byte[] and normal byte[] values. The DefaultObjectSerializer treats byte[] as always pass-through.

There are cases when serializing/deserializing byte[] as normal values is desired. For example, for json, it's desired to serialize byte[] as base64-encoded String because there's no other way to represent byte[] as legal JSON values.

I suggest to change the DaprObjectSerializer to follows:

public interface DaprObjectSerializer {
    default byte[] serialize(Object o) throws IOException {
        // passThrough by default true, to make this not a breaking change.
        return serialize(o, true);
    }

    byte[] serialize(Object o, boolean passThrough) throws IOException;

    default <T> T deserialize(byte[] data, TypeRef<T> type) throws IOException {
        return deserialize(data, type, true);
    }

    <T> T deserialize(byte[] data, TypeRef<T> type, boolean passThrough) throws IOException;

    String getContentType();
}

and DefaultObjectSerializer should be changed to honor explicit passThrough param.

Steps to Reproduce the Problem

Release Note

RELEASE NOTE:

xiazuojie commented 3 years ago

On second thoughts, it's not a good design to differentiate between pass-through and byte[] with DaprObjectSerializer. Pass-through should be specified in APIs of DaprClient.

artursouza commented 3 years ago

This issue is an example of why the DefaultObjectSerializer is just for examples and getting started but not for production use. If you need byte[] to be handled as-is, then you need a custom serializer where the content-type returned is "application/octet-stream".

artursouza commented 3 years ago

@xiazuojie If you can clarify the scenario that is not working, I might be able to help further.

xiazuojie commented 3 years ago

This issue is an example of why the DefaultObjectSerializer is just for examples and getting started but not for production use. If you need byte[] to be handled as-is, then you need a custom serializer where the content-type returned is "application/octet-stream".

The comments in many APIs clearly state "use byte[] for skipping serialization". So it's supposed to be an API behavior to always "skipping serialization" for byte[]. Thus, a custom serializer is not supposed to API behaviors?

For byte[], sometimes we do not want to "skip serialization". We want byte[] to be serialized just as any other types.

        byte[] bytes = new byte[]{1, 2, 3};
       // byte[] become a base64-encoded String after serialization by Jackson
        byte[] afterSerialized = new ObjectMapper().writeValueAsBytes(bytes);

Scenario example:

We have a store named "json-store". It can only accept JSON.

client-side code

Map<String, String> map = new HashMap<>();
map.put("foo", "1");
map.put("bar", "2");
// It works. After serialization by Jackson, map becomes {"a":"1","b":"2"}
daprClient.saveState("json-store", "key1", map);
byte[] bytesData = new byte[]{1, 2, 3};
// It does not work. bytesData should be become "AQID" after serialization by Jackson. Instead, it skips the serialization and thus is not a valid JSON.
daprClient.saveState("json-store", "key2", bytesData);

"json-store" server-side code

byte[] data = receiveData();
// data is expected to be JSON strings.
String json = new String(data);
// for map, json should be {"a":"1","b":"2"}
// for bytesData, json should be "AQID"

Suggestion

Add new APIs to explicitly set data-content-type, and a custom serializer can decide serialization strategy based on data-content-type (including skip serialization). Not based on argument types.

artursouza commented 3 years ago

@xiazuojie The DaprObjectSerializer contains getContentType() method for this purpose. There was recent code fix to handle that: https://github.com/dapr/java-sdk/pull/504

The behavior of "use byte[] for skipping serialization" should be only for ObjectSerializer because it allows us to handle serialization correctly internally for the Dapr API objects. This should not be the expectation of every implementation of DaprObjectSerializer. Did you try your example with your own serializer or did you use the DefaultObjectSerializer?

xiazuojie commented 3 years ago

To support serialization negotiation (https://github.com/dapr/dapr/issues/2905), I suggest the DaprObjectSerializer interface to be changed to:

public interface DaprObjectSerializer {

  /**
   * Serializes the given object as byte[].
   *
   * @param o Object to be serialized.
   * @return Serialized object.
   * @throws IOException If cannot serialize.
   */
  byte[] serialize(Object o) throws IOException;

  /**
   * Deserializes the given byte[] into a object.
   *
   * @param data Data to be deserialized.
   * @param type Type of object to be deserialized.
   * @param <T> Type of object to be deserialized.
   * @return Deserialized object.
   * @throws IOException If cannot deserialize object.
   */
  <T> T deserialize(byte[] data, TypeRef<T> type) throws IOException;

  /**
   * Whether the serializer supports the given MIME type of the source data
   */
  boolean canSerialize(MimeType mimeType);

  /**
   * Whether the serializer supports the given source element type and the MIME type for the output data
   */
  <T> boolean canDeserialize(TypeRef<T> type, MimeType mimeType);
}
artursouza commented 3 years ago

We can accomplish the same with the current interface. getContentType() can be used to determine if the data can be deserialized too. Plus, we still need to know which mimeType is being generated by the serializer.

Another point is adding support for multiple serializers in a single client. What is the advantage of that? Apps would need to specify which mimeType to use on every request. Do apps really use multiple serializers at the same time?

xiazuojie commented 3 years ago

We can accomplish the same with the current interface. getContentType() can be used to determine if the data can be deserialized too. Plus, we still need to know which mimeType is being generated by the serializer.

Current getContentType() can only support a single mime type. With the API I suggested, a DaprObjectSerializer can support multiple mime types. For example, mime type text/plain and text/xml can be serialized by a single TextSerializer. Also, serialization can be decided based on input TypeRef<T>.

Another point is adding support for multiple serializers in a single client. What is the advantage of that? Apps would need to specify which mimeType to use on every request. Do apps really use multiple serializers at the same time?

An app needs to integrate with other systems. Most of the time, the serialization method is not decided by the app itself, but rather by other systems. For example, redis works with byte[] (application/octet-stream) while mongoDb works with BSON (application/bson). To integrate, both application/octet-stream and application/bson must be supported.

artursouza commented 3 years ago

@xiazuojie Can the same be accomplished with a different DaprClient instance per serializer? The original design was that for apps that need multiple serializers, they could simply instantiate a different DaprClient.

xiazuojie commented 3 years ago

@xiazuojie Can the same be accomplished with a different DaprClient instance per serializer? The original design was that for apps that need multiple serializers, they could simply instantiate a different DaprClient.

Yes. It's the workaround we are currently using. But it's not ideal.

https://github.com/dapr/dapr/issues/2905#issuecomment-796537346: The users are forced to work around this by treating raw byte[] as "skipping serialization", which brings another issue. Another workaround (the one we are currently using) is creating multiple instances of Dapr client with different DaprObjectSerializer in order to work with different components.

artursouza commented 3 years ago

@xiazuojie Can the same be accomplished with a different DaprClient instance per serializer? The original design was that for apps that need multiple serializers, they could simply instantiate a different DaprClient.

Yes. It's the workaround we are currently using. But it's not ideal.

dapr/dapr#2905 (comment): The users are forced to work around this by treating raw byte[] as "skipping serialization", which brings another issue. Another workaround (the one we are currently using) is creating multiple instances of Dapr client with different DaprObjectSerializer in order to work with different components.

I understand the need to support multiple serializers within the same app. I wonder is there is a way to achieve that with a simpler interface. I will think about alternatives. I am thinking if we can have a solution where an all-in-one serializer is just an implementation detail.

xiazuojie commented 3 years ago

@artursouza Have you found alternative solution to support serialization negotiation?

artursouza commented 3 years ago

I thought about a new interface for the SDK to use and that enabled negotiation. Then, multiple serialization negotiation can be an implementation of that and not a core feature of the SDK.

So, the following interface would be part of the core SDK:

public interface DaprObjectSerializer {

  SerializedObject serialize(Object o) throws IOException;

  <T> T deserialize(SerializedObject serializedObject, TypeRef<T> type) throws IOException;
}

public class SerializedObject {
   // encapsulate properties, obviously.
   MimeType mimeType;
   byte[] data;
}

This way, an implementation of DaprObjectSerializer can be implemented with the scenario you described. Any serialization implementation should be outside the core SDK - similar to how we did for Springboot. So, it is opt-in. On the other hand, maintainers in Dapr are still hesitant to even offer that. For now, opinion about serialization belongs to applications. So, you could implement the multiple serialization on your end and we offer the interface to enable it.

WDYT?

xiazuojie commented 3 years ago

Both serialization and deserialization needs to consider serialization-type.

public interface DaprObjectSerializer {

  // Serialize according to specified mimeType
  byte[] serialize(Object o, MimeType mimeType) throws IOException;

  // Deserialize according to specified mimeType
  <T> T deserialize(byte[] data, TypeRef<T> type, MimeType mimeType) throws IOException;

  // getContentType() is removed
  //String getContentType();
}

Also, for this to work, mimeType needs to be part of request and response objects.

artursouza commented 3 years ago

That interface is too specific for the multi-serialization solution - which is one implementation of the interface.

The difference here is who decides which mimetype to use when serializing an object and when:

  1. The app when making one request into DaprClient()
  2. The app when instantiating DaprClient(), providing the implementation of DaprObjectSerializer()

Your solution seems to assume approach 1 while our current design assumes approach 2.

Is my perception correct? How would you see one way vs the other?

xiazuojie commented 3 years ago

That interface is too specific for the multi-serialization solution - which is one implementation of the interface.

Yes, that's the reason why I suggest the following interface in the first place. The serializer impl only decides whether it supports specified MimeType. It's DaprClient that pick a suitable serializer for specified MimeType.

public interface DaprObjectSerializer {
  /**
   * Serializes the given object as byte[].
   *
   * @param o Object to be serialized.
   * @return Serialized object.
   * @throws IOException If cannot serialize.
   */
  byte[] serialize(Object o) throws IOException;

  /**
   * Deserializes the given byte[] into a object.
   *
   * @param data Data to be deserialized.
   * @param type Type of object to be deserialized.
   * @param <T> Type of object to be deserialized.
   * @return Deserialized object.
   * @throws IOException If cannot deserialize object.
   */
  <T> T deserialize(byte[] data, TypeRef<T> type) throws IOException;

  /**
   * Whether the serializer supports the given MIME type of the source data
   */
  boolean canSerialize(MimeType mimeType);

  /**
   * Whether the serializer supports the given source element type and the MIME type for the output data
   */
  <T> boolean canDeserialize(TypeRef<T> type, MimeType mimeType);
}

The difference here is who decides which mimetype to use when serializing an object and when:

  1. The app when making one request into DaprClient()
  2. The app when instantiating DaprClient(), providing the implementation of DaprObjectSerializer()

Yes, I prefer solution 1. Solution 1: DaprClient is not coupled with specific serialization method. There can be a single global instance of DaprClient.

artursouza commented 3 years ago

@xiazuojie I am not a fan of this approach. I would need to find a solution that is less verbose and does not require mimetype to be passed in on every request. It does not make sense for an app to use multiple serialization mimetypes at once - I am still not convinced this is a common scenario (yes, it can exist).

Another way I was thinking was to have a "DaprChannel" instance that is globally unique (singleton) and then "DaprClient" uses a serializer and a "DaprChannel" and can be disposed or reused. The current design could somewhat be refactored into that.