deaktator opened this issue 7 years ago
Here's a question for the consumer: which of the following is easier to consume? (A note on matching over a type indicator: the Scala `match` should use the `@switch` annotation, and the byte code should be checked for a `tableswitch` instruction. See here for more details.) For instance:
```json
{
  "scores": [
    { "value": 1 },
    { "value": "car" },
    { "value": true }
  ]
}
```

vs.

```json
{
  "scores": [
    { "intValue": 1, "valueType": "int" },
    { "stringValue": "car", "valueType": "string" },
    { "booleanValue": true, "valueType": "boolean" }
  ]
}
```
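For what it's worth, the second encoding also admits constant-time dispatch, per the `@switch` note above. A minimal Scala sketch, assuming a hypothetical mapping of `valueType` to integer tags (none of these names come from a proposed schema):

```scala
import scala.annotation.switch

object ScoreTags {
  // Hypothetical integer tags standing in for the valueType indicator.
  final val IntTag     = 0
  final val StringTag  = 1
  final val BooleanTag = 2

  // @switch makes scalac warn unless this compiles to a table/lookupswitch;
  // the emitted byte code should still be checked.
  def describe(tag: Int, intValue: Int, stringValue: String, booleanValue: Boolean): String =
    (tag: @switch) match {
      case IntTag     => s"int: $intValue"
      case StringTag  => s"string: $stringValue"
      case BooleanTag => s"boolean: $booleanValue"
      case _          => "unknown"
    }
}
```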
Another question: what to do about `null`? That is, `{ "value": null }` vs. `{}`, and `{ "intValue": null, "valueType": "int" }` vs. `{ "valueType": "int" }`.

I lean toward the latter in both scenarios.
Something like the following is also possible, where each score type has an associated record type. This has icky syntax implications (but does allow homogeneous arrays):
```
@namespace("com.github.deaktator.scores")
protocol ScoreProtocol {
  record IntScore { int _1 = 0; }
  record StringScore { string _1 = ""; }
  record DoubleVectorScore { array<double> _1 = []; }

  record Score {
    union { null, IntScore, StringScore, DoubleVectorScore } value = null;
  }
}
```
```scala
val s = Score.newBuilder.setValue(IntScore.newBuilder.set_1(1).build).build
val value = s.value._1 // <-- YUCK!
```
Again, this stuff could likely be made pretty with type classes (see the sketch below), so it's not really a big deal, but this doesn't seem to provide much over the current implementation except for homogeneous vector types.
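For instance, a sketch against the hypothetical generated classes above (using the public-field access style from the `s.value._1` example):

```scala
// One type class instance per score type hides the _1 indirection.
trait Extract[A] {
  def from(s: Score): Option[A]
}

object Extract {
  implicit val intExtract: Extract[Int] = new Extract[Int] {
    // Score.value is an Object (the union), so match on the record type.
    def from(s: Score): Option[Int] = s.value match {
      case i: IntScore => Some(i._1)
      case _           => None
    }
  }

  def typedValue[A](s: Score)(implicit ev: Extract[A]): Option[A] = ev.from(s)
}
```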
Something like this might also be possible:
```
@namespace("com.github.deaktator.scores")
protocol ScoreProtocol {
  record ModelId {
    union { null, long } id = null;
    union { null, string } name = null;
  }

  record IntScore {
    union { null, ModelId } modelId = null;
    union { null, int } value = null;
    union { null, array<Score> } subvalues = null;
  }

  record DoubleVectorScore {
    union { null, ModelId } modelId = null;
    union { null, array<double> } value = null;
    union { null, array<Score> } subvalues = null;
  }

  record Score {
    union { null, IntScore, DoubleVectorScore } value = null;
  }
}
```
This last suggestion would be nice, as it reflects the idea of RootedTreeAuditor; however, there are many problems with this in Avro, and it might not be feasible. For instance, here are just a few related issues:

All of these issues are manifestations of the same problem: Avro doesn't seem to support mutually recursive types.
This ticket seems promising: AVRO-1723. It has been merged to master but doesn't appear to be in any release yet (as of 1.8.2), because the issue still exists.
```
avro-tools 2>&1 | head -n1 | ggrep -Po '\d(\.\d+)*'
```

outputs `1.8.2`.
```
// mut_rec_protocol.avdl
@namespace("test")
protocol MutRecProtocol {
  record C1 {
    union { null, C2 } next = null;
  }

  record C2 {
    union { null, int } value = null;
    union { null, C1 } next = null;
  }
}
```
Running `avro-tools` still produces the error:

```
avro-tools idl mut_rec_protocol.avdl
```

outputs:
```
Exception in thread "main" org.apache.avro.compiler.idl.ParseException: Undefined name 'test.C2', at line 4, column 19
    at org.apache.avro.compiler.idl.Idl.error(Idl.java:68)
    at org.apache.avro.compiler.idl.Idl.ReferenceType(Idl.java:875)
    at org.apache.avro.compiler.idl.Idl.Type(Idl.java:789)
    at org.apache.avro.compiler.idl.Idl.UnionDefinition(Idl.java:195)
    at org.apache.avro.compiler.idl.Idl.Type(Idl.java:807)
    at org.apache.avro.compiler.idl.Idl.FieldDeclaration(Idl.java:590)
    at org.apache.avro.compiler.idl.Idl.RecordDeclaration(Idl.java:554)
    at org.apache.avro.compiler.idl.Idl.NamedSchemaDeclaration(Idl.java:155)
    at org.apache.avro.compiler.idl.Idl.ProtocolBody(Idl.java:389)
    at org.apache.avro.compiler.idl.Idl.ProtocolDeclaration(Idl.java:229)
    at org.apache.avro.compiler.idl.Idl.CompilationUnit(Idl.java:116)
    at org.apache.avro.tool.IdlTool.run(IdlTool.java:65)
    at org.apache.avro.tool.Main.run(Main.java:87)
    at org.apache.avro.tool.Main.main(Main.java:76)
```
Many of these tickets are not only unresolved, but it also seems that some of the authors are ideologically opposed to the idea. Additionally, patches and PRs are very slow to get into Avro: AVRO-1723 was started in 2015 and was only merged in May 2017.
And here are some other comments:
> Thinking a bit more about this request.... I am not sure it is a very good idea. It is simple, of course, to see how this works from a Java implementation. Java already supports circular references, so I thought maybe Avro could as well. However, Avro is not Java. Just consider the `getSchema().toString()` that this would cause. Normally, you would get the completely de-referenced schema of the object. If Avro supported circular references, this would be infinitely long, continually bouncing back and forth between the references. Changes could be made to still support how that is represented, but I am not sure I like the impact.
>
> —Doug Houck
## Current Implementation
The current `aloha_avro_score.avdl` uses Avro's support for co-products based on the built-in `union` type. As such, the description of a score looks like:
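(The schema itself isn't reproduced here; the following is a minimal sketch of the union-based shape, with an illustrative field set rather than the actual contents of `aloha_avro_score.avdl`.)

```
@namespace("com.github.deaktator.scores")
protocol ScoreProtocol {
  record Score {
    union { null, boolean, int, long, float, double, string, array<double> } value = null;
  }
}
```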
## Deficiencies

There are some deficiencies with this approach:

- Compiling this with the `avro-tools` tool chain results in Java code with a `Score.value` of type `Object`, since Java has poor support for co-products.
- Most non-default tool chains don't support `union` types with 3 or more constituent types, and they only support 2-type `union`s when one of the types is `null`. Support for the latter exists to encode `Option` types in Scala.
- `union` types in Avro can't support a union of different `array`s. Therefore, `array` types appearing in a `union` must be mixed, even if that's not desired (as it's a more lax type).

## Proposal

This proposal is to make the Avro `Score` message more like the protobuf version, which encodes co-products via multiple nullable variables, one per desired output type, plus a type indicator variable. While the type indicator variable isn't strictly necessary, it can speed lookups from O(T) (where T is the number of types) to O(1). Obviously, for this to be type-safe, we need type classes (one instance per possible type) to extract the `value` from the `Score`. This approach will work better with current tool chains, which will allow us to not have to special-case Aloha scores.
## Avro IDL for Aloha Score
So, the proposed Avro description of `Score` would be something like:
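(A minimal sketch; the exact field set is open, and the names below mirror the JSON example above rather than a finalized schema.)

```
@namespace("com.github.deaktator.scores")
protocol ScoreProtocol {
  record Score {
    union { null, string } valueType = null;
    union { null, boolean } booleanValue = null;
    union { null, int } intValue = null;
    union { null, long } longValue = null;
    union { null, float } floatValue = null;
    union { null, double } doubleValue = null;
    union { null, string } stringValue = null;
    union { null, array<double> } doubleArrayValue = null;
  }
}
```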
## Aloha Score extensions

### Scala Extraction Syntax
Like in the protobuf case for protobuf Scores, we should have type classes to extract the value. For simplicity, it would be nice to have `get` and `apply` syntax analogous to Scala's `scala.collection.GenMapLike` trait. This should be very easy to accomplish and would look like this:
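(A sketch of the desired surface; `ScoreValue` and the `intValue` field are hypothetical, with `Score` standing in for the class generated from the proposed schema, whose nullable `int` field yields a `java.lang.Integer` getter.)

```scala
trait ScoreValue[A] {
  def valueOf(s: Score): Option[A]
}

object ScoreValue {
  // One instance per extractable type, e.g., Int against the intValue field.
  implicit val intScoreValue: ScoreValue[Int] = new ScoreValue[Int] {
    def valueOf(s: Score): Option[Int] = Option(s.getIntValue).map(_.intValue)
  }
}

object syntax {
  // get mirrors GenMapLike.get (Option); apply mirrors GenMapLike.apply (throws).
  implicit class RichScore(private val s: Score) extends AnyVal {
    def get[A](implicit ev: ScoreValue[A]): Option[A] = ev.valueOf(s)
    def apply[A](implicit ev: ScoreValue[A]): A =
      get[A].getOrElse(throw new NoSuchElementException("no value of the requested type"))
  }
}

// Usage, once the syntax is imported:
//   import syntax._
//   score.get[Int]    // Option[Int]
//   score.apply[Int]  // Int, or throws
```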
This should involve importing the syntax, but the conversions should all be in the appropriate implicit scope so that converters don't need to be explicitly pulled into scope.

### Additional Coproduct Support
Perhaps it's also useful to consider coproducts at the type level using an encoding with something like cats or iota.
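As a library-free illustration of the idea (cats or iota would give a more ergonomic n-ary encoding than nested `Either`s):

```scala
object TypeLevel {
  // A type-level coproduct of score value types, encoded with nested Eithers.
  type ScoreCoproduct = Either[Int, Either[String, Vector[Double]]]

  def describe(c: ScoreCoproduct): String = c match {
    case Left(i)         => s"int: $i"
    case Right(Left(s))  => s"string: $s"
    case Right(Right(v)) => s"doubles: ${v.mkString(", ")}"
  }
}
```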
### Java Extraction Syntax
T.B.D.
Perhaps something like a rich score wrapper with `getT` methods for each output type `T`. These should probably throw when the incorrect type is requested. A determination about up-casting should be considered: for instance, what happens when the consumer asks for a `Double` but the type was a `Long` or a `Float`? The current protobuf Score code does this, but it should be readdressed at this time, and the Avro and protobuf versions should be made consistent. Perhaps common interfaces could be extracted, and we could have rich implementations that provide more seamless interoperability.
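One possible widening policy, sketched with hypothetical names (the flat encoding's indicator and fields):

```scala
object WideningSketch {
  // Should a request for Double tolerate a stored Long or Float?
  def getDouble(valueType: String, longValue: Long, floatValue: Float, doubleValue: Double): Double =
    valueType match {
      case "double" => doubleValue
      case "long"   => longValue.toDouble   // lossless only for |v| <= 2^53
      case "float"  => floatValue.toDouble  // exact widening
      case other    => throw new IllegalArgumentException(s"not a numeric score: $other")
    }
}
```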
## Implementation Details

Map types are coming to Aloha soon for multilabel learning, and this presents a problem: quadratically many fields, one per combination of key and value types, will be required to encode this kind of score coproduct. Therefore, it might be a good time to think about generating the Avro schema programmatically. This might be possible with something like the sbt-boilerplate plugin. If not, we might want to think about writing such a plugin.
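One way to do the programmatic generation, sketched here with Avro's `SchemaBuilder` rather than a source-generation plugin (field names are illustrative, and only the scalar fields are shown; the same fold would emit the quadratic map combinations):

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.SchemaBuilder.FieldAssembler

object ScoreSchemaGen {
  private type FA = FieldAssembler[Schema]

  // Each entry adds one optional (nullable, default-null) field to the record.
  private val scalarFields: Seq[FA => FA] = Seq(
    _.optionalBoolean("booleanValue"),
    _.optionalInt("intValue"),
    _.optionalLong("longValue"),
    _.optionalFloat("floatValue"),
    _.optionalDouble("doubleValue"),
    _.optionalString("stringValue")
  )

  val schema: Schema =
    scalarFields
      .foldLeft(SchemaBuilder.record("Score").namespace("com.github.deaktator.scores").fields()) {
        (fields, addField) => addField(fields)
      }
      .endRecord()
}
```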