
Update schema2* tools to support arbitrarily named schemas #3663

Open · piotrswigon opened this issue 4 years ago

piotrswigon commented 4 years ago

Today the schema2* tools rely on schemas having a name, ideally a single name, which is used as the name of the generated entity class.

This is suboptimal, as schemas can have zero names or more than one name, e.g.

// This schema has no names
schema * as MySchema
  Text value

// This schema has 2 names
schema Object Thing
  Text value

or more actionably, on the particle spec:

particle Reader
  in * {Text value} object
  in Product Element Thing {Text value} myThing

Schema2* tools only need to generate entity classes for schemas that are defined in particle specs. The proposal is to use the particle name and the connection name as the name of the generated entity class, e.g. Reader.Object, Reader.MyThing.

Kotlin support is more of a ~P1, C++ support more of a ~P2, but it would be best to do both at the same time and be done with it.

alxmrs commented 4 years ago

Is it alright if the codegened name is namespaced with an underscore, e.g. Reader_MyThing instead of Reader.MyThing?
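
A minimal sketch of what the generated classes might look like under this naming (assuming Text maps to std::string; the real generated classes would carry more machinery):

#include <string>

// Hypothetical schema2cpp output for the Reader spec above.
class Reader_Object {
 public:
  std::string value;
};

class Reader_MyThing {
 public:
  std::string value;
};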

mykmartin commented 4 years ago

This approach has a fairly significant problem: if multiple particles use exactly the same inline schema we would expect them to operate on the same entity type, but in C++ they will be distinct classes that do not naturally interoperate.

One possible alternative would be to generate the "real" class name by hashing the schema in some normalised manner, then using type aliases for convenient programmatic usage.
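
For example (a sketch only; the hash and the alias names are illustrative):

#include <string>

// One class per normalised schema, named by its hash; type aliases give
// each particle/connection a convenient name for the same class.
class Schema_9f8a2c {  // hypothetical hash of the normalised {Text value}
 public:
  std::string value;
};
using Reader_Object = Schema_9f8a2c;
using Writer_Output = Schema_9f8a2c;  // a different particle, same schema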

piotrswigon commented 4 years ago

Yes, we discussed this. Kotlin also has type aliases that we can use to dedup.

I would say skip optimisation unless it is relatively trivial, but we do need to keep track of that.

At this point in time it is not clear how wasm modules will be used - if we end up having many small wasm modules, then this is a minor issue.

piotrswigon commented 4 years ago

@alxrsngrtn Yes, Reader_MyThing is acceptable if Reader.MyThing is hard to achieve (I presume because a Reader class generated by schema2kt would conflict with the actual particle class?).

mykmartin commented 4 years ago

I don't think this is about optimisation. Say I have:

particle Foo
  in Data {Text t} input
  out [Data {Text t}] output

...then Foo.Input and Foo.Output will be different classes, and output.store(input.get()) won't work.

I suppose we could detect that within particles, and ensure a single underlying class is generated with aliases. If two different particles have the same inline schema, they can only interact via handles, which means JS mediates the transfer and the type misalignment shouldn't matter. Hmm, it might work.

piotrswigon commented 4 years ago

Interesting!

We probably need to be smarter, e.g. for

particle Foo
  in Object Data Thing {Text t, Number value} input
  out [* {Number value}] output

You should be able to do output.store(input.get()) even though the classes are not the same. I presume that cannot be done at compile time, can it?

Maybe if we do an analysis of the schemas we can make one class a subclass of the other. But that doesn't work either, because subclassing does not match the lattice typing of Arcs. E.g.

particle Foo
  in Object Data Thing {Text t, Number n, Bytes b} input
  out [* {Number n, Text t}] output1
  out [* {Number n, Bytes b}] output2

Now input would need to subclass both output1 and output2, and you have the problem of Number n being there twice. Could it work in a language supporting mixins natively?
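
In plain C++ terms the problem looks like this (a sketch; field types assumed):

#include <cstdint>
#include <string>
#include <vector>

struct Foo_output1 { double n; std::string t; };
struct Foo_output2 { double n; std::vector<std::uint8_t> b; };
// input would need both bases, but then 'n' exists twice:
struct Foo_input : Foo_output1, Foo_output2 {};
// Foo_input e; e.n = 1.0;  // error: member 'n' is ambiguous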

Do we need to fall back to runtime checks?

piotrswigon commented 4 years ago

It looks like we need to rework how handles are typed in C++ and Kotlin. In both cases they gate storing on type membership... @mykmartin, thoughts?

mykmartin commented 4 years ago

We haven't started looking at type slicing or structured data alignment in wasm yet. I don't think your first example can be handled at compile time - and even if it could, it may risk a combinatorial explosion of functions mapping between the different known (compatible) schemas.

So yes, dynamic reflection would need to be added to the entity classes to make this work in an automated way. Of course, devs can always just read and set the common fields manually.

So do we discard typed entity classes altogether (as you just implied)? Just thinking on the fly: entities would be a single, concrete class with maps of string-named fields, separated by primitive type:

#include <map>
#include <string>

class Entity {
 public:
  // Get/set a number-typed field by name (not a field called 'num').
  double getNum(const std::string& name) { return num_fields[name]; }
  void setNum(const std::string& name, double val) { num_fields[name] = val; }

  std::map<std::string, double> num_fields;
  std::map<std::string, std::string> str_fields;
  ...
};

That would make for a fairly cumbersome API. An array-style notation (entity['field']) would be nicer, but you can't overload on return type, so that doesn't work.
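
For instance, even trivial field access goes through string-keyed map lookups (sketch):

Entity e;
e.setNum("n", 42.0);
e.str_fields["t"] = "hello";
double n = e.getNum("n");  // no compile-time check that 'n' exists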

mykmartin commented 4 years ago

What if we still had schema-specific accessor fields (e.foo()) on typed generated classes, but have them inheriting from a common non-template base containing reflection info? Then lower-level storage APIs (i.e. Singleton.set et al) would operate on the base class and do everything via reflection.
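
Roughly like this (a sketch; the names are invented for illustration):

#include <map>
#include <string>

// Common non-template base carrying the reflection data.
struct EntityBase {
  std::map<std::string, double> num_fields;
  std::map<std::string, std::string> str_fields;
};

// Generated per-schema class: typed accessors over the base's storage.
class Foo_Input : public EntityBase {
 public:
  double n() const { return num_fields.at("n"); }
  void set_n(double v) { num_fields["n"] = v; }
};

// Lower-level storage APIs accept the base and work through the maps.
void set_singleton(const EntityBase& e);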

piotrswigon commented 4 years ago

Idea: what if we typed handles with interfaces (Java / Kotlin) / abstract classes (C++) instead of concrete classes?

Interfaces don't have the constraints of concrete classes and you could have multiple inheritance without conflicts (I'm not 100% sure you can extend multiple abstract classes with the same method signature in C++, but you would know).

If we do that, I think we can have type safety at compile time.

e.g.

particle Foo
  inout Data Thing {Text t, Number n} input1
  in Object Data Thing {Text t, Number n, Bytes b} input2
  out [Data {Number n, Text t}] output1
  out [* {Number n, Bytes b}] output2

would generate 4 interfaces (excuse pseudocode):

Foo_input1 extends Foo_output1
  Text t()
  Number n()

Foo_input2 extends Foo_input1, Foo_output2, Foo_output1
  Text t()
  Number n()
  Bytes b()

Foo_output2
  Number n()
  Bytes b()

Foo_output1
  Number n()
  Text t()

With this, if you have a Handle<Foo_output1> you can store in it any of Foo_output1, Foo_input2, Foo_input1. You would obviously need some concrete implementations that are returned by handle.get() / toList() as well.

Does this make sense?

alxmrs commented 4 years ago

Now input would need to subclass both output1 and output2, and you have the problem of Number n being there twice. Could it work in a language supporting mixins natively?

Kotlin doesn't have mixins, but it does have traits (the difference being traits don't contain state; interfaces are an alias for traits). C++ has multiple inheritance, so this is potentially feasible. Certainly an interesting approach!

Interfaces don't have the constraints of concrete classes and you could have multiple inheritance without conflicts (I'm not 100% sure you can extend multiple abstract classes with the same method signature in C++, but you would know).

This should totally be possible in C++. I like this idea. I think we could use a combination of interface inheritance (treating them like traits) and type-aliases to provide good compile time support.

alxmrs commented 4 years ago

I definitely prefer the interface approach over going down the dark path of reflection.

What if we still had schema-specific accessor fields (e.foo()) on typed generated classes, but have them inheriting from a common non-template base containing reflection info? Then lower-level storage APIs (i.e. Singleton.set et al) would operate on the base class and do everything via reflection.

If I understand you correctly, this means we could "simulate" static typing via the handles API? Could you go into this more?

piotrswigon commented 4 years ago

@cromwellian for thoughts.

mykmartin commented 4 years ago

Nice idea Piotr. For C++ I think this can work, with some details to be ironed out. Using Piotr's example particle, something like:

Foo_Internal_1
  virtual Text t()
  virtual Number n()

Foo_Internal_2
  virtual Number n()
  virtual Bytes b()

Foo_input1 implements Foo_Internal_1
  Text t()
  Number n()

Foo_input2 implements Foo_Internal_1, Foo_Internal_2
  Text t()
  Number n()
  Bytes b()

Foo_output2 implements Foo_Internal_2
  Number n()
  Bytes b()

Foo_output1 implements Foo_Internal_1
  Number n()
  Text t()

Handles will need to encode/decode entities via the virtual interface, not the concrete type. This might need some sort of type helper in the concrete classes - I'm still working through the details.
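
Something like this on the handle side (a sketch; the encoding details are elided):

// Handles are typed with the interface, not a concrete class, so any
// implementing entity can be stored; encoding walks the virtual accessors.
template <typename Iface>
class Singleton {
 public:
  void set(const Iface& entity) {
    // encode via entity.n(), entity.t(), ... (virtual calls)
  }
};

// Singleton<Foo_Internal_1> handle;
// handle.set(foo_input1);  // OK: Foo_input1 implements Foo_Internal_1
// handle.set(foo_input2);  // OK: so does Foo_input2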

mykmartin commented 4 years ago

So this turns out to be a lot more complex than it seems. Since you can have arbitrarily complex subset relationships between schemas, you end up with a lattice of interfaces and concrete classes. It very quickly runs into a diamond inheritance problem of arbitrary depth/complexity. Josh and I have tried to find a clean solution but couldn't find a way to build a class hierarchy allowing automatic slicing without using virtual inheritance everywhere (which feels like a very heavyweight approach).

Our current thinking is to keep the concrete entity classes as they are (modulo aliasing or simple inheritance for inline schemas) and to add template constructors to allow slicing:

#include <string>

class Foo_input1 {
 public:
  // Slicing constructor: any type exposing t() and n() can construct this.
  template <typename T>
  Foo_input1(const T& x) : t_(x.t()), n_(x.n()) {}

  std::string t() const { return t_; }
  double n() const { return n_; }

 private:
  std::string t_;
  double n_;
};

So any other entity with the same fields can be used to construct a Foo_input1; those without won't compile. Since the constructor can be used implicitly, the a.store(b.get()) pattern should just work.
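
Usage would look something like this (a sketch; Foo_input2 is assumed to expose t(), n() and an extra b()):

Foo_input2 wide = input2.get();
Foo_input1 narrow(wide);      // compiles: wide provides t() and n()
output1.store(narrow);
output1.store(input2.get());  // implicit conversion, same effect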

This gets more tricky when you start thinking about optional fields. With all entity fields being required, moving between entity types can only narrow the field set (A{x,y} cannot write to *{x,y,z}). When some fields are optional, we probably want to allow widening as well (A{x,y} can write to *{x,y,z?}).

mykmartin commented 4 years ago

One flaw with the template approach above: it doesn't account for different schema names. So Foo{Text t} and Bar{Text t} should not be compatible, but the template constructor isn't aware of the name mismatch.
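
Concretely (a sketch with hypothetical single-field entities):

// Foo {Text t} and Bar {Text t} generate structurally identical classes,
// so the template constructor accepts the cross-schema conversion:
Bar_data b = bar_handle.get();
Foo_data f(b);  // compiles: b.t() exists; schema names are never checked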

piotrswigon commented 4 years ago

It seems to me that you only need to consider entities that are used by a particle or, at most, entities used in a wasm module, as particles running in the same wasm module could exchange entities directly.

Given that, there's a relatively small set of schemas to consider. I expect typical particles will have 2-6 connections, and you may have a few particles in a module, so there are at most a few dozen entities to analyse for subset relationships.

The reason why we're thinking about C++ abstract classes / Kotlin traits is that they allow diamond inheritance, and without diamond inheritance you cannot model the type lattice.

@mykmartin can you elaborate on why you would prefer to avoid virtual inheritance, given we don't have another solution for modelling the type lattice? Also, as you noted, the schema names need to be considered as well. If we do schema subsetting analysis at entity class generation time we can take that into account and handle it fine.

I am wondering whether the constructor approach would hit any issues with "I create a new entity of this type" vs. "I just use a constructor to write this entity on this connection - but please don't create a new entity, I want the same Arcs entity written".

mykmartin commented 4 years ago

Fair bit to unpack here...

@Cypher1 came up with solutions for both the optional fields and schema name matching concerns I mentioned previously.

It seems to me that you only need to consider entities that are used by a particle or, at most, entities used in a wasm module, as particles running in the same wasm module could exchange entities directly.

Yes, we want to consider schemas entirely within the scope of a single particle. Particles from the same wasm module should not exchange entities other than via a handle, in which case they'll have a schema for the connection. (If they do share things directly they should fail dynamic dataflow analysis.)

can you elaborate on why you would prefer to avoid virtual inheritance, given we don't have another solution for modelling the type lattice?

At first I thought we would have to mark every inheritance as virtual to avoid complex analysis, but actually it's only needed on those classes that inherit from a common one - and even then, we can impose a very simple condition of 'if B and C inherit from A and both have sub-classes, use virtual' without needing to fully discover all diamond patterns in a given graph.

Also, it's not really as heavyweight as I first thought - it just adds an extra indirection on base member lookups, so not a huge cost.

My final concern was based on a mistaken understanding of implicit conversions. I thought I had found that, given the standard diamond pattern A < B,C > D, D couldn't be implicitly referenced as A - but I clearly messed up the virtual inheritance when checking, because it works just fine.
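
For reference, the working form (a sketch):

struct A { virtual ~A() = default; virtual double n() const = 0; };
struct B : virtual A {};  // virtual bases: B and C share a single A
struct C : virtual A {};
struct D : B, C {
  double n() const override { return n_; }
  double n_ = 0.0;
};

D d;
A& a = d;  // unambiguous: there is only one A subobject in D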

If we do schema subsetting analysis at entity class generation time we can take that into account and handle it fine.

Agreed. That would be a benefit of using an inheritance approach. It would probably make handling optional fields much easier too.

I am wondering whether the constructor approach would hit any issues with "I create a new entity of this type" vs. "I just use a constructor to write this entity on this connection - but please don't create a new entity, I want the same Arcs entity written".

I don't think that would be a problem. The code snippet for Foo_input1 above omitted a bunch of stuff for brevity, including a default no-argument constructor. There would be no ambiguity when slicing an entity as to which constructor is executed.

So I think the conclusion is that inheritance is a good approach for C++, and assuming it's also good for Kotlin we should go with that.

piotrswigon commented 4 years ago

Thank you Myk :)

piotrswigon commented 4 years ago

Good point regarding wasm modules. Let's just limit ourselves to per-particle analysis, @alxrsngrtn.

piotrswigon commented 4 years ago

Update: we will focus on the initial issue, which is making anonymous and multi-named schemas work. The follow-up problem of reflecting the type lattice will be handled at a later stage.

@mykmartin may have some update on the latter problem.