Consider use of ATD for the definition of Catala types representation in the different backends

AltGr commented 7 months ago

This PR has some issues related to this.

In several contexts, we want clearly defined datatypes for interchange: types of arguments and returned values when compiled Catala code is used as a library, exploration of Catala execution traces, etc. This need is already apparent in the explanation backends and in the way our OCaml runtime has built-in JSON output.

ATD is "a language for defining data types across multiple programming languages and multiple data formats." It is mature, used in the wild, and seems actively maintained. The syntax is simple and OCaml-like, and it allows annotations for specific backends. These definitions can be compiled into type definitions for OCaml, Python, Java, TypeScript, etc. that we could reuse in our different backends.

In addition, ATD comes with generators that provide I/O functions for the defined types in the various backends, both through JSON and biniou (a custom, more efficient binary format). Annotations can be used to customise the representation (e.g. whether dicts or associative lists should be used).

A point that may be interesting for our explanations web app is that JSON schemas can also be generated.

A way to leverage it could be to have Catala generate, alongside its output source code in the backend target language, an ATD file that would be compiled to the expected type definitions for the same language. We can then call the appropriate atdgen command and use the resulting type definitions (or embed the relevant part of atdgen and run it as a library from Catala).
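A rough sketch of what this could look like on the Python side, assuming a hypothetical Catala structure `Individual` with two integer fields. Both the ATD definition (shown as a comment) and the class are hand-written for illustration; this is not actual atdgen output, and a real generator may name or shape things differently.

```python
import json
from dataclasses import dataclass
from typing import Any

# Hypothetical ATD definition that Catala could emit next to the generated code:
#
#   type individual = {
#     income_cents: int;
#     nb_children: int;
#   }


@dataclass
class Individual:
    """Hand-written stand-in for the class a generator would derive from the ATD file."""

    income_cents: int
    nb_children: int

    @classmethod
    def from_json(cls, x: Any) -> "Individual":
        # Reject anything that is not a JSON object before reading fields.
        if not isinstance(x, dict):
            raise ValueError(f"expected a JSON object, got {type(x).__name__}")
        return cls(
            income_cents=int(x["income_cents"]),
            nb_children=int(x["nb_children"]),
        )

    def to_json(self) -> dict:
        return {"income_cents": self.income_cents, "nb_children": self.nb_children}


# Round trip: JSON text -> typed value -> JSON text.
person = Individual.from_json(json.loads('{"income_cents": 250000, "nb_children": 2}'))
round_trip = json.dumps(person.to_json(), sort_keys=True)
```

The same ATD file would be the single source for the equivalent definitions in the other backends, which is what makes the JSON produced by one side readable by another.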

It could be interesting for the user to be able to tweak these files with annotations, comments, or custom data validators, but that leaves open the question of how we can synchronise them.

AltGr commented 7 months ago

This triggered another idea: if conversion functions to a common format are provided for the runtime objects of all our backends, this can be used as a gateway to easily convert between them. We could leverage this as a "cheap" FFI, for example to be able to call functions defined as externals from other backends than the one currently in use.

A concrete use-case could be: a user is only interested in the Python backend and needs to define a module with a few external functions. At the moment, unless they rewrite their externals in OCaml in addition to Python, they wouldn't be able to use the interpreter anymore (and even then, one has to trust that the implementations really are equivalent). With this trick, the interpreter could use the conversion of the runtime objects provided by ATD to translate the arguments, feed them into the actual Python code, and convert the result back to the interpreter.

Of course, this wouldn't be very efficient, but for such use-cases it shouldn't really matter.
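To make the idea concrete, here is a minimal sketch of the Python end of such a gateway. Everything in it is an assumption made for illustration: the external `round_to_euros`, the `EXTERNALS` registry, and the request shape `{"function": ..., "args": [...]}`. The interpreter side is simulated in-process; in practice it would be a separate program exchanging the same JSON text.

```python
import json
from typing import Callable, Dict


def round_to_euros(amount_cents: int) -> int:
    """Hypothetical external, implemented only in Python by the user."""
    return (amount_cents + 50) // 100 * 100


# Registry of externals exposed through the gateway (names are illustrative).
EXTERNALS: Dict[str, Callable[..., object]] = {"round_to_euros": round_to_euros}


def call_external(request: str) -> str:
    """Decode a JSON request, call the matching Python external, encode the result."""
    msg = json.loads(request)
    result = EXTERNALS[msg["function"]](*msg["args"])
    return json.dumps({"result": result})


# What the interpreter would send and receive, simulated here in one process.
reply = call_external(json.dumps({"function": "round_to_euros", "args": [12345]}))
```

Each call pays for one serialisation and one deserialisation in each direction, which is the inefficiency mentioned above.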

EmileRolley commented 6 months ago

> A way to leverage it could be to have Catala generate, alongside its output source code in the backend target language, an ATD file that would be compiled to the expected type definitions for the same language. We can then call the appropriate atdgen command and use the resulting type definitions (or embed the relevant part of atdgen and run it as a library from Catala).

Sounds nice! However, ATD currently supports only Java, Scala, TypeScript, Python and OCaml. This means that to add backends you'll have to contribute to ATD, no?

> It could be interesting for the user to be able to tweak these files with annotations, comments, or custom data validators, but that leaves open the question of how we can synchronise them.

I'm not sure why there would be a need to modify the generated files 🤔

AltGr commented 6 months ago

> Sounds nice! However, ATD currently supports only Java, Scala, TypeScript, Python and OCaml. This means that to add backends you'll have to contribute to ATD, no?

Indeed, that would be best, but it would remain possible to write a custom conversion for our types for these backends, just as we are doing now. Cases vary, for example:

- an R backend might be easily contributed as it would probably be very close to the Python one
- a C backend, however, needs many implementation tricks and won't be canonical (besides needing allocations, etc.), so it may remain outside the scope of ATD. We can still use our custom type implementations for this case (manually writing JSON input/output in a format compatible with our ATD specification if needed).
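For the manual route, the hand-written encoder only has to follow the JSON conventions used on the ATD side. The sketch below assumes the usual atdgen convention for variants (a case without argument is a JSON string, a case with an argument is a two-element array `["Tag", payload]`), which should be checked against the ATD documentation. It is written in Python for brevity, and the enumeration `Ineligible` / `Eligible(amount)` is hypothetical.

```python
import json
from typing import Any, Optional, Tuple


def encode_variant(tag: str, payload: Optional[Any] = None) -> Any:
    """Encode an enum case: "Tag" when there is no payload, ["Tag", payload] otherwise."""
    return tag if payload is None else [tag, payload]


def decode_variant(x: Any) -> Tuple[str, Optional[Any]]:
    """Inverse of encode_variant; returns (tag, payload) with payload None if absent."""
    if isinstance(x, str):
        return (x, None)
    if isinstance(x, list) and len(x) == 2 and isinstance(x[0], str):
        return (x[0], x[1])
    raise ValueError(f"not a variant encoding: {x!r}")


# Hypothetical Catala enumeration with cases Ineligible and Eligible(amount).
wire = json.dumps([encode_variant("Ineligible"), encode_variant("Eligible", 1200)])
decoded = [decode_variant(v) for v in json.loads(wire)]
```

A C implementation would emit and parse the same two shapes, so values could cross between a hand-written backend and an ATD-generated one.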

> I'm not sure why there would be a need to modify the generated files 🤔

Modifying generated files directly sure is problematic without a complex update system, but custom data validators, in particular, sound like a functionality we may want to leverage. I am not sure what could be done here, but it's probably worth looking into what they provide.

EmileRolley commented 6 months ago

> Modifying generated files directly sure is problematic without a complex update system, but custom data validators, in particular, sound like a functionality we may want to leverage.

My point is that if there is a need for custom data validators, they could be generated automatically alongside the type definitions; or, if it's very specific, it's up to the end-user program to take care of it, isn't it?
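A minimal sketch of the first option, where a validator is emitted next to the type and run at deserialisation time. The `Household` type, its fields and the checks are hypothetical, and the code is hand-written to stand in for generated output. The `validate` parameter shows how an end-user program could still substitute its own check without editing the generated file.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Household:
    nb_adults: int
    nb_children: int


def validate_household(h: Household) -> List[str]:
    """Stand-in for a generated validator: returns the list of violated constraints."""
    errors = []
    if h.nb_adults < 1:
        errors.append("nb_adults must be at least 1")
    if h.nb_children < 0:
        errors.append("nb_children must not be negative")
    return errors


def household_from_json(
    x: Any, validate: Callable[[Household], List[str]] = validate_household
) -> Household:
    """Build a Household from decoded JSON and reject it if validation fails."""
    h = Household(nb_adults=int(x["nb_adults"]), nb_children=int(x["nb_children"]))
    errors = validate(h)
    if errors:
        raise ValueError("; ".join(errors))
    return h


ok = household_from_json({"nb_adults": 2, "nb_children": 1})
try:
    household_from_json({"nb_adults": 0, "nb_children": -1})
    rejected = None
except ValueError as e:
    rejected = str(e)
```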

CatalaLang / catala

Consider use of ATD for the definition of Catala types representation in the different backends #585