Avro arrays: `null` vs `[]`

JoepvanGenuchten commented 2 years ago

Avro supports empty arrays, so this should be reflected in the transformation logic.

bartkl commented 2 years ago

I think the first step would be to check if this is not already functioning as desired. I expect Lancaster to handle this automatically for us, but of course it might be that we're not calling the library in the right way for that edge case. So again: needs to be checked!

JoepvanGenuchten commented 2 years ago

We need to test if this is correct: will an empty array validate against the schema?

bartkl commented 2 years ago

It's also important to consider what the consequences of our choices are here: an optional array is different from a required array that is empty.

Discuss this with Sjoerd.

bartkl commented 2 years ago

I've given this some thought.

Open world vs closed world

Let's take an example to make things easier. Say you have a sensor and measurements. In the open world, if no statements are made about the measurements of a specific sensor, we don't know how many (if any) there are. On the other hand, if we state the measurements to be an empty array, we know there are none. However, SHACL and Avro assume a closed world, in which these two concepts are equal: having no statements about measurements (which corresponds to Avro's null) is equivalent to stating there are no measurements (an empty array).
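To make the distinction concrete, here is a minimal Python sketch. The record layout and both function names are made up for illustration; they are not part of metamorph, SHACL or Avro.

```python
from typing import Optional


def count_closed_world(record: dict) -> int:
    """Closed world: no statement about measurements means there are none."""
    return len(record.get("measurements") or [])


def count_open_world(record: dict) -> Optional[int]:
    """Open world: no statement means the count is unknown (None)."""
    measurements = record.get("measurements")
    return None if measurements is None else len(measurements)


# Hypothetical sensor records: one without a statement, one with an empty array.
unstated = {"sensor": "s1", "measurements": None}
empty = {"sensor": "s1", "measurements": []}
```

Under the closed-world reading both records yield a count of zero, while the open-world reading keeps "unknown" and "zero" apart.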

Bad idea: custom null semantics

Note that some might argue that null could be chosen to represent the fact that this sensor doesn't support measuring (please ignore the practical absurdity of that example), whereas the empty array would express that there are no measurements. This choice is indeed possible, and it would mean a difference between the two even in a closed world. It is a bad idea though. First of all, it gives null a special meaning that differs from the default one (i.e. expressing absence). Also, that meaning is implicit unless it is made explicit in some way, which is confusing and maintenance-heavy. The fact of the matter is that two pieces of information are being conflated into a single property here, where it would be semantically preferable to separate them into two fields: isMeasurable and measurements. That way, the desired explicitness and separation of concerns are secured.
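A rough sketch of that separation in Python. The `Sensor` class is hypothetical, and the field names are the snake_case equivalents of isMeasurable and measurements.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sensor:
    name: str
    # Explicit capability flag, so null does not need a second meaning.
    is_measurable: bool
    # Always a list; an empty list simply means "no measurements".
    measurements: List[float] = field(default_factory=list)


cannot_measure = Sensor("s1", is_measurable=False)
nothing_measured_yet = Sensor("s2", is_measurable=True)
```

Both sensors carry an empty `measurements` list; what sets them apart is stated explicitly in `is_measurable` rather than encoded in null.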

Consequences

Assuming that reassigning meaning to null is a bad idea, and knowing we are in a closed-world context here, null and [] express the same thing: a quantity of zero and, in the context of relationships, a cardinality of zero.

Note: because an Avro array can always be empty, there is no way to express a cardinality range of 1-many. This is by design of Avro, but it should be better reflected in the transformation specification on the wiki and in the code (particularly the cardinality mapping).

So, if semantically there is no difference in this context, why does it matter how we map array cardinality? There could be consequences for the user of the schema. For instance, if a field has a nullable array type, this allows leaving out the field altogether, whereas a plain array type would require the field to be present at all times. This influences validation.
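For reference, the two field shapes under discussion look roughly like this, written as Python dicts that serialize to Avro schema JSON. The `conforms` helper is a deliberately simplified stand-in for a real validator (it only knows null, double, arrays and unions), and the field name and item type are assumptions borrowed from the sensor example. It checks values only; whether a field may be omitted from a datum altogether depends on the library in use and still needs to be verified.

```python
import json

ARRAY_OF_DOUBLE = {"type": "array", "items": "double"}

# Required array: the value must be an array, which may be empty.
required_field = {"name": "measurements", "type": ARRAY_OF_DOUBLE}

# Nullable array: a union with null, defaulting to null.
nullable_field = {
    "name": "measurements",
    "type": ["null", ARRAY_OF_DOUBLE],
    "default": None,
}


def conforms(value, avro_type) -> bool:
    """Simplified check; not a replacement for a real Avro validator."""
    if isinstance(avro_type, list):  # union: any branch may match
        return any(conforms(value, branch) for branch in avro_type)
    if avro_type == "null":
        return value is None
    if avro_type == "double":
        return isinstance(value, float)
    if isinstance(avro_type, dict) and avro_type.get("type") == "array":
        return isinstance(value, list) and all(
            conforms(item, avro_type["items"]) for item in value
        )
    raise ValueError(f"unsupported type in this sketch: {avro_type!r}")


print(json.dumps(nullable_field, indent=2))
```

In this model `[]` is accepted by both shapes, while `None` is accepted only by the nullable one.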

Way forward

My preference is as follows:

If a cardinality of zero is allowed/expected, the nullable array type communicates this in a more transparent way.
In the case of cardinality zero, leaving out fields is generally fine. To require the presence of that field even then is a custom validation requirement that should not be part of our schema generation.
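One way this preference could translate into a cardinality mapping is sketched below. `avro_type_for` is a hypothetical helper, not metamorph's actual mapping code, and it assumes cardinalities arrive as a minimum and an optional maximum.

```python
from typing import Optional


def avro_type_for(min_count: int, max_count: Optional[int], item_type: str):
    """Map a cardinality range to an Avro type; max_count=None means unbounded."""
    if min_count < 0 or (max_count is not None and max_count < max(min_count, 1)):
        raise ValueError("invalid cardinality range")
    if max_count == 1:
        # Single value: optional becomes a union with null.
        return item_type if min_count == 1 else ["null", item_type]
    array_type = {"type": "array", "items": item_type}
    # Avro arrays carry no length constraints, so a lower bound of 1 or more
    # (and any upper bound above 1) cannot be enforced by the schema itself.
    return array_type if min_count >= 1 else ["null", array_type]
```

With this mapping, a range of 0-many becomes a nullable array and 1-many a plain array, with the caveat from the comment that the schema still accepts `[]` for the latter.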

Note: I have made some assumptions about how Avro works here, particularly regarding "leaving out fields of cardinality zero" and perhaps also how it validates "empty arrays versus absent values" (as Joep pointed out).

bartkl / metamorph

Avro arrays: `null` vs `[]` #37