Closed JoepvanGenuchten closed 2 years ago
I think the first step would be to check if this is not already functioning as desired. I expect Lancaster to handle this automatically for us, but of course it might be that we're not calling the library in the right way for that edge case. So again: needs to be checked!
We need to test if this is correct: will an empty array validate against the schema?
It's also important to consider what the consequences of our choices are here: an optional array is different from a required array that is empty.
Discuss this with Sjoerd.
I've given this some thought.
Open world vs closed world
Let's take an example to make things easier. Say you have a sensor and measurements. In the open world, if no statements are made about the measurements of a specific sensor, we don't know how many (if any) there are. On the other hand, if we state the measurements to be an empty array, we know there are none. However, SHACL and Avro assume a closed world, in which these two concepts are equal: having no statements about measurements (which is equivalent to Avro's null) is equivalent to stating there are no measurements (an empty array).
Bad idea: custom null semantics
Note that some might argue that null could be chosen to represent the fact that this sensor doesn't support measuring (please ignore the practical absurdity of that example), whereas the empty array would express that there are no measurements. This choice is indeed possible, and would mean a difference between the two even in a closed world. It is a bad idea though. First of all, it gives null a special meaning that differs from the default one (i.e. expressing absence). Also, that meaning is not explicit, so it has to be made explicit in some way, which is confusing and maintenance-heavy. The fact of the matter is that two pieces of information are being conflated into a single property here, where it would be semantically preferable to separate them into two fields: isMeasurable and measurements. That way, the desired explicitness and separation of concerns are secured.
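A minimal sketch of that separation, in Python purely for illustration. The Sensor class and the snake_case names is_measurable and measurements are hypothetical stand-ins for the two fields named above; nothing here is taken from the actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sensor:
    # Hypothetical record: capability and data live in separate fields,
    # so a missing value never has to carry a second meaning.
    is_measurable: bool
    measurements: List[float] = field(default_factory=list)


# A sensor that can measure but has produced nothing yet.
idle = Sensor(is_measurable=True)

# A sensor that cannot measure at all; its list is simply empty as well.
passive = Sensor(is_measurable=False)

# The two sensors hold identical (empty) data and differ only in the flag.
print(idle.measurements == passive.measurements)
print(idle.is_measurable, passive.is_measurable)
```

With this shape, "cannot measure" and "has no measurements" are answered by different fields, so neither needs special null handling.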
Consequences
Assuming that reassigning meaning to null is a bad idea, and knowing we are in the context of the closed world here, null and [] express the same thing, namely a quantity of zero, and in the context of relationships a cardinality of zero.
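The closed-world reading can be sketched as a small helper. This is an illustration only: the function name cardinality is hypothetical, and Python's None stands in for Avro's null.

```python
from typing import List, Optional


def cardinality(values: Optional[List[float]]) -> int:
    # Closed-world reading: an absent value (None, standing in for Avro's
    # null) and an empty list both mean "zero items".
    return 0 if values is None else len(values)


print(cardinality(None) == cardinality([]))
```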
Note: because of the possibility of an empty array, there is no way to express a cardinality range of 1-many. This is by design of Avro, but it should be better reflected in the transformation specification on the wiki and in the code (particularly the cardinality mapping).
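To make the lost lower bound concrete, here is a rough sketch of what a cardinality mapping could look like. The function avro_type_for and its parameters are hypothetical and do not reflect the project's actual transformation rules; the only Avro fact relied on is that an array type declares its items and nothing about a minimum length.

```python
from typing import Any, Optional


def avro_type_for(item_type: str, min_count: int, max_count: Optional[int]) -> Any:
    """Hypothetical mapping from a (min, max) cardinality to an Avro type.

    max_count=None means "many". Illustration only, not the real mapping.
    """
    if max_count == 1:
        # 0..1 becomes a nullable union, 1..1 the plain type.
        return item_type if min_count >= 1 else ["null", item_type]
    # Avro arrays carry no minimum length, so 0..many and 1..many
    # end up as the same type and the lower bound is lost.
    return {"type": "array", "items": item_type}


print(avro_type_for("double", 0, None) == avro_type_for("double", 1, None))
```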
So, if semantically there is no difference in this context, what does it matter which way we map array cardinality? There could be consequences for the user of the schema. For instance, if a field has a nullable array type, this allows leaving out the field altogether, whereas a plain array type would require the field to be present at all times. This influences validation.
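The difference can be sketched with the two candidate field types and a toy check. The two type declarations are ordinary Avro (a union with null, and a plain array), but the accepts function is a hand-written stand-in that encodes the assumption made in this thread; it is not Avro's validator. Whether a real Avro library lets a nullable field be omitted may also depend on the field declaring a default, so this still needs to be verified.

```python
from typing import Any, Dict

# Two ways to declare the same field; both are valid Avro type declarations.
NULLABLE_ARRAY = ["null", {"type": "array", "items": "double"}]
REQUIRED_ARRAY = {"type": "array", "items": "double"}


def accepts(field_type: Any, record: Dict[str, Any], name: str) -> bool:
    """Toy check, NOT Avro's validator: encodes the assumption from this thread."""
    nullable = isinstance(field_type, list) and "null" in field_type
    if name not in record or record[name] is None:
        # Assumption: only a nullable field may be absent or null.
        return nullable
    # Any list, including the empty one, counts as an array here;
    # item types are deliberately not checked in this sketch.
    return isinstance(record[name], list)


print(accepts(NULLABLE_ARRAY, {}, "measurements"))
print(accepts(REQUIRED_ARRAY, {}, "measurements"))
print(accepts(REQUIRED_ARRAY, {"measurements": []}, "measurements"))
```

Under these assumed rules, both declarations accept an empty array, and only the nullable one tolerates a missing field.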
Way forward
My preference is as follows:
Note: I have made some assumptions about how Avro works here, particularly regarding "leaving out fields of cardinality zero" and perhaps also how it validates "empty arrays versus absent values" (as Joep pointed out).
Avro supports empty arrays, so this should be reflected in the transformation logic.