atomicdata-dev / atomic-data-docs

Atomic Data is a specification to make it easier to exchange data.
https://docs.atomicdata.dev
MIT License

Generic Array Datatype #127

Open theduke opened 1 year ago

theduke commented 1 year ago

Currently there is only a resource-array datatype, which requires nested resources whenever a property has multiple values.

Often, though, I want a property with multiple plain values.

Reasons:

Imagine you want an array of ints or strings.

So there should be a datatype for "array of type T".

Defining the nested type would run into similar issues as #126 though.

joepio commented 1 year ago

I always felt like this was bound to come up at some point. I think you're right, we probably need an Array datatype.

I think that if a Property has the Array datatype, it should also indicate which types of elements are supported. Maybe it has a second datatype, namely innerDatatype, which refers to the shape of the items in the array (e.g. String or Integer).
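To make the idea concrete, here is a minimal Rust sketch of a Property that carries an optional inner datatype next to its main datatype. Everything in it (the Datatype and Value enums, the inner_datatype field name, the validate method) is a hypothetical illustration, not part of the Atomic Data spec or any existing library, and it deliberately ignores nested arrays.

```rust
/// Hypothetical datatype identifiers; the names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub enum Datatype {
    String,
    Integer,
    Array,
}

/// A plain value as it might appear in a resource (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    String(String),
    Integer(i64),
    Array(Vec<Value>),
}

/// A Property whose `inner_datatype` is only meaningful when `datatype` is `Array`.
pub struct Property {
    pub shortname: String,
    pub datatype: Datatype,
    pub inner_datatype: Option<Datatype>,
}

/// Returns true if `value` is a non-array value of the given datatype.
fn matches_scalar(datatype: &Datatype, value: &Value) -> bool {
    matches!(
        (datatype, value),
        (Datatype::String, Value::String(_)) | (Datatype::Integer, Value::Integer(_))
    )
}

impl Property {
    /// Checks a value against the datatype and, for arrays, the inner datatype.
    pub fn validate(&self, value: &Value) -> Result<(), String> {
        match (&self.datatype, value) {
            (Datatype::Array, Value::Array(items)) => {
                // An array property without an inner datatype is rejected here;
                // a real design might instead treat it as "array of anything".
                let inner = self.inner_datatype.as_ref().ok_or_else(|| {
                    format!("array property '{}' has no inner datatype", self.shortname)
                })?;
                for (index, item) in items.iter().enumerate() {
                    if !matches_scalar(inner, item) {
                        return Err(format!(
                            "item {} of '{}' is not of type {:?}",
                            index, self.shortname, inner
                        ));
                    }
                }
                Ok(())
            }
            (datatype, value) if matches_scalar(datatype, value) => Ok(()),
            _ => Err(format!(
                "value for '{}' does not match datatype {:?}",
                self.shortname, self.datatype
            )),
        }
    }
}
```

With this shape, a "tags" property with datatype Array and inner datatype String would accept an array of strings and reject an array containing an integer.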

theduke commented 1 year ago

This brings up an interesting modeling problem.

How do you express "array of integers" in the schema?

This is actually the more general problem of "how to refine types".

I see several solutions, all of them with downsides.

Additional Properties on Property Resources

A property of type atomicdata.dev/datatypes/array could use an atomicdata.dev/properties/array-item-type property to specify the expected type of array items.

The big downside here is that the schema would not make it apparent that this property is expected or required as a refinement of the array datatype. That makes the schema more cryptic and implementations more complicated.

It's also more complex to "unify" and compare schema types, since libraries now need to understand the array-item-type property and convert it into an Array<T> type for processing.
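As a rough sketch of the extra work this puts on libraries, the conversion could look like the Rust below. The property resource is modelled as a plain map from property URL to value; the atomicdata.dev/properties/datatype key, the ResolvedType enum and the resolve_type function are assumptions made for illustration only.

```rust
use std::collections::HashMap;

// The array URLs are the ones proposed above; DATATYPE_PROP is an assumed name.
const DATATYPE_PROP: &str = "atomicdata.dev/properties/datatype";
const ARRAY_ITEM_TYPE_PROP: &str = "atomicdata.dev/properties/array-item-type";
const ARRAY_DATATYPE: &str = "atomicdata.dev/datatypes/array";

/// The type a library would work with internally after reading the schema.
#[derive(Debug, Clone, PartialEq)]
pub enum ResolvedType {
    /// Any non-array datatype, identified by its URL.
    Scalar(String),
    /// The equivalent of Array<T>.
    Array(Box<ResolvedType>),
}

/// Turns a property resource (property URL -> value) into a ResolvedType.
pub fn resolve_type(property: &HashMap<String, String>) -> Result<ResolvedType, String> {
    let datatype = property
        .get(DATATYPE_PROP)
        .ok_or("property has no datatype")?;

    if datatype.as_str() != ARRAY_DATATYPE {
        return Ok(ResolvedType::Scalar(datatype.clone()));
    }

    // Nothing in the schema itself says this property is required for arrays,
    // so the library has to know about it and check for it explicitly.
    let item_type = property
        .get(ARRAY_ITEM_TYPE_PROP)
        .ok_or("array property is missing array-item-type")?;

    Ok(ResolvedType::Array(Box::new(ResolvedType::Scalar(
        item_type.clone(),
    ))))
}
```

The explicit check for the missing array-item-type is exactly the special knowledge that the schema does not express on its own.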

Custom Datatypes

Have something like a ../classes/ArrayType class, which requires the array-item-type property.

Properties can then specify their type (usually with a nested resource, probably) as an ArrayType.

The downside here is that libraries now have to understand what an ArrayType means, and need code to unify different ArrayType definitions into an Array<T> type for things like queries, filters, etc. (as above).
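A sketch of what that unification might involve: two ArrayType resources with different subjects but the same item type should compare as the same Array<T>. The TypeRef, ArrayTypeResource and Normalized names below are hypothetical, the store is just an in-memory map keyed by subject, and cyclic definitions are not handled.

```rust
use std::collections::HashMap;

/// How a property could point at its type in this approach (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub enum TypeRef {
    /// A built-in datatype, identified by URL.
    Builtin(String),
    /// An instance of the ArrayType class, identified by its subject.
    ArrayType(String),
}

/// Hypothetical instance of the ArrayType class; it only carries array-item-type.
pub struct ArrayTypeResource {
    pub item_type: TypeRef,
}

/// Structural form used for comparing types in queries and filters.
#[derive(Debug, Clone, PartialEq)]
pub enum Normalized {
    Scalar(String),
    Array(Box<Normalized>),
}

/// Resolves a TypeRef to its structural form, or None if a resource is missing.
/// Note: a cyclic ArrayType definition would recurse forever in this sketch.
pub fn normalize(
    type_ref: &TypeRef,
    store: &HashMap<String, ArrayTypeResource>,
) -> Option<Normalized> {
    match type_ref {
        TypeRef::Builtin(url) => Some(Normalized::Scalar(url.clone())),
        TypeRef::ArrayType(subject) => {
            let resource = store.get(subject)?;
            let item = normalize(&resource.item_type, store)?;
            Some(Normalized::Array(Box::new(item)))
        }
    }
}
```

Two properties are then comparable by normalizing both type references and checking the results for equality, regardless of which ArrayType resource each one points at.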

Express Types With a Core Type System

In my factordb implementation I went in a somewhat different direction.

I don't allow defining arbitrary datatypes. Types have to be expressed in terms of the built-in core type system.

A simplified definition of the core types in Rust looks a bit like this:

pub enum ValueType {
    Const(Value),

    Any,

    Unit,

    Bool,
    Int {
        min: Option<i64>,
        max: Option<i64>,
    },
    UInt {
        min: Option<u64>,
        max: Option<u64>,
    },
    Float {
        min: Option<f64>,
        max: Option<f64>,
    },
    String {
        min_length: Option<u64>,
        max_length: Option<u64>,
        regex_validators: Option<Vec<String>>,
    },
    Bytes {
        min_length: Option<u64>,
        max_length: Option<u64>,
    },

    // Containers.
    List {
        item_type: Box<Self>,
        min_length: Option<u64>,
        max_length: Option<u64>,
    },

    /// A mapping from keys to values
    Map {
        key_type: Box<Self>,
        value_type: Box<Self>,
    },

    /// A structured object type (see ObjectType).
    Object(ObjectType),

    /// An anonymous union of different types.
    Union(Vec<Self>),
    /// Tagged union (aka sum type / ADT)
    Variant(VariantType),

    /// Reference (aka foreign key) pointing to another entity
    Reference {
        /// Restrict the allowed entity types.
        allowed_types: Option<HashSet<Ident>>,
    },

    /// A custom data type.
    Named(Ident),
}

Properties can either specify a concrete ValueType as their type (serialized as a nested object), or a custom datatype, but custom datatype entities essentially only provide a named definition for a specific ValueType.
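To illustrate why this keeps clients simple, here is a small sketch of checking a value against a type, including resolution of a named type to its underlying definition. It uses a trimmed-down copy of the enum above with String names instead of Ident; the Value enum, the conforms function, the map of named definitions, and the choice to count string length in characters are all assumptions for this example and are not taken from factordb.

```rust
use std::collections::HashMap;

/// Simplified runtime value (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    String(String),
    List(Vec<Value>),
}

/// A trimmed-down subset of the core types shown above.
#[derive(Debug, Clone)]
pub enum ValueType {
    Any,
    Int { min: Option<i64>, max: Option<i64> },
    String { max_length: Option<u64> },
    List { item_type: Box<ValueType>, max_length: Option<u64> },
    /// A custom data type: just a name for another ValueType.
    Named(String),
}

/// Returns true if `value` satisfies `ty`. `named` maps custom type names to
/// their definitions. Unknown names never match; cyclic names are not handled.
pub fn conforms(value: &Value, ty: &ValueType, named: &HashMap<String, ValueType>) -> bool {
    match (ty, value) {
        (ValueType::Any, _) => true,
        (ValueType::Int { min, max }, Value::Int(n)) => {
            min.map_or(true, |lo| *n >= lo) && max.map_or(true, |hi| *n <= hi)
        }
        (ValueType::String { max_length }, Value::String(s)) => {
            max_length.map_or(true, |limit| (s.chars().count() as u64) <= limit)
        }
        (ValueType::List { item_type, max_length }, Value::List(items)) => {
            max_length.map_or(true, |limit| (items.len() as u64) <= limit)
                && items.iter().all(|item| conforms(item, item_type, named))
        }
        (ValueType::Named(name), _) => named
            .get(name)
            .map_or(false, |definition| conforms(value, definition, named)),
        _ => false,
    }
}
```

A named type such as "Percentages" could then be defined once as a List of Int with min 0 and max 100, and any client that knows the core types can validate values against it without understanding anything specific to that name.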

The main advantage here is that clients will always be able to understand and work with all data.

More complex types can always be expressed in terms of this core schema, and worst case they can just use a bytes array or string for arbitrary serialization.

(Including things like ObjectType or Map here is probably very debatable, because they are hard to express in something like a triple/quad format and might be better expressed with something like nested resources, but I don't have that yet.)