boudicca-events / boudicca.events

Event Aggregation/Publishing System
https://boudicca.events
GNU General Public License v3.0

Structured Data #367

Open kadhonn opened 3 months ago

kadhonn commented 3 months ago

Over the last few weeks we had multiple discussions about features that need more structure than our simple string->string map provides, mainly lists, i18n, and embedding richer structure such as JSON to combine lists and objects. We never arrived at a really satisfactory answer on how to do this well, so to start the discussion, here are my thoughts and reasoning so far.

Structured Data

Motivation

So, why/where do we need to care about the different data formats/structures?

Our architecture is structured in three parts: Collectors, Core, and Publishers. Collectors and Publishers will mostly be specialized and, to some degree, have to agree on the same property names/formats anyway to work together, so they are not the primary concern here. But our Core, especially the Search Service, is conceptually as generic as possible, meaning it should work with pretty much all property names/formats. And for things like searching, sorting, faceting, ... to work correctly, we need to know how to handle the different data structures.

Some of our features need more structure to be fully supported. I have identified these so far:

  1. Lists. Sometimes a property will have more than one value, for example our concert.bandlist property
  2. Variants. A property can have multiple variants of its value, for example i18n properties which have different translations
  3. Dates. We have dates in our data, startDate and endDate for now, but there may be others as well. This is important for sorting
  4. (Maybe?) Numbers. Maybe we want to support numbers specifically? Also important for sorting
  5. (Maybe?) Structured Data. This one is kinda weird and could be anything? I am not sure how to generalize this. Maybe we should call this blob or something as a marker to ignore normal handling and treat it as opaque?
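
To make the five kinds above a bit more concrete, here is a minimal Kotlin sketch of how they could be modelled. Every name in it (PropertyValue, Text, ListValue, Variants, DateValue, NumberValue, Blob, searchableTexts) is hypothetical and not part of the existing code, and using OffsetDateTime for dates is only an assumption.

```kotlin
import java.time.OffsetDateTime

// Hypothetical model of the value kinds listed above; not existing boudicca code.
sealed interface PropertyValue

data class Text(val value: String) : PropertyValue
data class ListValue(val items: List<String>) : PropertyValue
data class Variants(val byVariant: Map<String, String>) : PropertyValue
data class DateValue(val value: OffsetDateTime) : PropertyValue // date type is an assumption
data class NumberValue(val value: Double) : PropertyValue
data class Blob(val raw: String) : PropertyValue

// Returns the strings a generic text search would look at for one value.
fun searchableTexts(value: PropertyValue): List<String> = when (value) {
    is Text -> listOf(value.value)
    is ListValue -> value.items
    is Variants -> value.byVariant.values.toList()
    is DateValue -> listOf(value.value.toString())
    is NumberValue -> listOf(value.value.toString())
    is Blob -> emptyList() // opaque marker: skipped by normal handling
}
```

The point of the sketch is only that a generic service needs to know which kind a value is before it can search or sort it; how that kind gets conveyed is the open question below.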

Possible solutions

I can think of three ways to convey structure:

  1. Hardcode it. The search service simply has to know beforehand how the data is to be handled, via code or some configuration files. This would be about the same knowledge the Collectors and Publishers already have, and it would make the service not generic at all. At that point we simply have a normal schema, so we could just as well use ical or any other standard instead of making our own, so I do not prefer this.
  2. Guess it. Look at the data and/or query and try to guess what kind of data it has to be. This is what is currently done, and it is error-prone because structures can overlap, like a normal text beginning with a "[" which could be interpreted as a list.
  3. Metadata. We could also add some metadata to the event itself, which publishers and core services could look at. One example would be keeping the normal property name and adding a meta-property named ?name or name$schema which contains the structure of the data, maybe a simple date or list. Another example is what OpenStreetMap does for its i18n support by adding metadata to the property name itself, like name:cz and name:en for different locales. You could do this for lists as well. We could of course embed this metadata in the value of the property itself, but that makes everything a bit more complicated to parse, especially because you would need special handling for simple text as well.
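
The difference between options 2 and 3 can be sketched in Kotlin as follows. This is only an illustration under assumptions: the function and enum names are made up, and the name$schema meta-property with the values "list" and "date" is just one of the naming ideas mentioned above, not an agreed format.

```kotlin
// Hypothetical sketch; none of these names exist in the current code.

// Option 2 ("guess it"): infer a list purely from the shape of the value.
fun looksLikeList(value: String): Boolean =
    value.startsWith("[") && value.endsWith("]")

enum class PropertyType { TEXT, LIST, DATE }

// Option 3 ("metadata"): read the type from a companion property,
// assumed here to be called "<name>$schema".
fun declaredType(event: Map<String, String>, name: String): PropertyType =
    when (event["${name}\$schema"]) {
        "list" -> PropertyType.LIST
        "date" -> PropertyType.DATE
        else -> PropertyType.TEXT // no metadata present: treat as plain text
    }
```

With an event name like "[Live] Jazz Night [sold out]", looksLikeList reports a list even though the value is plain text, while declaredType stays at TEXT unless a name$schema entry says otherwise.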

Soo, that is how far I am with this topic, what are your thoughts?

twatzl commented 2 months ago

I think there are multiple things to split up here:

Datatype:

Variants

And then there are variants like:

description:en
description:de

which might be nested, like description:markdown:en

for datatypes

My suggestion (not sure if it's a good one) would be to extend the EntryDTO to support the datatypes as follows:

existing one

typealias Entry = Map<String, String>

// suggested extension: one map per datatype, so the datatype of a
// property is known from the map it is stored in
data class ComplexEntry(
    val keyValues: Map<String, String>,
    val lists: Map<String, String>,
    val blobs: Map<String, String>,
    val dates: Map<String, String>,
    val times: Map<String, String>
)

That way the search service could know the data type, but without having to specify which field has which datatype.

The downside would be that eventcollectors would have to put every property into the correct datatype and support everything correctly. But we could provide a nice API/utilities to hide that from eventcollector programmers.
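
Such a utility could look roughly like the builder below. It is only a sketch: ComplexEntryBuilder and its methods are hypothetical, and both the newline-separated list encoding and the ISO offset date format are assumptions, since the encoding inside the maps is not decided here.

```kotlin
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

// ComplexEntry as suggested above.
data class ComplexEntry(
    val keyValues: Map<String, String>,
    val lists: Map<String, String>,
    val blobs: Map<String, String>,
    val dates: Map<String, String>,
    val times: Map<String, String>
)

// Hypothetical helper so collector authors never pick the map themselves.
class ComplexEntryBuilder {
    private val keyValues = mutableMapOf<String, String>()
    private val lists = mutableMapOf<String, String>()
    private val dates = mutableMapOf<String, String>()

    fun text(key: String, value: String): ComplexEntryBuilder =
        apply { keyValues[key] = value }

    // Assumption: list items are joined with newlines.
    fun list(key: String, items: List<String>): ComplexEntryBuilder =
        apply { lists[key] = items.joinToString("\n") }

    // Assumption: dates are stored as ISO-8601 strings with offset.
    fun date(key: String, value: OffsetDateTime): ComplexEntryBuilder =
        apply { dates[key] = value.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME) }

    // blobs and times are left empty to keep the sketch short.
    fun build(): ComplexEntry = ComplexEntry(
        keyValues = keyValues.toMap(),
        lists = lists.toMap(),
        blobs = emptyMap(),
        dates = dates.toMap(),
        times = emptyMap()
    )
}
```

A collector would then write something like ComplexEntryBuilder().text("name", "Summer Stage").list("concert.bandlist", listOf("Band A", "Band B")).build() and not care which map each value ends up in.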

for variants

My suggestion for variant handling would be that the search service just drops everything after the first colon in the key and, when doing operations, applies them to all variants of the key.

That way, e.g. if you search for description contains "xyz", all languages of the description would be searched automatically.
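
A minimal Kotlin sketch of this idea, assuming events are still plain string->string maps. The function names are hypothetical, and matching case-insensitively is an assumption made only for the example.

```kotlin
// Base name of a key: everything before the first colon.
// Keys without a colon are returned unchanged.
fun baseKey(key: String): String = key.substringBefore(':')

// "key contains text", applied to every variant of the key.
// Assumption: matching ignores case.
fun containsInAnyVariant(event: Map<String, String>, key: String, text: String): Boolean =
    event.any { (k, v) -> baseKey(k) == key && v.contains(text, ignoreCase = true) }
```

For an event with description:en and description:markdown:de, containsInAnyVariant(event, "description", "xyz") checks both values, because both keys reduce to the base key description.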