alpheios-project / documentation

Alpheios Developer Documentation

Annotation Data Design #40

Open balmas opened 3 years ago

balmas commented 3 years ago

Related discussions: #33 #38 #24

See also https://github.com/alpheios-project/documentation/blob/master/development/lex-domain-design.tsv which defines some domain requirements for creation of data in the data store

Data design needs to accommodate:

balmas commented 3 years ago

In domain-driven design there is a repository pattern that doesn't apply perfectly to our use cases, but I think we need something like it. I think we need a place where we can aggregate the results of individual resource queries (instantiated as Alpheios data model objects) and reach back into it as needed to recompose the data objects we make available for the user to view and annotate.

kirlat commented 3 years ago

I think that based on all of the above we can conclude that lexical data and annotations are different domain contexts and should be kept separate as much as possible. We do, however, need a context mapping between those two contexts. The lexical data would be the supplier of the information and the annotation data would be a consumer.

I think something like the repository pattern may work well. There could be two repositories: one for lexical data and one for annotations. Code that does not use annotations would pull data from the lexical data repository (get lexical data for a specific word). Annotations-aware code would pull data from the annotation repository (annotations for a specific word). The annotation data object's code would then get what it needs from the lexical data repository and combine information from the two repositories.

The lexical data should be assembled in a way that makes it possible to track how the data was combined. It should allow the data to be recombined in a different way at any moment. The same can be said about annotations.
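To make the two-repository idea concrete, here is a minimal sketch in plain JavaScript. All class and method names (LexicalDataRepository, AnnotationsRepository, getWordData(), etc.) are hypothetical, not existing Alpheios code:

```javascript
// Hypothetical sketch of the two-repository idea; names and shapes are
// assumptions, not the actual Alpheios API.
class LexicalDataRepository {
  constructor (adapter) {
    this._adapter = adapter // fetches raw lexical data for a word
    this._cache = new Map() // word -> aggregated lexical data
  }

  getWordData (word) {
    if (!this._cache.has(word)) {
      // On a cache miss, pull data from the adapter and keep it
      this._cache.set(word, this._adapter.query(word))
    }
    return this._cache.get(word)
  }
}

class AnnotationsRepository {
  constructor () {
    this._annotations = new Map() // word -> array of annotation records
  }

  getAnnotations (word) {
    return this._annotations.get(word) || []
  }

  addAnnotation (word, annotation) {
    const list = this._annotations.get(word) || []
    this._annotations.set(word, list.concat(annotation))
  }
}

// Annotations-aware code pulls from both repositories and combines the results
function getAnnotatedWord (word, lexRepo, annRepo) {
  return {
    word,
    lexicalData: lexRepo.getWordData(word),
    annotations: annRepo.getAnnotations(word)
  }
}
```

The point of the split is that annotation-free code only ever touches the lexical repository, while the combination step lives entirely on the consumer side.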

kirlat commented 3 years ago

I've tried to apply the principles of DDD to everything that was said above, and here is what I came up with:

[image: architecture diagram]

Let's consider two workflows of retrieving data: one by a client that needs lexical data only (as we do currently), and the other by a client that expects it to be enriched with annotation data. Please note that I'm using the terms in their Domain Driven Design (DDD) meanings.

Workflow without annotations: Initiated by the user input, the presentation layer calls a getLexeme() (should probably be a getWord()) method of the Lexical Data App Service (LDAS). This is something that Lexis does currently. LDAS checks the Lexical Data Repository (LDR) for a word. We don't have a repository object now, but we need it for history support. If the word is not in the repository, the LDR will go to the corresponding client adapters to obtain data, create a Word Aggregate out of it, store it internally, and return it to the LDAS service. The LDAS will transform the Word Aggregate (i.e. combine data from different sources, disambiguate it) and will create a Word DTO (Data Transfer Object). The Word DTO will be returned to the presentation layer, which will display it to the user. We might have several DTOs, each serving a specific purpose.

The Word Aggregate is the object that stores lexical data from all sources separately, without transforming it in any way other than mapping source values such as grammatical features into the values of our domain context. The mapping would be a responsibility of client adapters (they do that currently). The lexical data from the sources would be combined using either the methods of the Word Aggregate or the LDAS service.

The change we need to make to match this model is to move the logic to combine data from different lexical sources out of the client adapters into the LDAS.
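As an illustration of that change, here is a rough sketch of an LDAS that does the combining itself. The repository shape, the method names, and the trivial merge rule are all assumptions for illustration:

```javascript
// Sketch of the annotation-free workflow under assumed names (LDAS, LDR).
class LexicalDataAppService {
  constructor (repository) {
    this._repository = repository // the Lexical Data Repository (LDR)
  }

  getWord (word) {
    // The LDR returns a Word Aggregate, fetching from adapters on a cache miss
    const aggregate = this._repository.getWordAggregate(word)
    // The service, not the client adapters, combines data from different sources
    return this._toWordDTO(aggregate)
  }

  _toWordDTO (aggregate) {
    // Naive combination rule for illustration: flatten per-source data
    // into a single view for the presentation layer
    const merged = Object.assign({}, ...aggregate.sources.map(s => s.data))
    return { word: aggregate.word, ...merged }
  }
}
```

Because the aggregate keeps each source's data separate, a different `_toWordDTO()` could recombine the same aggregate in another way without re-querying the adapters.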

Workflow with annotations: Initiated by the user input, the annotations-aware presentation layer calls a getWord() method of the Annotated Lexical Data App Service (ALDAS). ALDAS checks the Lexical Data Repository for a word and the Annotations Data Repository (ADR) for the annotations data. If the word and the annotations are not in the repositories, it will go to the corresponding client adapters to obtain data. It will create a Word Aggregate and put it into the LDR. It will create one or several annotation aggregates and will store them in the ADR. All aggregates will be returned to the ALDAS service. The ALDAS will combine the Word Aggregate and the Annotations Aggregate data in accordance with the getWord() options and will create an Annotated Word DTO. The Annotated Word DTO will be returned to the presentation layer, which will display it to the user.
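A minimal sketch of the ALDAS side, again with hypothetical names and a deliberately simple combination step:

```javascript
// Hypothetical sketch of the annotation-aware workflow (ALDAS); the names
// and the combination rule are assumptions, not the actual API.
class AnnotatedLexicalDataAppService {
  constructor (lexicalRepo, annotationsRepo) {
    this._lexicalRepo = lexicalRepo       // the LDR
    this._annotationsRepo = annotationsRepo // the ADR
  }

  getWord (word, options = {}) {
    // Each repository fetches from its own client adapters on a cache miss
    const wordAggregate = this._lexicalRepo.getWordAggregate(word)
    const annotations = this._annotationsRepo.getAnnotations(word)
    // Combine the two aggregates into an Annotated Word DTO per the options
    const dto = { word: wordAggregate.word, lexemes: wordAggregate.lexemes }
    if (options.withAnnotations) {
      dto.annotations = annotations
    }
    return dto
  }
}
```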

With an architecture like the above we will keep data from different lexical sources separate and will be able to recombine it in different ways. Along the way we would store some transformation history data describing what sources were used to produce the final data and what transformations were applied. That will allow us to display that information within the presentation layer.

Does the above make sense?

Right now our lexical objects from data models (Lexeme, Lemma, Inflections) have traits of both aggregates (because they have domain business logic such as disambiguation methods) and of DTOs (because their purpose is to be used in the presentation layer). I'm not sure how it's best to deal with them. Should we put them into the aggregate and treat them as entities and value objects? That will require some changes that may make them incompatible with the other clients that are using them. Should we treat them as DTOs (because I think that's what they actually are now, functionally) and ignore the business logic functionality? That would require creating lexical entities and aggregates from scratch, but would give us much more freedom. Or should we do something else? What do you think?

balmas commented 3 years ago

I think this is about right.

The change we need to make to match this model is to move the logic to combine data from different lexical sources out of the client adapters into the LDAS.

A small point, but I think technically this logic to combine data from different lexical sources currently lives in a combination of the lexis module and the data model objects (e.g. Homonym, etc.)

I need to think a bit about the question about the lexical objects and the DTOs.

kirlat commented 3 years ago

I would like to summarize what I think we can do in order to support annotations. Here is a detailed diagram showing all the architectural components and the workflow of getting the word data:

[image: architecture and workflow diagram]

The object that stores lexical data is the Word. It is an aggregate root, a container that stores lexemes. Each lexeme contains definitions and inflections (I'm listing only the most important items here). Each inflection contains features. Words, lexemes, definitions, and inflections are entities; they have their own unique IDs that are based on content. Assertions, negations, and comments are entities too. All other objects are value objects. All the objects mentioned above would be domain objects, existing only within the domain context.

The rule for aggregate roots is that there should be no references from the outside to the objects that are stored inside the root. All changes to the objects within the aggregate root (i.e. the Word) should be done only through the methods of the Word. I think we can satisfy this requirement, so the Word can hold instances of lexemes rather than ID references. If lexemes were to be accessed from the outside, they would have to be stored in their own repository, and the Word object would keep their IDs, not direct references.

Keeping object instances within the Word object is the simplest solution, so I think we should go with it unless it ceases to satisfy our requirements.
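To illustrate the encapsulation rule, here is a toy Word aggregate that keeps lexeme instances private and hands out copies only. The class shape is an assumption, not the planned implementation:

```javascript
// Toy Word aggregate root: lexeme instances are internal, and all changes
// go through methods of the Word.
class Word {
  constructor (word, language) {
    this.word = word
    this.language = language
    this._lexemes = [] // internal instances, never handed out directly
  }

  addLexeme (lexemeData) {
    // The only way to modify the aggregate's contents
    this._lexemes.push({ ...lexemeData })
  }

  get lexemeCount () {
    return this._lexemes.length
  }

  // Return shallow copies so callers cannot hold references to internals
  getLexemes () {
    return this._lexemes.map(l => ({ ...l }))
  }
}
```

Mutating a returned copy leaves the aggregate's internal state untouched, which is exactly the "no references from the outside" guarantee.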

Actually, the Word may not need to hold the lexeme objects at all. It may store the results of the lexical queries returned by the client adapters in the form of plain JSON-like objects. But the Word could provide a method with options that, when called, will return a set of lexemes (a DTO) constructed out of those lexical query results according to the user options provided to the method. These lexeme DTOs will then be used to show lexical data to the user in the UI. The lexeme DTOs might be cached using a combination of word, language, context, and options as a key in order to avoid duplicate method calls. DTOs exist in the presentation layer.
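The caching idea could be sketched like this; the composite key scheme and the getOrBuild() helper are assumptions for illustration:

```javascript
// Sketch of DTO caching keyed by (word, language, context, options).
class LexemeDTOCache {
  constructor () {
    this._cache = new Map()
  }

  static key (word, language, context, options) {
    // Simple composite key; assumes serializable, order-stable inputs
    return JSON.stringify([word, language, context, options])
  }

  getOrBuild (word, language, context, options, build) {
    const k = LexemeDTOCache.key(word, language, context, options)
    if (!this._cache.has(k)) {
      this._cache.set(k, build()) // construct lexeme DTOs only on a cache miss
    }
    return this._cache.get(k)
  }
}
```

The same word with different options produces a different key, so differently recombined DTOs can coexist in the cache.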

I think our current lexical objects (Lexeme, Inflection, Definition, etc.) are better suited for the role of DTOs because that's what they were created for: to display lexical data to the user. So if we need corresponding domain objects, they should be represented by different, newly designed classes.

The data retrieval workflow could work the following way.

When the user selects a word in the UI, the presentation layer (the Vue component) sends a request to the application layer, represented by the Lexis module. The Lexis module then asks the Word Repository to return the Word object. If the requested word is in the repository, it is returned to the Lexis module. If not, the Word Repository creates an empty Word object and returns that empty object to Lexis.

Once created, the Word object initiates queries to client adapters and/or GraphQL interfaces to obtain all the lexical and annotation data required for its existence. What exactly is needed is determined by the word data (word, language, context), which is passed as parameters to the Word constructor. References to client adapters and GraphQL interfaces (so that the Word knows where to obtain the data) are passed to the Word constructor as well.

Once each piece of lexical or annotation data is retrieved, the Word object fires a domain event. The Lexis and Annotations modules listen to those events and update flags in the Vuex store. The tracking structure in Vuex could be an object with the following shape:

{
  wordID1: {
    lexicalDataUpdated: dateTimeValue,
    annotationsDataUpdated: dateTimeValue
  },
  wordID2: {
    lexicalDataUpdated: dateTimeValue,
    annotationsDataUpdated: dateTimeValue
  }
}

The presentation layer (Vue components) tracks changes in those flags and, when the changes affect the data it displays, it calls a method on Lexis or Annotations to return the updated DTOs.
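The event-to-flags flow might look roughly like this, with a plain object standing in for the Vuex store state and a toy emitter standing in for domain events (all names are assumptions):

```javascript
// Toy domain-event emitter; in the real app this role would be played by
// the Word object's domain events.
class WordEvents {
  constructor () {
    this._handlers = {}
  }

  on (event, handler) {
    this._handlers[event] = (this._handlers[event] || []).concat(handler)
  }

  emit (event, payload) {
    const handlers = this._handlers[event] || []
    handlers.forEach(h => h(payload))
  }
}

// Stand-in for the Vuex store state holding the per-word update flags
const store = { wordUpdates: {} }

// Lexis / Annotations modules subscribe and update the tracking flags
function trackWord (events) {
  events.on('lexicalDataUpdated', ({ wordID, at }) => {
    store.wordUpdates[wordID] = { ...store.wordUpdates[wordID], lexicalDataUpdated: at }
  })
  events.on('annotationsDataUpdated', ({ wordID, at }) => {
    store.wordUpdates[wordID] = { ...store.wordUpdates[wordID], annotationsDataUpdated: at }
  })
}
```

In the real app, Vue components would watch these store fields and re-pull DTOs when the timestamps change.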

How does the data flow from the Word object to the presentation layer? The method of the Word object (something like getLexemes()) returns a DTO with the lexemes (or an object that is converted to the DTO in the Lexis and/or Annotations). The lexemes are processed (disambiguated, changed according to annotations) to form the output DTO. How this processing is done depends on the values of the options provided to the getLexemes() method. As a result, different options applied to the same Word would yield DTOs containing different data. The application layer (the Lexis and Annotations modules) then passes the DTOs to the presentation layer.

How would data updates be handled in this model? Updates are simpler, in a way. Only annotations can be updated, not the lexical data itself. The update of annotations, however, may affect the resulting lexical data DTOs, so the Vue components that display those DTOs would need to pull the updated data.

In order to update an annotation, the Vue component in the presentation layer uses a method of the Annotations module from the application layer. Upon receiving such a request, the Annotations module gets the corresponding Word object from the Word Repository and executes its updateAnnotations() method. The Word object updates the annotations data internally and sends a request to the annotations adapter or the annotations GraphQL API to update the data on the remote server. The Word object also publishes data update events to notify the code modules that display this word's data that there was an update and the modified data has to be pulled.
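A sketch of that update path, with a stub standing in for the annotations adapter / GraphQL API and invented method names:

```javascript
// Toy annotated Word: updates its internal state, persists the change
// remotely, then publishes an update event for displaying modules.
class AnnotatedWord {
  constructor (word, remoteApi) {
    this.word = word
    this._remoteApi = remoteApi // annotations adapter or GraphQL API stub
    this._annotations = []
    this._listeners = []
  }

  onAnnotationsUpdate (listener) {
    this._listeners.push(listener)
  }

  updateAnnotations (change) {
    this._annotations.push(change)           // 1. update internal state
    this._remoteApi.save(this.word, change)  // 2. persist on the remote server
    // 3. notify displaying modules that modified data has to be pulled
    this._listeners.forEach(l => l({ word: this.word, change }))
  }
}
```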

This is how this process looks on the diagram:

[image: annotation update workflow diagram]

There should be specialized methods to change, add, or remove each type of annotation. Each method's argument should be a specialized DTO containing the word ID and the data describing the change to be made to the annotations. So we'll need multiple annotation input DTOs, each for a specific type of operation.
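For illustration, here are hypothetical input DTOs and a dispatcher; the operation names and field shapes are invented, not a proposed API:

```javascript
// Hypothetical input DTOs, one per annotation operation type
const addCommentDTO = (wordID, text) =>
  ({ op: 'addComment', wordID, text })
const removeAnnotationDTO = (wordID, annotationID) =>
  ({ op: 'removeAnnotation', wordID, annotationID })

// Routes each input DTO to the specialized handler for its operation
function applyAnnotationChange (dto, handlers) {
  const handler = handlers[dto.op]
  if (!handler) {
    throw new Error(`Unknown annotation operation: ${dto.op}`)
  }
  return handler(dto)
}
```

Keeping one DTO shape per operation means each handler can validate exactly the fields it needs.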

That's what I think might work for the purpose. It should be flexible and extendable, but I've also tried not to overcomplicate it. It can reuse many objects of the existing infrastructure and requires only a small number of new objects to be created. Some details are probably missing, but I think we'll be able to figure them out once this goes into implementation.

I would greatly appreciate your feedback on this.

balmas commented 3 years ago

I think this approach makes perfect sense.

Actually, the Word may not need to hold the lexeme objects at all. It may store the results of the lexical queries returned by the client adapters in the form of plain JSON-like objects. But the Word could provide a method with options that, when called, will return a set of lexemes (a DTO) constructed out of those lexical query results according to the user options provided to the method. These lexeme DTOs will then be used to show lexical data to the user in the UI. The lexeme DTOs might be cached using a combination of word, language, context, and options as a key in order to avoid duplicate method calls. DTOs exist in the presentation layer.

This is an important point. One of the difficulties we have right now, when we have only one source of annotations impacting the DTOs, is that there are interdependencies between the components of a DTO that need to be taken into account in order to construct the DTO that is displayed to the user. For example, an inflection can impact a decision about whether a lexeme is equivalent to another and needs to be merged with it. As we increase the number of data sources, the possible permutations will only grow. I think that storing the results from adapter queries as plain (but normalized) JSON objects within the Word repository would probably make it easier to deal with this.
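As a toy example of such an interdependency, here is a merge decision over normalized JSON lexeme records that depends on their inflections. The equivalence rule is invented purely for illustration:

```javascript
// Toy rule: two lexeme records are mergeable only if they share a lemma
// AND at least one part of speech across their inflections.
function lexemesMergeable (a, b) {
  if (a.lemma !== b.lemma) {
    return false
  }
  const partsA = new Set(a.inflections.map(i => i.partOfSpeech))
  return b.inflections.some(i => partsA.has(i.partOfSpeech))
}

// Fold a list of normalized lexeme records into merged lexemes
function mergeLexemes (lexemes) {
  const merged = []
  for (const lexeme of lexemes) {
    const target = merged.find(m => lexemesMergeable(m, lexeme))
    if (target) {
      target.inflections = target.inflections.concat(lexeme.inflections)
    } else {
      merged.push({ lemma: lexeme.lemma, inflections: [...lexeme.inflections] })
    }
  }
  return merged
}
```

Because the inputs stay as plain normalized records, a different merge rule can be applied to the same stored data later without re-querying the adapters.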