Annotation Data Design - Githubissues

alpheios-project / documentation

Alpheios Developer Documentation

0 stars 0 forks source link

Annotation Data Design #40

Open balmas opened 3 years ago

balmas commented 3 years ago

Related discussions: #33 #38 #24

See also https://github.com/alpheios-project/documentation/blob/master/development/lex-domain-design.tsv which defines some domain requirements for creation of data in the data store

Data design needs to accommodate:

linking of all known Alpheios (both user and alpheios supplied) data about a lexical entity
a wide variety of lexical entity types
a wide variety of languages
ability for data to be public or private
ability to consume data from and contribute data to LOD sources such as the the Lila Lemma Bank (https://lila-erc.eu/sparql/) the Latin Wordnet (https://latinwordnet.exeter.ac.uk/), the Greek Wordnet etc.
standard lexical ontologies
user contribution of both data entities and relationships between entities
user comments on any entity, relationship or comment
ability to filter data by a wide variety of criteria (including its provenance)

balmas commented 3 years ago

Proposed design:

Lexical entities are nodes with properties specific to the entity type
Relationships are represented as edges between nodes; the edges themselves can (1) carry additional properties and (2) be nodes in other relationships
Distinct relationships should only be expressed once (i.e. there is no reason to say wordX hasLemma lemmaY more than once) but assertions of the correct/incorrect nature of that relationship can be many, are specific to the user making the assertion, may be public or private. It will be up to the consumer to decide the meaning of multiple assertions that disagree based upon characteristics such as who asserted it, how many other people asserted it, what degree of confidence a user has assigned to an assertion, etc.
A relationship requires an assertion of its truth (or falseness) to be published and used.

annotationdatamodelgraph

annotationentities

(Entity details still very much a work in progress)

balmas commented 3 years ago

Some of the known issues with the current Alpheios morphology service engines can serve as a set of use cases for annotations on morphological data.

Reference: alpheios-project/morphsvc#38

Summary: Whitaker Engine of Morph Service reports the lemma of 'afore' as 'afore'. While it's possible that 'afore' is an accepted lemma variant of 'absum' our inflection table and full definition for this verb is keyed off of the lemma 'absum' as the "canonical" lemma

Entity Nodes:

[Lemma:afore]
[Lemma:absum]
[User:alpheios]

Edges:

[Lemma:afore isLemmaVariant Lemma:absum]@prefer=Lemma:absum
[User:alpheios assertsTrue [afore isLemmaVariant absum]]

Sample Query: https://gist.github.com/balmas/e7e0e6bc16f2501f3ca06f7462203f70 Sample Response: https://gist.github.com/balmas/ff9ae018feaccfb8fbda4dff618bf4a8

Reference: alpheios-project/morphsvc#29

Summary: Whitaker Engine of Morph Service is missing the identification of the vocative case as a possible inflection of the form senatu of the lemma senatus

Entity Nodes:

[Lemma:senatus]
[Inflection:Form=senat-u|Case=vocative]
[User:alpheios]

Edges:

[Inflection:Form=senat-u|Case=vocative canBeInflectionOf Lemma:senatus]
[User:alpheios assertsTrue [Inflection:Form=senat-u|Case=vocative canBeInflectionOf Lemma:senatus]]

Sample Query: https://gist.github.com/balmas/f6e55dc3b3551a60d034ef131798ba4d Sample Response: https://gist.github.com/balmas/623f26e6dc5abbb43e5646b6658bdfd8

Reference: alpheios-project/morpheus#28

Summary: Morpheus Engine of Morph Service parses τίνος with the Lemma τίς specifying pos of irregular (along with a parse of a demonstrative pronoun). The irregular lemma should be the interrogative pronoun τίς with one genitive singular inflection

Entity Nodes:

[Lemma:τίς|pos=X]
[Lemma:τίς|pos=PRON]
[Word:τίνος]
[Inflection:Form=τίνος|Case=Genitive|Number=Singular]
[User:alpheios]

Edges:

[Word:τίνος hasLemma Lemma:τίς|pos=X]
[Word:τίνος hasLemma:τίς|pos=PRON]
[Inflection:Form=τίνος|Case=Genitive|Number=Singular canBeInflectionOf Lemma:τίς|pos=PRON]
[User:alpheios assertsFalse [Word:τίνος hasLemma Lemma:τίς|pos=X]]
[User:alpheios assertsTrue [Word:τίνος hasLemma:τίς|pos=PRON]]
[User:alpheios assertsTrue [Inflection:Form=τίνος|Case=Genitive|Number=Singular canBeInflectionOf Lemma:τίς|pos=PRON]]

Sample Query: https://gist.github.com/balmas/ecc9db3da04fbf32d3e0f8efdf6b2774 Sample Response: https://gist.github.com/balmas/2639a0e14248e6da6cc98905cfd643cd

Reference: alpheios-project/morpheus#32

Summary: Morpheus Engine of Morph Service doesn't parse the word μεμνήμεθα because it only recognizes this word by the alternate spelling μεμνῄμεθα.

Entity Nodes:

[Word:μεμνῄμεθα]
[Word: μεμνήμεθα]
[User: alpheios]

Edges:

[Word:μεμνῄμεθα isSpellingVariant Word:μεμνήμεθα]
[User:alpheios assertsTrue [Word:μεμνῄμεθα isSpellingVariant Word:μεμνήμεθα]]

Sample Query: https://gist.github.com/balmas/f402883b85041e5227737509be6adce3 Sample Response: https://gist.github.com/balmas/e52f54a2f5e32adf92d8738c4f195dde

(Additional use cases from the morph bugs can be found at https://docs.google.com/spreadsheets/d/1ej-7dAntWQZVASg7aQp0P-PRo2u9Nkn3LYChclYtDVo/edit?usp=sharing)

irina060981 commented 3 years ago

I need more time to read and understand the idea, will work on it tomorrow.

kirlat commented 3 years ago

Need time to study this and mull over it as well

irina060981 commented 3 years ago

About the data structure:

If I understood right, the structure has two main entities - Word (I believe that in terms of Alpheios extension it is TargetWord) and User, all other entities are around them:

lexical data with relationships from various source
alignment data
comments data

And here I have some questions:

A User could have the following roles according to Words:
- a word could be inside his WordList (owner type1)
- a word could be inside Aligned Groups in Alignment (owner type 2)
- a word (or some sub-data connected to this word) could be the target of his comments (contributor)
How this roles are defined in the Model? Where would be defined user's rights for the words, tokens and comments? May be it is worth to add additional entity - UserRole - and attach a userRole to appropriate type/domain of data? Such a division could create an early data separation for various requests and helps with perfomance.
We have an important property of each word - language, in some word sub-entities we could have even ywo defined languages. In future we could think about the fact that any User could have its own language (defined in browser for example). And we have some specific language separations - inflections, word usage and etc. And most times specific users would work only with a specific languages. What do you think about adding an additional division by language - to reduce an amount of all words to Word+Language pieces?
About Alignment structure data - the schema has a direct relation between Word and AlignedGroup, but in the application it has a middle object Token, that has a text property with the word . And Token could be included in several AlignedGroups. Do you suggest to reduce here relationships somehow?

According to the examples I think it is a worthy structure, but it is really difficult to see how it would work with all languages in GraphQL bparadigm for me.

balmas commented 3 years ago

These are really good questions.

If I understood right, the structure has two main entities - Word (I believe that in terms of Alpheios extension it is TargetWord) and User, all other entities are around them:

lexical data with relationships from various source

alignment data

comments data

I think it's not really true that Word and User are the main entities. There can be relationships that don't involve either of these -- for example inflectionA canBeInflectionOf lemmaA and so on.

The connection points to Alpheios applications will in many cases be specific to a User and a Word, but they are not the core of the data model.

User does have a somewhat special place in the model though, because it is through a User's assertion that a relationship between entities is True or False that makes the data usable.

But it isn't true that a user of Alpheios will only have access to data that is connected to the data that is asserted as true or false by their User id. We will have to give users control over what data they do and do not see, based upon who asserted it. And we also have to give users control over whether the data they create is available to other users. For now, we have decided that there are 2 possibilities: public or private. In the future it is very likely we will need to be able to express finer-grained levels of access - such as group-level, site-level etc. But to start we are going to support these two.

So for example, as the "alpheios.net" user, we may publish corrections to the results of the morphological parsers as annotations (these are the use cases I've described above). The assertions of their truth will be made by the "alpheios.net" user (exact identifier TBD) and they will be available to anyone.

On the client-side, a user will have the choice of which data to retrieve -- they will be able to say, for example, 'give me all data asserted by alpheios.net and by myself, and no other' or 'give me all data that is publicly asserted, but exclude data that is asserted by userx'. Or maybe even 'give me all data asserted by alpheios.net and myself, plus any data that is public and which has been asserted X number of times' (the implicit assumption being that the more people agree with a statement the more it is likely to be correct).

And here I have some questions:

A User could have the following roles according to Words:

a word could be inside his WordList (owner type1)

a word could be inside Aligned Groups in Alignment (owner type 2)

a word (or some sub-data connected to this word) could be the target of his comments (contributor)

How this roles are defined in the Model? Where would be defined user's rights for the words, tokens and comments? May be it is worth to add additional entity - UserRole - and attach a userRole to appropriate type/domain of data? Such a division could create an early data separation for various requests and helps with perfomance.

I need to think more about this. For the moment, there are 2 roles identified in the data model - (1) creator of data (the user that put the data into the database) and (2) subject or object of an assertion (as in User X assertsTrue Relationship Y). At the moment, I'm thinking that access restrictions would be tied only to the former, and the isPublic property on the assertion is what gates the access to it.

In this approach, it means that we CANNOT make the data or a query endpoint to it itself publicly available, all access would have to be gated through queries that enforce the access restriction rules. This is how the wordlist works currently although it's not a graph-based model.

There are certainly other roles with respect to data that it might be good to record, some of which get quite philosophical (who actually "created" the words of a text that is aligned? the author of the text?).

I should also explain that I don't right now at least envision making this database the source of truth for User information. I would like to keep that data separate, as it is much more sensitive. I think we will continue to use Auth0 for our User database, and at a minimum, just the opaque userId would be available in this new shared data store. Additional information about themselves that a user chooses to make public might be retrieved from the Auth0 database at runtime, or synced, depending upon performance. But I want to keep user-identifying data separate from this new data store as much as possible. We WILL have to make enhancements to our use of Auth0 to support this to allowing users to enter and edit profile information.

We have an important property of each word - language, in some word sub-entities we could have even ywo defined languages. In future we could think about the fact that any User could have its own language (defined in browser for example). And we have some specific language separations - inflections, word usage and etc. And most times specific users would work only with a specific languages. What do you think about adding an additional division by language - to reduce an amount of all words to Word+Language pieces?

I though a lot about this and I'm not sure what makes the most sense. For the initial modeling, I used a document property for language to limit the proliferation of document collections as I was still trying to figure out what they should be, but it could definitely be better to have individual collections of document per language. It's something that I think we'll have to analyze the performance of different queries to decide. The indexing options in Arangodb are pretty flexible and we can index on property values, but it does definitely introduce potential error into the data using a language property unless we enforce a schema on its values.

About Alignment structure data - the schema has a direct relation between Word and AlignedGroup, but in the application it has a middle object Token, that has a text property with the word . And Token could be included in several AlignedGroups. Do you suggest to reduce here relationships somehow?

I think Token probably does need to be represented because it isn't exactly the same as Word -- i.e. multiple Tokens could be combined to make up a single Word. In order to make that connection though, we would have to retain that information from the tokenizer. It's something that needs to be thought through more.

kirlat commented 3 years ago

Thanks for the diagrams and the detailed description. I have several small questions:

What does a Lemma below the other Lemma designate?

Does the lower lemmf is a lemma variant of the lemma above?

Can we say that connections between the nodes (edges) are not set once and for all but are flexible, as they can be edited, added, and changed in other ways during the lifetime of the lexical data piece? Would we store nodes and edges in a "disassembled" state with some links that represent "glue points" between them and then "assemble" them into a graph that will depend on the user dynamically (i.e. different users will be presented with a different version of a graph)? Or do we plan to build one huge graph and then store it in the database? Or would it be something in between? I think we cannot store everything in our own database and we have to provide some way to link to external resources that are outside of our control, such as Tufts morphology.
Will data access be based on a GraphQL API? I think it's really powerful and fits naturally into how the relationships are constructed.

I like the idea of separating nodes (entities) and edges (connections) very much. It creates a strong concept and a meaningful vocabulary to represent the concepts of handling the lexical data.

Many details of the implementation are not defined yet but I think there is already a solid foundation that we can use to move forward.

balmas commented 3 years ago

What does a Lemma below the other Lemma designate? ... Does the lower lemmf is a lemma variant of the lemma above?

Yes, in this case it's showing 2 separate Lemma nodes, connected by the isLemmaVariant edge

Can we say that connections between the nodes (edges) are not set once and for all but are flexible, as they can be edited, added, and changed in other ways during the lifetime of the lexical data piece?

I think we will need to have delete protection on anything that can be pointed at, which includes edges (that can function as nodes). Otherwise we will end up with meaningless assertions of the truth or falseness of a relationship. I think this may also mean that edges cannot be edited substantially once they are created, otherwise it would put into doubt the validity of the assertions that point at them. I think anything that can be pointed at might need to be in a frozen state once it participates in a relationship.

Would we store nodes and edges in a "disassembled" state with some links that represent "glue points" between them and then "assemble" them into a graph that will depend on the user dynamically (i.e. different users will be presented with a different version of a graph)?

Yes, I think the graphs will be built dynamically based upon the query.

Or do we plan to build one huge graph and then store it in the database? Or would it be something in between? I think we cannot store everything in our own database and we have to provide some way to link to external resources that are outside of our control, such as Tufts morphology.

Definitely we cannot store everything that might ever be part of the graph, but we need to store them once they become part of the graph. That is, it's the point at which someone annotates a relationship that is asserted by an external resource that that relationship (and the nodes in it) will get added to the database. When we have a persistent IRI for an external resource (which right now is rare) we should use it. We can also use properties to identify the original source of data (see for example the lemma properties in the sample query at https://gist.github.com/balmas/f6e55dc3b3551a60d034ef131798ba4d where I am specifying that the data I'm looking for annotations on has come from the "net.alpheios:tools:wordsxml.v1" source.)

lemma:{
          representation:"senatus",
          lang: "lat",
          pos: NOUN,
          source: "net.alpheios:tools:wordsxml.v1"
          principalParts: ["senatus"]
        },

Will data access be based on a GraphQL API? I think it's really powerful and fits naturally into how the relationships are constructed.

Yes, I have some naive first attempts at the api in the Gists I've linked to above in the sample use cases.

kirlat commented 3 years ago

I think we will need to have delete protection on anything that can be pointed at, which includes edges (that can function as nodes). Otherwise we will end up with meaningless assertions of the truth or falseness of a relationship. I think this may also mean that edges cannot be edited substantially once they are created, otherwise it would put into doubt the validity of the assertions that point at them. I think anything that can be pointed at might need to be in a frozen state once it participates in a relationship.

I'm afraid that the protection deletion would be pretty hard to manage. I've started to think if we could find a way around it without imposing and maintaining such restrictions.

I might be wrong, but it seems to me that the nodes are more "stable" pieces than the edges in the lexical data model. If there is a lexeme, or an inflection, their existence is probably a more-or-less reliable truth. The question usually arises around whether a particular inflection belongs to a particular lexeme, or several lexemes (it might in theory not belong to anything at all if it is considered incorrect), i.e. if there should be a relationship (an edge) between the one and the other. Someone may say that "A is an inflection of lexeme B". This statement not only asserts the relationship, but also establishes a relationship itself (creates an edge) between the "inflection A" and the "lexeme B" (if such edge was not already established by the other assertion before). The relationship is solely based on the assertion, if we can put it this way. If the assertion will be revoked, relationship should be destroyed too.

I'm wondering if, in situations like this, it would make sense to store relationship along with the corresponding assertions. If the assertion is edited, the relationship is edited too. If the assertion is destroyed, the relationship should cease to exist too.

It's like storing parts of a graph separately. When we construct a graph for a specific word, we can check if there are any relationships for its lexemes that are part of it, and if there are, a final graph will reflect that.

I think this will make management of relationships more flexible. If we want to remove or edit a relationship or an assertion, we can do it all in one place. There will be no need to "lock" the whole graph or the parts of it. I think this way it should be easier to manage the lexical data than when it's all in one complex graph.

Does such an approach make sense? What do you think?

kirlat commented 3 years ago

The service for the decentralized annotations publishing is dokieli. It uses, among other, the following technologies:

Linked Data Platform
RDFa
Web Annotation Data Model
JSON-LD
Turtle can be useful too. I think these technologies were created to solve problems that we're trying to solve. As those are established standards I think it would make sense for us to use these technologies whenever appropriate.

We might also look at Solid as it is within pretty much the same problem domain as wll.

balmas commented 3 years ago

We should also look at https://www.hypergraphql.org/ when designing the GraphQL API

balmas commented 3 years ago

https://linkeddatafragments.org/ is relevant to suggestions from both @irina060981 and @kirlat

balmas commented 3 years ago

annotationmodelgraphtake2

Revised sample use cases:

Reference: alpheios-project/morphsvc#38

LexicalEntity Nodes:

{ IRI:"https://alpheios.net/data/lemma/lat/afore" 
  type:"lemma", 
  lang: "lat",
  representation: "afore", 
  pos: "VERB", 
  source: "net.alpheios:tools:wordsxml.v1",
  creator: "net.alpheios"
}
{ IRI:"https://alpheios.net/data/lemma/lat/absum" 
  type: "lemma", 
  lang: "lat",
  representation: "absum", 
  pos: "VERB", 
  source: "net.alpheios:tools:wordsxml.v1",
  creator: "net.alpheios"}

LexicalEntityRelation Edges:

{ _from: "https://alpheios.net/data/lemma/lat/absum", 
    _to: "https://alpheios.net/data/lemma/lat/afore" 
    type: "isLemmaVariant" 
    creator:"net.alpheios"
    prefer: "https://alpheios.net/data/lemma/lat/absum"
}

User Collection (not part of the graph}

{ IRI: 'net.alpheios' }

Prototype GraphQL Query: https://gist.github.com/balmas/e7e0e6bc16f2501f3ca06f7462203f70 Prototype GraphQL Response: https://gist.github.com/balmas/ff9ae018feaccfb8fbda4dff618bf4a8

Reference: alpheios-project/morphsvc#29

Summary: Whitaker Engine of Morph Service is missing the identification of the vocative case as a possible inflection of the form senatu of the lemma senatus

Entity Nodes:

{ IRI:"https://alpheios.net/data/lemma/lat/senatus" 
  type:"lemma", 
  lang: "lat",
  representation: "senatus", 
  pos: "NOUN", 
  source: "net.alpheios:tools:wordsxml.v1"
  creator: "net.alpheios"}

{ IRI:"https://alpheios.net/data/infl/lat/senatuvoc" 
  type:"inflection", 
  lang: "lat",
  form:senatu", 
  stem: "senat", 
  suffix:"u", 
  udfeatures: { 
    Case: "vocative" 
  },
  source: "net.alpheios:tools:wordsxml.v1",
  creator: "net.alpheios"
}

Edges: LexicalEntityRelation Edges:

{ _from: "https://alpheios.net/data/infl/lat/senatuvoc", 
    _to: "https://alpheios.net/data/lemma/lat/senatus", 
    type: "isInflectionOf", 
    creator:"net.alpheios"
}

User Collection (not part of the graph}

{ IRI: 'net.alpheios' }

Sample Query: https://gist.github.com/balmas/f6e55dc3b3551a60d034ef131798ba4d Sample Response: https://gist.github.com/balmas/623f26e6dc5abbb43e5646b6658bdfd8

Reference: alpheios-project/morpheus#28

Entity Nodes:

{   IRI:"https://alpheios.net/data/lemma/grc/tisx"
    type: 'lemma',
    representation: 'τίς',
    lang: 'grc',
    pos: 'X',
    langpos: 'irregular',
    source: 'net.alpheios:tools:morpheus.v1',
    creator: 'net.alpheios',      
}

{   IRI:"https://alpheios.net/data/lemma/grc/tis"
    type: 'lemma',
    representation: 'τίς',
    lang: 'grc',
    pos: 'PRON',
    source: 'net.alpheios:tools:morpheus.v1',
    creator: 'net.alpheios',
}

{  IRI: "https://alpheios.net/data/inflections/grc/tinosgensing"
   type: 'inflection',
   form: "τίνος",
   stem: "τίνος",
   udfeatures: {
        Case: 'genitive',
        Number: 'singular'
    },
    xfeatures: {
        stemtype: 'inter',
        morphtype: 'enclitic indeclform'
    }
}

{   IRI: "https://alpheios.net/data/words/grc/tinos"
    type: 'word',
    representation: 'τίνος',
    lang: 'grc',
    creator: 'alpheios.net' 
}

LexicalRelation Edges:

{
   _from: "https://alpheios.net/data/lemma/grc/tisx",
    _to: "https://alpheios.net/data/words/grc/tinos",
   type: 'isNotLemmaOf',
   isPublic: true,
   confidence: 1,
   creator: 'net.alpheios',
}
{
   _from: "https://alpheios.net/data/lemma/grc/tis",
    _to: "https://alpheios.net/data/words/grc/tinos",
    type: 'isLemmaOf',
    isPublic: true,
    confidence: 1,
    creator: 'net.alpheios',
}
{
   _from: "ttps://alpheios.net/data/inflections/grc/tinosgensing",
    _to: "https://alpheios.net/data/lemma/grc/tis",
    type: 'isInflectionOf',
    isPublic: true,
    confidence: 1,
    creator: 'net.alpheios'
}

Sample Query: https://gist.github.com/balmas/ecc9db3da04fbf32d3e0f8efdf6b2774 Sample Response: https://gist.github.com/balmas/2639a0e14248e6da6cc98905cfd643cd

Reference: alpheios-project/morpheus#32

Summary: Morpheus Engine of Morph Service doesn't parse the word μεμνήμεθα because it only recognizes this word by the alternate spelling μεμνῄμεθα.

Entity Nodes:

  {
      IRI: 'https://alpheios.net/data/words/grc/memenealt1',
      type: 'word',
      representation: 'μεμνῄμεθα',
      lang: 'grc',
      createdBy: 'alpheios.net' 
}
{
      IRI: 'https://alpheios.net/data/words/grc/memenealt2"
      type: 'word',
      representation: 'μεμνήμεθα',
      lang: 'grc',
      createdBy: 'alpheios.net'     
}

LexicalRelation Edges:

{ 
  _from: "https://alpheios.net/data/words/grc/memenealt2",
   _to: "https://alpheios.net/data/words/grc/memenealt1"
   type: 'isSpellingVariant',
   isPublic: true,
   confidence: 1,
   creator:'net.alpheios'
}

Sample Query: https://gist.github.com/balmas/f402883b85041e5227737509be6adce3
Sample Response: https://gist.github.com/balmas/e52f54a2f5e32adf92d8738c4f195dde

@irina060981 and @kirlat thank you both for your feedback and for talking me off the complexity ledge :-)

Above is a revised approach to the data model, based upon your suggestions and the additional reading mentioned above.

A few things to point out:

1) In this scenario, we have the potential for many edges that essentially say the same thing, asserted by different authorities. I think ideally the query implementation would dedupe and aggregate them, but it will be up to the client to decide how to handle conflicting assertions.

2) I am not at all sure of what ontologies we will use for all of the properties of the nodes and relationship edges. It will likely be a combination of mostly existing ontologies and a few alpheios-specific vocabulary items. For the most part I think we will use dublin core terms, the ontolex ontology and olia but we will have to fill in some gaps.

3) the diagram tries to show the abstract concept of nodes and edges, examples of potential concrete but database-agnostic implementation, and database-specific details. (I always pack too much into my diagrams, I know). So, for example, in the implementation details, _key, _id, _from, _to are special properties in ArangoDB.

4) In response to the suggestion about separating by language, I would like to start by having language-specific collections of the node documents, but keeping the edge collections language agnostic. I will keep language properties on the node documents, however, because I believe they should be able to stand on their own outside of the database structure. If performance testing indicates that we need to further breakdown the edges by language we can, but I think since we are not normally starting queries from an edge, but a node, I'm not sure there would be a benefit in doing so.

5) I think conceptually, it makes sense to have separate edge collections for the nature of a relationship. So here I am proposing an edge collection for LexicalRelationships where the edge documents themselves would specify the type of the relationship in a property. We can index on relationship type to facilitate query performance, and if that isn't sufficient we can break it down further. I have also proposed an edge collection for "attestedAt" relationships, which are essentially the identification of a specific lexical entity in a specific context. We might also have one for the commentingOn relationship, one for the relationship between parts of alignment nodes, etc. Those all still need to be fleshed out.

6) I have greatly reduced the doubling up of edges as nodes, but not completely. I think we need to be able to support comments on edges. I think there are some clear separations of the different graph types, and so we can have a graph of comments and a graph of lexical entities and a graph of alignments and they can have intersection points but not all be in the same graph.

7) I have removed the User from the graph model for now, opting instead to use a reference to the user in the createdBy property of both Node and Edge documents. I want to be able to associate a User with both an edge and a node within the same graph, and there isn't a graph-like way to do that that I can think of that doesn't break the graph. It might also make sense to some point to have completely separate User collections for data (similar to the language-specific collections of nodes). In LDP terms, the User is probably more a container than either a node or an edge.

8) The change in data model in my prototype implementation hasn't really affected the GraphQL API at all. The handling of the negated lemma is a little different (it's an assertion on the word now rather than on the lemma which is maybe more accurate anyway).

9) apologies that the prototype details in the prototype JSON data objects above don't match exactly the prototype details in the diagram or the GraphQL output. As this is all just prototype for discussion, it's a little messy but I'm hoping you get the gist.

10) I still need to think about how we will handle versioning of the data.

11) finally, although not outlined above, here's a little sample of what a graph of a disambiguation scenario would look like (e.g. an assertion that a word in a particular context has a specific lemma and inflection):

0944ae62-6e07-43c5-92aa-31b9a9f4898d

Also, in case it helps with understanding this, my code for the ArangoDB prototype where I have been working through all of this is at https://github.com/alpheios-project/arangodb-svcs

kirlat commented 3 years ago

Thanks for the detailed description! I like the new model, I think it's much more flexible and extendable now. A few comments to it:

I am not at all sure of what ontologies we will use for all of the properties of the nodes and relationship edges. It will likely be a combination of mostly existing ontologies and a few alpheios-specific vocabulary items. For the most part I think we will use dublin core terms, the ontolex ontology and olia but we will have to fill in some gaps.

I think it does not matter what ontology do we use as long as we specify the ontology IRI along with the ontology terms. This will provide client with the reference to the ontology and make things non-ambiguous. I think this, even being more verbose, will give us flexibility to use any ontologies we want without any limitations.

In response to the suggestion about separating by language, I would like to start by having language-specific collections of the node documents, but keeping the edge collections language agnostic. I will keep language properties on the node documents, however, because I believe they should be able to stand on their own outside of the database structure. If performance testing indicates that we need to further breakdown the edges by language we can, but I think since we are not normally starting queries from an edge, but a node, I'm not sure there would be a benefit in doing so.

I agree with keeping the edges language agnostic. Edge is a connection between two nodes and it does not, on my opinion, "belong" to any language on itself as a word, an entity that caries some language-specific data, does.

I think we can use language-based indexes to group nodes into language-based collections. Edges can be included into such collections based on nodes it connects: if those nodes do belong to a certain language, edges can be included to collections for that language too.

I have greatly reduced the doubling up of edges as nodes, but not completely. I think we need to be able to support comments on edges. I think there are some clear separations of the different graph types, and so we can have a graph of comments and a graph of lexical entities and a graph of alignments and they can have intersection points but not all be in the same graph.

I think we can also have comments on edges being shown on the lexical entities graph as well if we will introduce edges as entities in the query. We can have something like:

word {
   isSpellingVariang { // An edge included into the diagram
        word { // Information on the related word
          representation
        }
        comment // A comment on the "isSpellingVariang" edge
   }
   representation // Information on the main word
}

I've seen this approach used in GraphQL queries on many occasions, including the popular Gatsby generator. We can do something along those lines.

I have removed the User from the graph model for now, opting instead to use a reference to the user in the createdBy property of both Node and Edge documents. I want to be able to associate a User with both an edge and a node within the same graph, and there isn't a graph-like way to do that that I can think of that doesn't break the graph.

I think we can add user information to the edge using the approach shown above. Extending the example above, we could have something like:

word {
   isSpellingVariang { // An edge included into the diagram
        word { // Information on the related word
          representation
        }
        comment // A comment on the "isSpellingVariang" edge
        user {
          name // A name of the user who asserted the relationship
        }
   }
   representation // Information on the main word
}

Or if there are multiple users, we can show an array of users using the plural users filed. It will contain an array of user entities instead of a singular user one.

I still need to think about how we will handle versioning of the data.

Maybe we can use a timestamp as a version? In that case, if we would like to assemble the latest version of a graph, we'll use entities with the most recent timestamp. If we would like to go back to the previous version, we can specify a certain point in time and assemble a graph from entries that have a timestemp below that date. Of course that would not work if we would need to establish specific snapshots that are not time-synced (such as a special version of a graph for some particular purpose).

kirlat commented 3 years ago

I'm thinking if our GraphQL queries may be simpler if we introduce edges into them For example, if to take a request from https://gist.github.com/balmas/e7e0e6bc16f2501f3ca06f7462203f70 and change it to something like:

query {
  wordAnnotations(...) {
     lemmaVariants { // Renamed from 'isLemmaVariant' to 'lemmaVariant' and used in plural form to make it more readable; this represents an edge requested (i.e. the type of relationships)
      lemma { // This is what information we want to get for this relationship
         IRI,
         lang,
         representation, 
         pos, 
         source
         creator
      }
    } 
  }
}

Then the response might be something like:

data: {
    wordAnnotations: {
      lemmaVariants [
         lemma { 
            IRI: "https://alpheios.net/data/lemma/lat/afore"  
            lang: "lat",
            representation: "afore", 
            pos: "VERB", 
            source: "net.alpheios:tools:wordsxml.v1",
            creator: "net.alpheios"
       },
      lemma {
         // Some other lemma variant
      }
    ] 
  }
}

I might mess up some syntax and details but hope the code above conveys the idea.

What do you think? Would it work for us? Would it make things simpler?

kirlat commented 3 years ago

I also have a question about query parameters of the sample query https://gist.github.com/balmas/e7e0e6bc16f2501f3ca06f7462203f70#file-gistfile1-txt-L3-L26 Do I understand correctly that their purpose is to set a filter on what word annotations to be returned?

I'm thinking if it would make sense to use a less formal, but a simpler approach and specify only the fields whose values would serve as an actual filter only? Something like:

wordAnnotations(
  word: "afore",
  lang: "lat",
  pos: VERB,
  ... // Other values that would serve as a filter
)

This would be consistent with usage examples I've seen and will make the query simpler. What do you think? Does it make sense?

balmas commented 3 years ago

I think we can also have comments on edges being shown on the lexical entities graph as well if we will introduce edges as entities in the query. We can have something like:

Yes, I agree -- this is the nice thing about the GraphQL api being separate from the database implementation. Even if in the database the comments are in a separate graph from the data they comment on, we can present them as a single graph in the GraphQL API.

balmas commented 3 years ago

Maybe we can use a timestamp as a version? In that case, if we would like to assemble the latest version of a graph, we'll use entities with the most recent timestamp. If we would like to go back to the previous version, we can specify a certain point in time and assemble a graph from entries that have a timestemp below that date. Of course that would not work if we would need to establish specific snapshots that are not time-synced (such as a special version of a graph for some particular purpose).

I agree timestamps make sense as a way to identify versions. I am not sure how many past versions of data we want to keep available in the graph though. I like the approach of having a non-versioned IRI for data that always returns the latest version, and referencing the prior versions in the the data (per the approach described in http://lrec-conf.org/workshops/lrec2018/W23/pdf/2_W23.pdf) but I don't think we will keep unlimited versions of all data points. This may not be too big of an issue though, because for the most part, at least for the lexical data, we are talking about very small data objects that likely won't change once created. We could also have different status for data such as draft and published, and only allow referencing published data.

balmas commented 3 years ago

I'm thinking if our GraphQL queries may be simpler if we introduce edges into them For example, if to take a request from https://gist.github.com/balmas/e7e0e6bc16f2501f3ca06f7462203f70 and change it to something like:

I think this is an interesting suggestion. The type of relationships (edges) that we might query will be many and will grow over time and the same input will feed into many of them. I think we can use variables in GraphQL to keep the queries concise in case (i.e. to keep from repeating the same expanded lemma object over and over). But I think your suggestion is very much in line with the approach outlined in the linked fragments proposal in that it puts in the hands of the client to know exactly what it is asking for.

balmas commented 3 years ago

I also have a question about query parameters of the sample query https://gist.github.com/balmas/e7e0e6bc16f2501f3ca06f7462203f70#file-gistfile1-txt-L3-L26 Do I understand correctly that their purpose is to set a filter on what word annotations to be returned?

I'm thinking if it would make sense to use a less formal, but a simpler approach and specify only the fields whose values would serve as an actual filter only? Something like:
wordAnnotations(
  word: "afore",
  lang: "lat",
  pos: VERB,
  ... // Other values that would serve as a filter
)
This would be consistent with usage examples I've seen and will make the query simpler. What do you think? Does it make sense?

I think generally I agree with you. I'm a little uncertain about the example though. In the data model, word is an abstract object (based upon the Ontolex ontology https://www.w3.org/community/ontolex/wiki/Final_Model_Specification), with a property "representation" that contains the actual letters that make up the written representation of the word. We can hide that detail from the client of course in the GraphQL api, but that's why "representation" is there along with "pos" and "lang" which are also properties.

kirlat commented 3 years ago

I've added #42 for the discussion of the annotation UI concepts.

balmas commented 3 years ago

See #43 for discussion of PIDs for data objects.

balmas commented 3 years ago

note that we might want to consider TEI Lex-0 as a possible export format for the lexical data. https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#

kirlat commented 3 years ago

I've started to work on the implementation of annotations for short definitions and I think our requirements dictate us to change the way we store and serve lexical data within the application. The major driver for the change is the requirement for the user to specify dynamically what annotations should be applied to the data model displayed in the UI.

I think the best way to achieve this is to move from static props to methods that return data dynamically, based on user preferences supplied to it. Let's take a DefinitionSet class, for example. Now it has a shortDefs prop, an array. It should, I think, be replaced with a getShortDefs(options) method. The options argument mentioned in this example is the object with parameters that specifiy what annotations and corrections should be taken into consideration (author's only, alpheios.net only, both, etc.). This method will return an array of short defs,, but the information it will contain will be defined by the user preferences passed via the options object.

It seems to be very similar by the approach to the GraphQL where any field requested may have options that would specify what data should be returned and how it must be filtered.

If to accept the approach above, the following implementation would make sense, on my opinion. Upon a lexical query request for the target word a word object containing all information related to the target word specified is returned from the GraphQL facade. This information as returned is, maybe with some amends made by the lexical query, stored within the Word JS object. This data is hidden from the user and is not exposed directly. The data stored within the Word object would be represented by both nodes (lexeme, lemma, definition, etc.) and edges (connection, such as "D is a definition of the lexeme L"), as in the GraphQL response.

The user would then use methods of the Word object to retrieve specific lexical entities: getHomonym(), getLexemes(), and others. getLexemes() may be part of the Homonym object returned by the getHomonym() call or, if called on the Word object, would return all lexemes contained within the GraphQL query response. The objects returned by those methods would contain no edges, only nodes (to be compatible with what we have now), but information from the edges would be attached somehow to the nodes, probably, and thus will be accessible by the requester.

In order to avoid data duplication and to let data changes to be reflected in all instances of the return objects, the nodes data has to be references to the "original" objects stored within the instance of the Word.

It could be backward compatible as the "old" props would be combined with the "new" methods within same objects. The annotation-aware code would call the "new" methods while the existing code would use data from the props.

When the user would want to annotate (edit) a connection (an edge), as when saying "D is not a definition of the lexeme L", a method of the Word instance will be called, Its arguments would be IDs of the definition and the lexeme whose relationship is being edited, along with other information about the edit (is that an assertion or a negation, the user who made the change and so on). Based on that, the Word would create an additional edge representing the edit and all other necessary changes to the lexical data within the Word instance.

This data objects that were created as results of the Word methods would be notified about the data updates. The Word object itself will send a GraphQL request to the backend with the information about the edge created. The backend will store it in its database and will include in all consequential requests.

The Word would become some kind of the "mother" object that will be able to produce data in various forms. That data would exist as long as the mother object stays alive (it can be copied if it need to live longer).

Would the approach like this allow us to achieve what we want? I think it's more complex but infinitely more flexible. As we were discussing before that the Homonym is not the right structure on many occasions, the approach like the above would allow us to retrieve data in any form we want, be it a homonym, or a list of lexemes, or something else. We could vary it depending on the situation and user preferences.

What do you think about the approach? Do you see any pitfalls in it?

balmas commented 3 years ago

I agree this is the right direction. In https://github.com/alpheios-project/documentation/issues/37 we also have proposed changes to the data model to introduce the Word as the driving object. I also agree with the suggestion for the props (e.g. DefinitionSet.shortdefs) to become methods that can be used to retrieve a filtered set of data according to user preferences. This all seems to me to be the way we should be going. There may well be pitfalls but we need to expose them as we go!

kirlat commented 3 years ago

It seems the GraphQL lexical data could be stored the following way (it seems very similar to @balmas proposal of storing graph elements):

Nodes are stored within maps grouped by the node type (lexemes, lemmas, definitions, etc.). The map's key is the node index and the value is the node object itself. Map is required to find a node by its index fast.
Each edge would have its own index that could be a combination of the indexes of the nodes it connects. Edges are stored in the map with the key being the edge's index and the value being the edge object with its metadata.
Each node would keep two arrays of edges: from, containing the IDs of edges coming from the node, and to with the list of incoming nodes. That would be flexible to allow us to edit the graph any way we want. It will also allow to traverse the graph in any direction.

balmas commented 3 years ago

If you're talking about storage on the client, we might want to look into whether there are graph structure libraries that would be helpful to us.

(We need also to get back to defining the GraphQL api for the annotations...)

irina060981 commented 3 years ago

From my point of view - it is not the best way to go, I will try to explain my concerns.

For now we have

data models - that described clear data object without dependency on how it was got, who use it and so on; this way it could be used (and sometimes it is used) in different places of applications and in different applications. Data instance are abstract and independent from interfaces that uses it. It doesn't need any aditional dependencies and would be used with/without GrpahQL.

components - the specific interface that works with data models; that has GraphQL, could attach annotations; or could not attach them; who use different data instances - Homonym, WordItem
annotations - it is not a mandatory addon to the data inside remote sources (not to the data models itself); it adds information - corrects or comments different data in different remote source

GraphQL - is a tool to navigate between requests to collect data from remote source and enrich dataModels. We had tools before GraphQL - and in future will have other tools, while code is progressing.

I believe that we should save this cross-independency. Then DataModels would be safely used in other applications (like AlignmentEditor). Also we have WordList workflow with saved Homonym data - that doesn't need any annotations addons there, I believe.

For example, about given example of shortDefinitions.

we have Definition object, it has defined properties like text; it is used in different parts of an application
we need to have additional interface to getDefinitions data with Annotations, or other interfaces:
1. we could create a child class with additional methods, that could have annotations and corrections - if it is stored inside Definitions objects
2. we could create a specific class inside annotations package EnrichedAnnotation that has relations with GraphQL, Annotations and so on, one of its property could have Definition as a property

This way annotations feature would be incapsulated in the annotations package, it would be used only when we use annotations package and would work when we don't need annotations in the code.

The same for GraphQL - I believe that data model should be independent from the way we get data inside data instances.

It should be done the same way as Vuex Store is integrated to the code, it has its own interfaces inside components package, and its usage doesn't influence on other aplications and even on the other parts like Wordlist.

kirlat commented 3 years ago

I believe that we should save this cross-independency. Then DataModels would be safely used in other applications (like AlignmentEditor). Also we have WordList workflow with saved Homonym data - that doesn't need any annotations addons there, I believe.

I think this is a good point. If not all clients of data models need annotations, it should not be in the data model package. I think we should try to keep annotation-related and the "regular" business logic separated, if possible.

I think it's too hard to build such model theoretically so we should probably try to implement it in code keeping in mind a separation of knowledge domains.

kirlat commented 3 years ago

This is how I think we should represent edges. All edges should always represent asserting, not negating, statements, like "D is a definition of a Lexeme L". The reason for this, I think, is that an asserting statement creates a connection. There is no reason to deny something that does not exist in the first place. So the connection should always be created first before any statements can be made about it.

So here is a statement that defines an edge: "D **is a definition** of the Lexeme L".

Then we can start to gather statements about this edge. The statements could be either assertions, confirming that this connection is valid, or the negations, denying the connection's existence. There could be multiple instances of both from various users. The first assertion should probably come from the party that created it (as alpheios.net) and should be created automatically, along with the edge.

Statements could be attached to the connection as metadata:

"D is a definition of a Lexeme L"
    assertions:
        alpheios.net
        The current user
    negations:
        User 1
        User 2
        User 3

So in this case we have two assertions and three negations and under the normal conditions the connection should not be used during the lemma construction: the definition should not be attached to the lexeme. But if the user sest an option to respect only his/her statements, then the definition should be attached to the lexeme: we have 1 assertion versus 0 negations.

When someone creates an assertion or a negation, it will be passed to the GraphQL API in order to be stored as the edge's metadata. If every edge would have its unique ID, that would be easy to do: we would need to pass an assertion/negation and the ID of the edge it should be attached to.

There are also comments. I think we should be able to attach comments to anything that has an ID: to the node, to the edge, or to the other comment (that would allow to create threaded discussions, if necessary).

So if the connection has comments, it would look like the below:

"D is a definition of a Lexeme L"
    assertions:
        alpheios.net
        The current user
    negations:
        User 1
        User 2
        User 3
    comments:
        Comment 1
        Comment 2

Let's say that the User 4 wants to confirm the assertion and create a comment about that. In that case two new objects need to be created: an assertion and a comment. Both of those new objects will be connected to the edge by the edge's ID, nothing else should be needed. The transaction to update the annotation database will send both objects along with the ID of the edge they're attached to. The resulting edge would look like below:

"D is a definition of a Lexeme L"
    assertions:
        alpheios.net
        The current user
        The user 4
    negations:
        User 1
        User 2
        User 3
    comments:
        Comment 1
        Comment 2
        Comment 3 from the user 4

Now let's assume that someone wants to create a new definition for the existing lexeme. In that case we'll need to:

Create a new definition object and assign a new unique ID to it.
Create a new edge: a connection stating that a new definition belong to the existing lexeme. It will hold two IDs of objects it connects: an ID of the existing lexeme and the ID of the newly created definition.
An assertion by the user who created the definition. It will state that the connection is truthful.
(Optional) a comment explaining why this definition should belong to the lexeme. All four of those objects will be sent to the annotation DB backend.

Would something like this work? if so, I will create GraphQL transactions around them.

balmas commented 3 years ago

I believe that we should save this cross-independency. Then DataModels would be safely used in other applications (like AlignmentEditor). Also we have WordList workflow with saved Homonym data - that doesn't need any annotations addons there, I believe.

I think this is a good point. If not all clients of data models need annotations, it should not be in the data model package. I think we should try to keep annotation-related and the "regular" business logic separated, if possible.

I think it's too hard to build such model theoretically so we should probably try to implement it in code keeping in mind a separation of knowledge domains.

Agree this is an interesting point. However, to be clear, it's not just annotations we're talking about here. Another reason for this refactoring is that we need to be able to retrieve resources from a wider variety of sources and let the user choose which to include and how to combine them. But that's also the point of using GraphQL -- it is supposed to address just this use case. We should keep that business logic for combining resources behind the GraphQL facade, but I'm not sure it means that we shouldn't have a method on the data model object to specifically request the data according to the user preferences.

I think it's good to proceed cautiously here and and at each step ask ourselves if we have appropriate separation of concerns.

balmas commented 3 years ago

This is how I think we should represent edges. All edges should always represent asserting, not negating, statements, like "D is a definition of a Lexeme L". The reason for this, I think, is that an asserting statement creates a connection. There is no reason to deny something that does not exist in the first place. So the connection should always be created first before any statements can be made about it.

So here is a statement that defines an edge: "D **is a definition** of the Lexeme L".

Then we can start to gather statements about this edge. The statements could be either assertions, confirming that this connection is valid, or the negations, denying the connection's existence. There could be multiple instances of both from various users. The first assertion should probably come from the party that created it (as alpheios.net) and should be created automatically, along with the edge.

Statements could be attached to the connection as metadata:
"D is a definition of a Lexeme L"
    assertions:
        alpheios.net
        The current user
    negations:
        User 1
        User 2
        User 3
So in this case we have two assertions and three negations and under the normal conditions the connection should not be used during the lemma construction: the definition should not be attached to the lexeme. But if the user sest an option to respect only his/her statements, then the definition should be attached to the lexeme: we have 1 assertion versus 0 negations.

When someone creates an assertion or a negation, it will be passed to the GraphQL API in order to be stored as the edge's metadata. If every edge would have its unique ID, that would be easy to do: we would need to pass an assertion/negation and the ID of the edge it should be attached to.

There are also comments. I think we should be able to attach comments to anything that has an ID: to the node, to the edge, or to the other comment (that would allow to create threaded discussions, if necessary).

So if the connection has comments, it would look like the below:
"D is a definition of a Lexeme L"
    assertions:
        alpheios.net
        The current user
    negations:
        User 1
        User 2
        User 3
    comments:
        Comment 1
        Comment 2
Let's say that the User 4 wants to confirm the assertion and create a comment about that. In that case two new objects need to be created: an assertion and a comment. Both of those new objects will be connected to the edge by the edge's ID, nothing else should be needed. The transaction to update the annotation database will send both objects along with the ID of the edge they're attached to. The resulting edge would look like below:
"D is a definition of a Lexeme L"
    assertions:
        alpheios.net
        The current user
        The user 4
    negations:
        User 1
        User 2
        User 3
    comments:
        Comment 1
        Comment 2
        Comment 3 from the user 4
Now let's assume that someone wants to create a new definition for the existing lexeme. In that case we'll need to:

Create a new definition object and assign a new unique ID to it.

Create a new edge: a connection stating that a new definition belong to the existing lexeme. It will hold two IDs of objects it connects: an ID of the existing lexeme and the ID of the newly created definition.

An assertion by the user who created the definition. It will state that the connection is truthful.

(Optional) a comment explaining why this definition should belong to the lexeme. All four of those objects will be sent to the annotation DB backend.

Would something like this work? if so, I will create GraphQL transactions around them.

I need to think about this a bit. Is the primary difference between this and the original model I proposed (and then revised to remove the assertions as nodes), is that rather than assertions/negations being nodes with a user at one end and an edge (treated as a node) at the other, they are properties of the edge itself?

kirlat commented 3 years ago

Is the primary difference between this and the original model I proposed (and then revised to remove the assertions as nodes), is that rather than assertions/negations being nodes with a user at one end and an edge (treated as a node) at the other, they are properties of the edge itself?

I was drawing a lot of diagrams on paper picturing possible ways to express lexical relationships and then was trying to match it to our existing data structures and to the possible GraphQL API. What I've described is the simplest way to achieve what we want that I've found.

There is an edge between lexicaI nodes and the user, but they are straightforward, has no metadata attached, is always one-to-one, would never be amended once created, so I decided to omit them and show users as properties. But technically it is an edge. I just did expose it as that for simplicity.

I think it's very similar to your approach:

I have removed the User from the graph model for now, opting instead to use a reference to the user in the createdBy property of both Node and Edge documents. I want to be able to associate a User with both an edge and a node within the same graph, and there isn't a graph-like way to do that that I can think of that doesn't break the graph. It might also make sense to some point to have completely separate User collections for data (similar to the language-specific collections of nodes). In LDP terms, the User is probably more a container than either a node or an edge.

I believe the "main" graph should portray relationships between lexical entities only. Users represent a different concept and I think they probably should not be on the graph. Having them as props that hold a reference to the user object in the user collection (as multiple objects could refer to the same user) should be sufficient for us, I think.

... we have the potential for many edges that essentially say the same thing, asserted by different authorities. I think ideally the query implementation would dedupe and aggregate them, but it will be up to the client to decide how to handle conflicting assertions.

I liked this approach and I've tried it first, but I think it's too complex and would create too many issues with representing it in both GraphQL and JS objects. So I've tried to replace it with a simpler one: one edge with many assertions attached. The sum of those assertions would decide how "strong", or "valid" the connection is.

So I think the major difference is that I suggest to replace multiple relationships representing an individual assertion or negation each with a single relationship having many assertions/negations attached to it (unless I'm missing any other important points). I think it would be way simpler to store it in DB this way and to present in GraphQL results.

We might also have one for the commentingOn relationship, one for the relationship between parts of alignment nodes, etc.

Similar to users, I think it's simpler not to create an edge, but just to attach a comment object to either node, edge, or another comment. First, comments are conceptually different from lexical entities; they probably belong to other, "non-lexical" dimension. Second, we have to be able to add comments to relationships (edges) but then we wouldn't be able to do so because we'd have to create an edge between an edge (a lexical relationship we want to comment on) and the comment itself (the node). Edges can connect nodes only. So we should not have an edge here, I think.

Here are my thoughts on this. What do you think? It's still fresh in my head and not fully formalized, but I think it's probably enough to represent some adjustments to a concept.

balmas commented 3 years ago

I think your approach about the assertions/negations is worth trying. It is probably easier to support than having edges as negations and I agree with the philosophical point that creating an edge to say a relationship doesn't exist is counterintuitive. However, we need to be able to have properties on the Assertions other than the user -- they also need, for example, level of confidence, and creation dates.

For comments, I'm a little less certain. I agree comments are probably in a separate dimension but we have to also consider the use case of comments on other comments. Maybe they need to go the other way --- i.e. Comments are Nodes and there can be a commentsOn relationship between two Comment Nodes, but a Comment on a LexicalRelationship references the LexicalRelationship edge it comments on as a property?

kirlat commented 3 years ago

However, we need to be able to have properties on the Assertions other than the user -- they also need, for example, level of confidence, and creation dates.

Should have no problems with it, I think.

For comments, I'm a little less certain. I agree comments are probably in a separate dimension but we have to also consider the use case of comments on other comments. Maybe they need to go the other way --- i.e. Comments are Nodes and there can be a commentsOn relationship between two Comment Nodes, but a Comment on a LexicalRelationship references the LexicalRelationship edge it comments on as a property?

What if in order to solve this conundrum we follow the FB approach and split comments on "comments" and "replies"? Comments could be attached to both lexical entities and lexical relationships. But if someone wants to add a comment on a comment, that would be a reply, and there will be an edge connecting the comment and the reply, or two replies, if it's a threaded discussion. It also would be in-line with original meanings of the terms: https://ux.stackexchange.com/questions/118624/comment-vs-reply-in-a-social-feed.

Here is a piece of documentation confirming that FB treats comments and replies differently. Not sure what's the reason, but maybe they faced issues similar to what we're trying to solve.

We could probably think of it as of different transparent planes stacked on top of each other. The base plane is the one with the lexical relationships graph. The one on top is comments/replies. The comment prop on the lexical graph plane may become a node on the comments plane to which replies can be attached.

So the comments/replies graph would exist only if there are replies to a comment. There would be multiple reply graphs each having a comment as a root node.

balmas commented 3 years ago

Hmm. It's not clear to me from that FB link that Facebook really treats comments and replies separately -- to get the comments on a comment you access the /comments edge and it says that a comment may be a reply.

I think we have the following use cases for comments

(1) a comment on a lexical entity node a comment on a lexical entity relationship (3) a comment on (or reply to) a comment

for (1) and (3) it seems pretty clear to me that the comment should be a separate node, and the relationship between the comment and the thing it comments on is an edge.

for (2) it's murkier, but it seems like perhaps then we still create the comment as a node, but here it is referenced as a property of the lexical entity relationship, and then comments/replies to it are in the comments/replies graph.

I think this is essentially what you were suggesting, except I think the comment should always be a node regardless of whether it has any comments/replies to it.

kirlat commented 3 years ago

I'm not fully familiar with the FB approach but that phrase from documentation

Replies - If this is a comment that is a reply to another comment, the permissions required apply to the object that the parent comment was added to.

and the way they use /comments edge hints to they way that they have replies as a special type of comments, yet within the comments group.

See also the user comment in the other link stating that

Facebook also allows "Replies" to "Comments"

But those probably are just terms used so that the model made more sense.

I guess generally we have a plan and this is just one of the minor points. Technically, since edge would have an ID, we can attach anything to it (a node). And if we consider comments to be on a different plane this would not break the model of the lexical entity relationships graph. But I'm not sure if it's the best way to implement it. Will think more about it.

kirlat commented 3 years ago

I was thinking about it, and reading about different implementations, and I think we might solve our issues if we introduce a node in the middle (NiM) between a lexeme and a definition. It is something similar conceptually to the "connection" concept that many GraphQL tools have (but probably it's not exactly the same). A connection point would provide a place where comments, along with other data, can be attached.

This is how it might look on a diagram:

So the connection point reflects the fact that there is a definition attached, but it that does not describe what this definition exactly is. I think those two concepts are related, but yet separate.

I think it might be very beneficial not only as a point to attach comments, but as a way to track changes. There could be two types of edits: one to replace a definition text with the other one, and the other to add a new definition, or remove definition altogether. The diagram with connection points allows to track both those cases.

Let's check the diagram with direct connections between a lexeme and definitions:

Let's say someone decided that the first definition does not make sense for the lexeme and has to be the deleted and the text of the second definition should be edited. The resulting diagram would be: From the resulting state, however, is not clear whether it is a result of editing the first definition and removal of the second or the opposite. The other problem is: if history and comments are attached to the edge, how would it be kept if a definition is removed? We cannot have an edge that points to nothing. It has to be deleted. As a result, the related history will go away with it, but I think we would like to know why, when, and why removed the definition.

If we'll use connection points, the state after the edits will look like below:
It is clear from the diagram now that it's the first definition being deleted and the text of the second definition being replaced. We will also retian an empty connection point that would keep the data about the deletion.

Let's complicate the situation and say that an addition to the edits described above (first definition removed, second one edited) there is also the third definition added. With direct connection, the resulting diagrams would be: I think it does not convey what exactly happened here.

With connection points, a diagram would be much more informative, on my opinion:

The connection point might seem to be an unnecessary complication, but, at the same time, it helps (I think) to solve our problems and would ultimately serve to our convenience. I think we can think of connection points as of the number of definitions attached to the lexeme (i.e. if we can say "this lexeme has three definitions" this will mean three connection points). The definitions itself, on the other hand, could be think of as just a representation of a definition text. As with value objects, if we want to edit a definition text, we create a new Definition object with the text edited and reattach it to the same connection point, replacing the old definition. The old definition object could be kept in database, and being referenced from the history of definition edits. If someone would decide that the old definition was better, it can be reattached to the same connection point.

What do you think about this approach as a whole?

balmas commented 3 years ago

I think that could work. Would there also be a “replaces” edge between definition c and b ? (Eg definitionC replaces definitionB)?

kirlat commented 3 years ago

I think that could work. Would there also be a “replaces” edge between definition c and b ? (Eg definitionC replaces definitionB)?

I'm not sure that we need it. I thought we might keep all edit-related info at the connection point. If I understand use cases correctly (please let me know if not), we would never ever replace Def.C with Def.B once and for all. We probably have to keep both indefinitely.

So let's say we retrieved lexical data and are building a lexeme. According to our algorithms we think that is should have a Def.B attached to this lexeme. We create an assertion stating that the Def.B is attached to this connection point, authored as "alpheios.net". Some time later someone thinks this is wrong. He/she creates a negation stating that the Def.B does not belong to this connection point and an assertion that the Def.C is the right one. So the Def.C (it has one assertion, assertion score is 1) should normally be attached to the lexeme. That beats the Def.B which has an assertions score of 0 (one assertion vs. one negation; I'm ignoring confidence, weight, and other factors here).

But if an option be set to ignore user's annotation, then the same lexeme would be rendered with the Def.B.

So it seems both definitions need to be connected at the same time, and the definitions are never really removed. The diagram would look more like below:

It also means that assertions, negations, and comments should contain the ID of the definition to which they are attached. We then would check all connections existing from this connection point (to Def.B and Def.C), gather assertions/negations related to each, and, based on that, decide what connection should prevail.

Is that so? Do I understand things correctly?

balmas commented 3 years ago

Ok. I guess I was thinking of the versioning scenario, where definition c was a correction of the text of definition b . But I think we should not get too bogged down with all of the possible variations right now. I think the structure you have proposed, with the node-in-the-middle addresses one of the key things that was still troubling me about the data model design and is a reasonable jumping off point.

I will work on introducing that into the prototype ArangoDB model.

kirlat commented 3 years ago

As we've discussed previously, it would be not a good idea to change the existing Data Model objects in order for them to support annotations. That's because other apps are using them that do not embrace the annotations concept.

How about the components library, and, especially, the UI components themselves? In order to support annotations and annotation options they should be aware at leas of the existence of annotations: there would be annotation options to display, and many annotation-related data objects, such as comments, assertions, and negations that need to be displayed by the UI. The lexical query should also be aware of the annotation-related data in order to factor it into the resulting Word object. The best we could do is probably to keep some UI and other components annotation-agnostic, but we can't do it for all of them. The annotation knowledge has to trickle in somewhere inside the components (but we can try to keep this tightly controlled).

So my question is that: would we ever need an assembly of components (i.e. the build of the components library) that is annotation-unaware? If so, we can handle it, probably, this way:

Make annotations be a package on top of components, providing some alternative component and method implementations that will include annotations data.
If we want an "annotation-free" build of components, we'll use the components library.
If we want annotations to be included, then the build from the annotations package should produce the "annotatable" version of components.

If that is not needed, we can simply let the (hopefully) limited amount of annotation knowledge to trickle into the components, maybe in a form of plug-ins and/or modules.

What would be the best approach to handle that?

balmas commented 3 years ago

As we've discussed previously, it would be not a good idea to change the existing Data Model objects in order for them to support annotations. That's because other apps are using them that do not embrace the annotations concept.

I'm not sure we have concluded that. See my comments at https://github.com/alpheios-project/documentation/issues/40#issuecomment-755336114

balmas commented 3 years ago

I would rather not think of this as annotations-included or annotation-free. But instead, recognize that the data sources that contribute to produce the final data that the user sees are fluid and both the user and the application may influence not only which data sources are included but also how they are combined.

kirlat commented 3 years ago

Here is the first take at GraphQL type definitions with annotation support: https://gist.github.com/kirlat/5c36baaf26e3ea399bfe36d0a354c7b1. Only some objects are annotatable (Lexeme, Definition, and the connection between them); I think we can add that to other objects later.

What do you think? Am I missing anything there?

balmas commented 3 years ago

Thanks! I added some comments directly in the Gist.

kirlat commented 3 years ago

Please check an updated version with the suggested changes implemented and some mutations added: https://gist.github.com/kirlat/5c36baaf26e3ea399bfe36d0a354c7b1

I've also made types more specific by introducing the Assertion and the Negation types. I was following an advice from this article.

balmas commented 3 years ago

Comments added to the gist.

kirlat commented 3 years ago

Per discussion on Slack with @balmas: we need a way to integrate annotation data into our existing data model without significant changes to the data model itself (for several reasons).

The current situation: Lexical query produces a Homonym or HomonymGroup object which contains all necessary information encapsulated within the objects it is comprised of.

How could this be changed to accommodate for the annotations data? Two approaches comes to mind.

One is the centralized annotation data storage: Data Model Object stores the results of the word lexical query. It has a method that returns the word object. That might be used instead or along with both the Homonym and the HomonymGroup (there could be methods to return either Word, Homonym, and HomonymGroup). The lexemes and other objects down the hierarchy are exactly the same as they are now. They do not contain any annotation data.

An annotation data could be retrieved, updated, added, or removed via specialized methods of the Data Model Object class. Lexical elements to which annotation data is connected are referred by their IDs.

In this model any piece of code has to have a reference to the Data Model Object and can use its methods to retrieve/alter the annotation data.

The other approach is when annotation data is spread across all lexical data objects within the hierarchy. We could keep the structure of the lexical objects (Lexeme, DefinitionSet, Definition) the same as it is now, but add an AnnotationData object to the prop dynamically. This way changes to the object would be minimal. It would also be backward compatible: the parties that are not aware of the annotation data would simply ignore it and work with lexical objects as before: Whoever need annotation data would use methods of the AnnotationData objects to obtain and edit it.

Another option, a combination of the two approaches described above, is to integrate not the annotation data, but the annotation API to the lexical data items. These API methods would retrieve and change the data that is located within the Data Model Object: I think of it as of a convenience version of the first approach. Annotation methods are grouped into objects whose annotation data will be accessed. As a result, instead of a "big ball of methods" we have methods grouped in a nicer way. Also, since methods are aware about what data object they're attached to, their signatures can become simpler. For example, to add an annotation to the lexeme, instead of dataModelObject.addLexemeAssertion(lexemeID, assertion) we can use lexeme.annotations.addAssertion(assertion).

We might use the same "distributed API approach" in other cases. For example, if the lexeme has changed and we want to pull the updated data, we can use something like lexeme.update() instead of dataModelObject.updateLexeme(lexemeID). The method will request an updated data from the Data Model Object and will update the lexeme data fields with it.

Are there any other approaches possible?

What do you think would be the best way to go for us?

P.S. After reading this interesting Stack Oveflow question I've started to think that another beneficial approach might be for the Data Model Object to return lexical objects without the annotation data attached but then the annotation-aware code to use the annotation API of the Data Model Object (or something else) to pull the corresponding annotation data and, possibly, attach it to the lexical data objects (or not to attach it and use as separated objects). That would provide the best isolation between the lexical data and the annotations as only the annotation-aware code would pull the annotation data into the application context.

balmas commented 3 years ago

I think the 3rd approach is the closest to what we need to do, not only for annotations but also for the lexical data itself.

A big problem for the 2nd approach (annotation data kept in the data models) is that it doesn't account for the inter-dependencies between the different parts of the lexical data objects.

The work on the treebank disambiguation has made this a little clearer to me and I think it applies to both the aggregation/disamibugation workflow and the annotation workflow.

Looking at that a little more closely, we currently have something like this:

Lexical module queries the individual data sources via the client adapters
The client adapters apply domain business logic to create alpheios data objects out of the raw data, applying both language-specific and resource-specific rules.
The lexical query aggregates the results
The lexical query tells the Homonym object to disambiguate all the possible Lexemes
the Homonym object tells each Lexeme to compare itself to another Lexeme
The Lexeme object tells its Lemma to compare itself to the other Lexeme's Lemma
The Lexeme object tells each Inflection to compare itself the other Lexeme's inflections
Only when we have all of the comparison results, can we make a decision about how to combine those results to create the Homonym object we show the user.
Lexical query then passes the resulting aggregated/disambiguated data model objects back to the application, discarding the original query results

There are a number of problems with this including:

1) the code cannot yet handle comparing more than 2 possible sources of data, and the rules about how they are combined are essentially hard-coded.

2) By the time we show the user the results, the details of how we made our decision of what to show them are no longer present in the object, so we can only show the results and now how we came to them. This will be a problem for annotation because we need to be able to be explicit about each piece of data we are annotating and tie it back to its original source.

3) The query results are transitory. The annotation/disambiguation process changes the results and there is no way to go back in and recombine them in a different way using different rules.