alpheios-project / documentation

Alpheios Developer Documentation

Lexical Query Refactoring #33

Open · balmas opened this issue 3 years ago

balmas commented 3 years ago

This issue is to discuss requirements and design decisions for refactoring the lexical query.

Some desired features driving the need for refactoring:

balmas commented 3 years ago

Some problems with the current code:

irina060981 commented 3 years ago

I would point to a few more issues discussed at the meeting (as far as I remember them :) ):

  1. we have a feature - download a word list with short definitions - and sometimes we need to re-request it from the remote source; with the current implementation we

    • must first request the morphology data and only then request the definitions data
    • need a clear point at which the definitions requests have finished
  2. we need the ability to execute a lexical query without explicitly specifying dictionaries for the definitions (i.e. to have some built-in default definitions)

balmas commented 3 years ago

Our lexical query sequence is something like the following (although not all sources are represented here):

[diagram: alpheiosflow]

The end result from the internal data model perspective is something like this:

[diagram: homonym_1]

Each word lookup results in a single Homonym object whose parts are populated by a number of different sources, with business logic applied to the ways the different sources of data are combined and/or dependent upon each other.

The calls to the data services are made asynchronously so that we can respond quickly with some data while the rest is being retrieved, but there are also dependencies between the responses.
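As a rough illustration of that data model (the property names here are simplified, not the exact alpheios-core API), a populated result might look something like this:

```js
// Hypothetical, simplified shape of the assembled result for one lookup.
// The real Homonym/Lexeme classes in alpheios-core differ in detail.
const homonym = {
  targetWord: 'cepit',
  lexemes: [
    {
      lemma: { word: 'capio', languageID: 'lat' },  // from the morphological parser
      inflections: [{ stem: 'cep', suffix: 'it' }], // parser and/or treebank data
      shortDefs: ['to take, seize'],                // from the short-definition lexicons
      fullDefs: []                                  // full dictionary entries, fetched later
    }
  ]
}
```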

As we increase the number of back-end services, we need a better way to manage the business logic dealing with the dependencies between the results. We also need to support different dependency chains and workflows depending upon

There are a number of hacks in the current code that were made to force resources to fit into a specific workflow of:

Single Word Selection -> Single Homonym Object -> Lemma Identification -> Definition Lookup

For Chinese, we often get multiple Homonyms from a single word, but our data model doesn't support that, so we currently treat them all as lexemes of a single Homonym. (see alpheios-project/alpheios-core#135)

For Persian, there is neither a clear way to identify the lemma of a word nor a tradition of lexical resources being tied to well-accepted lemmas, so we have implemented a custom workflow that skips the morphological parse and goes straight to the short-definition lookup on the full form.

These hacks should be removed with the refactoring, and the workflow requirements taken into account as first-class requirements.

The following diagram shows the interdependencies between the current lexical data sources:

[diagram: lexical_data_dependencies]

We will be adding new datasources per the list at: https://github.com/alpheios-project/documentation/blob/master/development/data-services/datasources.csv

And we want to move towards workflows that look like the following:

[diagram: lexicalworkflows]

kirlat commented 3 years ago

Organizationally, we can probably separate the LexicalQuery refactoring requirements into three groups that are almost independent of each other:

  1. Lookup history and multiple popups.
  2. Multiple layered parsers per language (and per user preference).
  3. User annotation (corrections, additions, comments) of morphological data.

In my subjective opinion, those tasks are listed in order of increasing complexity. The good thing is that we can tackle them independently.

The problem with handling multiple homonyms can probably be solved by wider use of the HomonymGroup object.

Lookup history and multiple popups: this can probably be solved by storing the results of each lexical query within an independent HomonymGroup object. Each HomonymGroup would thus represent the results of an individual lexical query, and storing a series of HomonymGroup objects would represent a history of lexical requests; it would also allow us to cache them easily. Updating the popup Vue component so that multiple instances can be used simultaneously, and assigning an individual HomonymGroup to each popup instance, would cover the other part of the task's requirements.
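A minimal sketch of that idea (the LookupHistory class and its property names are hypothetical; HomonymGroup is the data model object mentioned above):

```js
// Hypothetical history container: one HomonymGroup per lexical query.
class LookupHistory {
  constructor (maxEntries = 20) {
    this.maxEntries = maxEntries
    this.entries = [] // most recent query last
  }

  // Store the results of one lexical query.
  add (homonymGroup) {
    this.entries.push(homonymGroup)
    if (this.entries.length > this.maxEntries) { this.entries.shift() }
  }

  // Reuse cached results instead of re-running the query, if available.
  find (targetWord) {
    return this.entries.find(group => group.targetWord === targetWord)
  }
}
```

Each popup instance would then simply be bound to one entry of such a history.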

Multiple layered parsers per language (and per user preference): the diagrams provided above are extremely useful in understanding the specifics of this task. The key question is how to construct the data retrieval business logic so that it is simple enough while accommodating the specifics of each language/word/context, and yet flexible enough to allow adding other data sources, or even changing the data retrieval logic, without affecting the outside parts of the application.

GraphQL seems to be a promising answer, but it has to prove that it can fit into our current architecture smoothly. To make data retrieval more efficient by using caching and reducing the number of round trips, it might be beneficial to move that logic to the server: the client would issue one request to the server to retrieve all morphological data, and the server, on the client's behalf, would go to the other servers, retrieve the data needed, and return it to the client. We cannot, however, go all-in on a server-side solution because some sources, such as the treebank, can be client-side only. Luckily, GraphQL solutions such as Apollo may allow us to combine both server-side and client-side sources within the same GraphQL query.
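As an illustration of that last point (the schema and field names below are hypothetical, not an agreed design), Apollo's @client directive allows a single query to mix fields resolved on the server with fields resolved by client-side code such as a treebank adapter:

```js
import gql from 'graphql-tag'

// Hypothetical query: most fields would be resolved server-side, while the
// treebank data would be resolved by a local (client-side) resolver marked
// with Apollo's @client directive.
const WORD_LOOKUP = gql`
  query WordLookup($word: String!, $lang: String!) {
    word(word: $word, language: $lang) {
      lexemes {
        lemma
        shortDefs
      }
      treebankData @client
    }
  }
`
```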

The other problem we face here is that, in order to make the UX smooth, we cannot wait until all the data is retrieved to show query results. We need to show data pieces as they arrive, even if not all the data is available yet. I'm not sure yet how to solve this with GraphQL. The ideal solution would be Apollo's @defer directive, but it is, unfortunately, not implemented yet. We might split the GraphQL queries, but that, I'm afraid, would kill the advantages of GraphQL altogether. I'm sure we're not the only ones facing this problem, though, so hopefully there is something out there already.

User annotation (corrections, additions, comments) of morphological data: for solving this, technologies related to Linked Data (Solid, JSON-LD, RDF, Turtle, and Dokieli) seem like a perfect match. With them, we would be able to link user annotations stored on our own or other servers with morphological data and texts located somewhere else. So we should probably move in that direction for this.
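As a very rough sketch of what such an annotation could look like (using the W3C Web Annotation JSON-LD vocabulary; the target URI and the correction text are made up):

```js
// Hypothetical user correction of a morphological analysis, expressed with
// the W3C Web Annotation vocabulary so it can live on any Linked Data server.
const annotation = {
  '@context': 'http://www.w3.org/ns/anno.jsonld',
  type: 'Annotation',
  motivation: 'editing',
  body: {
    type: 'TextualBody',
    value: 'lemma should be "capio", not "capto"',
    format: 'text/plain'
  },
  target: 'https://example.org/texts/caesar/bg/1.1#cepit' // placeholder URI
}
```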

These are my general thoughts on the implementation. Do they make sense? What do you think?

balmas commented 3 years ago

Yes I think that we are generally in agreement regarding implementation and approach, and I mostly agree with your division of the refactoring.

There will be some overlap between the 2nd and 3rd groups, however (multiple layered parsers per language, and user annotation). I also think it's probably clearer to refer to the 2nd group as something like "multiple layered sources per language", since parsers are only one possible source of data. In the datasources list https://github.com/alpheios-project/documentation/blob/master/development/data-services/datasources.csv I've tried to show all the different possible sources - current, planned and future - and what they contribute. So, for example, user annotations are one source of morphological data that needs to feed into disambiguation of results.

I will work on the user stories for the Lemma/Morph Workflows and Resource Lookup Workflows to hopefully make this clearer.

I really am hoping that use of GraphQL will help make the implementation cleaner. But agree that it remains to be seen, particularly with regard to the performance.

kirlat commented 3 years ago

If we show several popups, what morphological info would be shown in the panel? Now we have a single popup, and the morphological data in the panel matches the word that is shown in the popup. How would it behave if we show several popups at the same time? Would the content of the panel change depending on which popup is active? Would we have a dropdown in the panel that allows choosing between the words in the popups that are currently open? Would we change the UI in some other way to support multiple popups?

balmas commented 3 years ago

If we show several popups, what morphological info would be shown in the panel? Now we have a single popup, and the morphological data in the panel matches the word that is shown in the popup. How would it behave if we show several popups at the same time? Would the content of the panel change depending on which popup is active? Would we have a dropdown in the panel that allows choosing between the words in the popups that are currently open? Would we change the UI in some other way to support multiple popups?

This is a good question. What do you think is better for the user? @irina060981 @monzug @abrasax what are your thoughts?

kirlat commented 3 years ago

I think it would make sense, if we are to show multiple popups with different words, to group all the related morphological data together. Otherwise it might be confusing to understand what is related to what. I'm not sure what the best way to do that is.

One solution might be to have a popup with tabs that provide access to full definitions, inflection tables, etc. The popup will be too small for full definitions or inflection tables, so maybe we should add the ability to expand it temporarily to the full height of the screen (and maybe even to the full width too).

But I do not think this is an ideal solution. Maybe we can do something else? I have a feeling that showing multiple words at the same time would force us into some significant changes to the current UI. Luckily, that should not be too hard to change because, with our current modular architecture, we can place the Vue components that display morphological data almost anywhere with very little effort.

irina060981 commented 3 years ago

About lookups and multiple popups: I think the main reason we want to have multiple popups is to have the ability to compare morphological results side by side. And if that is so, tabs won't be the most useful way to achieve it.

Maybe it would be comfortable to go this way:

  • there would be two modes of the application - single/multiple, defined in settings
  • single - as it works now
  • multiple:

    • there would be several popups (the number could also be limited in settings)
    • there would be one panel divided into two parts (top and bottom), and a targetWord could be selected in each part (if there are only two words selected, maybe we don't need such an option)
    • each popup would have its own pin button - if a popup is pinned, it stays unchanged on any subsequent lookup; otherwise it could be reloaded with a new lookup (if the defined number of simultaneous lookups has been reached)

What do you think about going this way?

irina060981 commented 3 years ago

About multiple layered parsers per language (and per user preference): I think we could use one of the following approaches here:

  1. we define a unified workflow with different conditions, as it is done now
  2. we could use some unified "language" for defining a unified API, like GraphQL (but it is not ready for all our needs yet)
  3. or we could go another way: we could create a bunch of classes/subclasses for different types of requests (similar to how we have Views inside the inflection tables), for example SimpleLatinLookup, ShortDefinitionsLatinLookup and so on. We could make them data-independent (only static methods) and even move them to a separate module (so they could be used inside other applications). From my point of view it is close to the GraphQL idea but has fewer limitations; a rough sketch follows below.
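A rough sketch of what such a request class might look like (the class and method names are just an illustration of this variant, not an actual API):

```js
// Hypothetical request class: all logic for one request type in one place,
// implemented as data-independent static methods.
class ShortDefinitionsLatinLookup {
  // Runs the whole workflow for this request type.
  static async execute (targetWord) {
    const homonym = await this.getMorphology(targetWord)
    const shortDefs = await this.getShortDefs(homonym)
    return { homonym, shortDefs }
  }

  static async getMorphology (targetWord) {
    // would call the morphological analyzer client adapter here
  }

  static async getShortDefs (homonym) {
    // would call the short-definition lexicon adapters here
  }
}
```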

What do you think about this variant?

kirlat commented 3 years ago

I think you provided a good classification of approaches. I just do not see GraphQL as an approach, but rather as a way to define an interface. In my opinion, a GraphQL request does not specify how data should be retrieved and processed. Sometimes GraphQL requests can be built in a way that specifies it, but I think they should not be used this way: GraphQL should specify only what data needs to be retrieved, not how it should be gathered. It is the responsibility of the request processor to decide the optimal way to deliver the data requested.

With GraphQL, the underlying processing functions are tied to individual fields or groups of fields. The question is how to construct those functions so they retrieve data in the most efficient way. This is the problem we're facing now. So we can probably take GraphQL out of the equation for the discussion of how best to retrieve data.

Approach (1), which we're using at the moment, has obvious drawbacks that we're trying to solve: as the logic becomes more and more complex, keeping track of the various conditions becomes an enormous task. It leads to "spaghetti code" that, in my opinion, is very hard to maintain for anything but the most obvious cases.

I like approach (3) much more than (1). With it, each request is isolated from the others, and we're free to change it without fear of breaking other requests, as is often the case with (1). However, (3) is not without its drawbacks: as the number of different request types increases, we would have more and more specialized classes, and they would become progressively harder to maintain. Also, with approach (3) we would almost inevitably have duplicated code across several request classes. This is an acceptable price to pay for isolation, but I'm wondering if we can do better.

Speaking in plain language, we have a bunch of operations that retrieve data. Some operations depend on each other (we cannot retrieve full definitions until the morphological data is obtained). Some operations are independent (e.g. retrieving data from the Tufts analyzer and the Treebank) and can be executed in parallel.

This sounds like a problem many software developers face, and there is a concept that comes to the rescue: the directed acyclic graph (DAG). I think (I'm still researching, but I've seen some solutions already) there are libraries that can build a graph like that. Once the graph is built, our task is simple: navigate the graph and execute the operations according to their order.

So, in my opinion, we can do the following:

  1. Create a set of atomic data retrieval operations. This is already pretty much done with the client adapters; we can add more operations as needed.
  2. List the dependencies between operations (we haven't done this in a formalized way yet).
  3. Use a library to do a topological sort of the abovementioned operations.
  4. Navigate the graph and execute the operations in the resulting order.

With this, adding a new operation (i.e. a new lexical data retrieval) is extremely simple and transparent: define the operation, list its dependencies, and redo the topological sort. That's it!

So what do you think about an approach like that? Would it work for us? Are there any aspects of the requirements that it will not address? Are there any drawbacks with an approach like this?

irina060981 commented 3 years ago

From my point of view, the conceptual difference between the graph concept and the class arrangement is not very big. Classes are always hierarchical - a single vertical relationship; graphs are flat - multiple horizontal relationships.

@kirlat, why do you think the second variant would be easier to support? I am not familiar with this technology, but it seems to me that in the end we would have a bunch of graph descriptions somewhere, one for each request, and their number would be the same as the number of classes in the first approach. Am I right?

kirlat commented 3 years ago

From my point of view, the conceptual difference between the graph concept and the class arrangement is not very big.

That could be true. Could you please describe how you envision the implementation of classes like SomeLatinLookup? Would there be a function inside the class that specifies the sequence of calls: Tufts and the Treebank first, then disambiguation, then lemma translations and word usage examples, then short and full definitions (this is just an arbitrary example of a data retrieval sequence, not an actual workflow)? I just want to be sure that I understand your idea correctly.

What I liked about the DAG approach is that it allows us to abstract things away a little and take care of building the dependency graphs automatically. This is how I was thinking we could use it.

Let's define some atomic operations (please consider this just a rough example; its purpose is not to go too deep into details at this stage):

| Operation | Purpose | Dependencies |
| --- | --- | --- |
| Tufts | Retrieval of data from Tufts | none |
| Treebank | Retrieval of data from Treebank | none |
| DisambiguatedHomonym | Merge of data from Tufts and Treebank | Tufts, Treebank |
| LemmaTranslations | Retrieve translations of the lemma | DisambiguatedHomonym |
| UsageExamples | Retrieve usage examples | DisambiguatedHomonym |
| ShortDefs | Get short definitions | DisambiguatedHomonym |
| FullDefs | Get full definitions | DisambiguatedHomonym |

For the items listed in the table, the execution order would be (a code sketch follows the list):

  1. Tufts, Treebank in parallel.
  2. DisambiguatedHomonym.
  3. LemmaTranslations, UsageExamples, ShortDefs, FullDefs in parallel.
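A self-contained sketch of how those batches could be derived and executed (the operation names follow the table above; the executor functions stand in for the existing client adapters):

```js
// Dependencies copied from the table above: task -> tasks it depends on.
const dependencies = {
  Tufts: [],
  Treebank: [],
  DisambiguatedHomonym: ['Tufts', 'Treebank'],
  LemmaTranslations: ['DisambiguatedHomonym'],
  UsageExamples: ['DisambiguatedHomonym'],
  ShortDefs: ['DisambiguatedHomonym'],
  FullDefs: ['DisambiguatedHomonym']
}

// Group tasks into batches: a task goes into the first batch in which all of
// its dependencies have already completed (a level-by-level topological sort).
function buildBatches (deps) {
  const done = new Set()
  const remaining = new Set(Object.keys(deps))
  const batches = []
  while (remaining.size > 0) {
    const batch = [...remaining].filter(task => deps[task].every(d => done.has(d)))
    if (batch.length === 0) { throw new Error('Cyclic dependency detected') }
    batch.forEach(task => { done.add(task); remaining.delete(task) })
    batches.push(batch)
  }
  return batches
}

console.log(buildBatches(dependencies))
// -> [ [ 'Tufts', 'Treebank' ],
//      [ 'DisambiguatedHomonym' ],
//      [ 'LemmaTranslations', 'UsageExamples', 'ShortDefs', 'FullDefs' ] ]

// Step 4 of the plan above: run the operations of each batch in parallel,
// and the batches themselves in order. `executors` maps an operation name
// to the client adapter call that performs it (placeholders here).
async function runWorkflow (batches, executors) {
  for (const batch of batches) {
    await Promise.all(batch.map(task => executors[task]()))
  }
}
```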

So for a simpler workflow, we might claim that we need DisambiguatedHomonym and ShortDefs (a homonym with short definitions) and the system will build a graph from there automatically. For more complex cases, we may claim that we need DisambiguatedHomonym, LemmaTranslations, UsageExamples, ShortDefs, and FullDefs (i.e. almost all data that can be available). It will result in a different graph.

If the treebank data does not exist or does not make sense, we can adjust a dependency of DisambiguatedHomonym dynamically by removing a Treebank dependency from there. Or we could still have a Treebank dependency in the chain, but make it return an empty result immediately if there is no Treebank data, if that'll be simpler (just another way to bypass the Treebank dependency).

So for each request type we just need to keep a list of the things we need to get, and that's all. Dead simple. If we introduce some new request type, we just list the new things we need for that request and care about nothing else. It may play well with GraphQL, where we specify what items we need to retrieve in the request itself (i.e. it may contain fields like homonym, shortDefs, fullDefs and so on).

That is my idea in general. What do you think?

balmas commented 3 years ago

About lookups and multiple popups: I think the main reason we want to have multiple popups is to have the ability to compare morphological results side by side. And if that is so, tabs won't be the most useful way to achieve it.

Maybe it would be comfortable to go this way:

  • there would be two modes of the application - single/multiple, defined in settings
  • single - as it works now
  • multiple:

    • there would be several popups (the number could also be limited in settings)
    • there would be one panel divided into two parts (top and bottom), and a targetWord could be selected in each part (if there are only two words selected, maybe we don't need such an option)
    • each popup would have its own pin button - if a popup is pinned, it stays unchanged on any subsequent lookup; otherwise it could be reloaded with a new lookup (if the defined number of simultaneous lookups has been reached)

What do you think about going this way?

I agree we need to be able to "pin" a popup. Still thinking about the panel.

balmas commented 3 years ago

The DAG provides a very helpful way to think about and describe the dependencies between the different data sources at least. Are there some pre-existing libraries that we would use for implementing this?

kirlat commented 3 years ago

Are there some pre-existing libraries that we would use for implementing this?

I came across toposort and batching-toposort. The latter can produce batches of tasks that can run in parallel.
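A rough usage sketch (this assumes batching-toposort takes the DAG as an adjacency list in which each key lists the tasks that depend on it; please check the library's README for the exact input format):

```js
const batchingToposort = require('batching-toposort')

// Edges point from a dependency to its dependents.
const dag = {
  Tufts: ['DisambiguatedHomonym'],
  Treebank: ['DisambiguatedHomonym'],
  DisambiguatedHomonym: ['LemmaTranslations', 'UsageExamples', 'ShortDefs', 'FullDefs'],
  LemmaTranslations: [],
  UsageExamples: [],
  ShortDefs: [],
  FullDefs: []
}

console.log(batchingToposort(dag))
// Expected: [ [ 'Tufts', 'Treebank' ],
//             [ 'DisambiguatedHomonym' ],
//             [ 'LemmaTranslations', 'UsageExamples', 'ShortDefs', 'FullDefs' ] ]
```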

Those libraries are very simple, but I don't think we need anything more sophisticated than that for what we need to do. Being simpler is better, and reduces the overall code size.

I'm sure there are many more libraries out there. I will continue checking them.

balmas commented 3 years ago

[diagram: lemma_morph_workflows]

I think these are the main use cases for the Lemma/Morph Workflows.

Additional requirements to note:

N.B. that these are just the Lemma/Morph workflows. They may feed into the Resource Lookup Workflows but don't include them.

irina060981 commented 3 years ago

That could be true. Could you please describe how you envision the implementation of classes like SomeLatinLookup? Would there be a function inside the class that specifies the sequence of calls: Tufts and the Treebank first, then disambiguation, then lemma translations and word usage examples, then short and full definitions (this is just an arbitrary example of a data retrieval sequence, not an actual workflow)? I just want to be sure that I understand your idea correctly.

I think it could be arranged in the following way. We could define atomic operations with properties, for example: tufts, shortDefs, sync/async and so on (similar to the graph model). Each atomic operation would contain all the logic about itself for all input conditions (and could be expanded later).

These operations would be one type of classes.

The second type would be requests, which contain a bunch of such operations. Each request class would have specific checks (these could be condition checks, or maybe access checks for user data) and an execute method with a full description of the operations workflow.

If we want to expand the requests, it could easily be done by adding operations or requests. Each request is defined in its own class, so it could easily be supported.

That was my idea - I think it is similar to the graph approach, but all requests are statically built, without automatic graph building.

kirlat commented 3 years ago

I think it could be arranged in the following way. We could define atomic operations with properties, for example: tufts, shortDefs, sync/async and so on (similar to the graph model). Each atomic operation would contain all the logic about itself for all input conditions (and could be expanded later).

These operations would be one type of classes.

The second type would be requests, which contain a bunch of such operations. Each request class would have specific checks (these could be condition checks, or maybe access checks for user data) and an execute method with a full description of the operations workflow.

If we want to expand the requests, it could easily be done by adding operations or requests. Each request is defined in its own class, so it could easily be supported.

That was my idea - I think it is similar to the graph approach, but all requests are statically built, without automatic graph building.

I think we're pretty much on the same page here. As you correctly say, the only difference between the "static class" approach and the DAG is whether the execution path is built statically or dynamically. The dynamic approach, in my opinion, has advantages when the execution paths need to be adjusted in real time or when they are too complex, so that it's better to let a library take on the burden of building the path for us. But if the execution paths are relatively simple, static paths will do. So maybe we can start with static paths and then move to dynamic ones if the static approach isn't enough? It would be pretty much the same, except that for dynamic paths we'll need to kick in a topological sorting library to build an execution sequence out of the tasks, while with the static approach we'll build the same sequence manually. @balmas, @irina060981, what do you think?
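As an illustration of the static variant (the adapter names below are placeholders for the existing client adapter calls, not real function names), the execution path is simply written out by hand:

```js
// Static variant: the execution order from the table above, hand-written.
async function latinLookupStatic (targetWord, adapters) {
  // Batch 1: independent sources, in parallel.
  const [tufts, treebank] = await Promise.all([
    adapters.tufts(targetWord),
    adapters.treebank(targetWord)
  ])
  // Batch 2: depends on both of the above.
  const homonym = await adapters.disambiguate(tufts, treebank)
  // Batch 3: everything that needs only the disambiguated homonym, in parallel.
  const [translations, usage, shortDefs, fullDefs] = await Promise.all([
    adapters.lemmaTranslations(homonym),
    adapters.usageExamples(homonym),
    adapters.shortDefs(homonym),
    adapters.fullDefs(homonym)
  ])
  return { homonym, translations, usage, shortDefs, fullDefs }
}
```

The dynamic variant would generate the same sequence from the dependency lists instead of hard-coding it.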

irina060981 commented 3 years ago

I think that starting with the static approach will allow us to see the scope and estimate the complexity. So I vote for this :)

kirlat commented 3 years ago

The use cases are extremely helpful in understanding what needs to be done! We can map the diagrams to the tasks and the execution sequences directly. Do I understand correctly that the retrieval steps for short and full definitions, lemma translations, and usage examples are not shown there? Are the diagrams showing all the steps up to the lexeme construction?

balmas commented 3 years ago

So maybe we can start with static paths and then move to dynamic ones if the static approach isn't enough? It would be pretty much the same, except that for dynamic paths we'll need to kick in a topological sorting library to build an execution sequence out of the tasks, while with the static approach we'll build the same sequence manually. @balmas, @irina060981, what do you think?

This is okay with me.

balmas commented 3 years ago

The use cases are extremely helpful in understanding what needs to be done! We can map the diagrams to the tasks and the execution sequences directly. Do I understand correctly that the retrieval steps for short and full definitions, lemma translations, and usage examples are not shown there? Are the diagrams showing all the steps up to the lexeme construction?

Correct, I haven't gotten to the resource lookup use cases yet. Those are next...

balmas commented 3 years ago

I think we might want to reconsider whether Full Definitions are part of our Lexeme Data Model object. We have included them up until now because we want to prefetch them so that we know whether we can offer the Define button on the popup, but as I look at the new use cases it seems to me that these are really an instance of a more general use case in which we prefetch a related or linked resource for display once the user chooses to go further.

So, for that reason, I have omitted full defs from the use cases to get from word selection to a state of "lexeme ready" in the following diagrams:

[diagram: lexemesready]

(I've updated the diagram above to remove the Refine Search step -- I believe that's a separate use case/workflow)

balmas commented 3 years ago

Here's the diagram for the Retrieve and Disambiguate Short Defs use case:

[diagram: shortdefsretrieveanddisambiguate]

balmas commented 3 years ago

And the Search for Word in Dictionary use cases -- the output from these is a homonym group, because the lookup of the word will return zero or more possible words, each of which might be a homonym with multiple lexemes.

In the remote dictionary use cases, there is a "search type" input, which allows for variations in how the search is done. Options could include, e.g., exact match, beginning, end, containing, etc.
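For example, a hypothetical call signature that makes the parameter concrete (the names are illustrative only):

```js
// Hypothetical remote dictionary lookup; searchType mirrors the options above.
async function searchDictionary ({ word, language, searchType = 'exact' }) {
  // searchType: 'exact' | 'beginning' | 'end' | 'containing'
  // ...call the remote dictionary service and return a homonym group
}
```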

[diagram: searchforwordindictionaryresource]

balmas commented 3 years ago

(N.B. I'm moving the discussion of multiple popups and session history to a new issue #37)