alpheios-project / documentation

Alpheios Developer Documentation

WordList refactoring and approaches to GraphQL #38

Open kirlat opened 3 years ago

kirlat commented 3 years ago

After making some successful tests with Apollo GraphQL and realizing its strong points (the ease of combining local and remote queries, a full-fledged in-memory cache, among other things) and limitations (there is no way to make truly asynchronous requests for local data; see notes about the read method that is used for that), I would like to offer for discussion a concept for a possible WordList refactoring.

One important fact about GraphQL queries (applicable to other queries as well) is that they return "lean" JSON-like objects (i.e. data only, with no methods). We, on the other hand, use "rich" JS objects (full-fledged objects with powerful methods and many auxiliary data items) almost everywhere. Once we receive a lean data object, following our current practices, we would convert it to a JS object. If we would like to update a WordItem in the GraphQL storage, we would have to convert the JS object back to a JSON object.

I'm wondering: what if we were lazy with such conversions? Given a JSON word item object, we could always convert it to a WordItem JS object whenever necessary. Having a plain JSON object has several advantages, in my opinion:

Apollo has powerful caching. It is behind almost every GraphQL request that goes through Apollo. If we were to accept GraphQL as our API for data retrieval, it would, in my opinion, make sense to make full use of the Apollo cache instead of our own solutions (and we have to use caching within many Apollo use cases anyway). Would that be acceptable for us?

If GraphQL proves itself, it might make sense to use GraphQL for many things as a standardized data retrieval and/or data update interface. We could use it to store options, for example, within the options refactoring work. A universal API for storing different types of data may simplify things a lot.

The question with GraphQL is where to put the business logic related to data management (i.e. data merges and transformations). It could be that a GraphQL data provider would return "raw" data, and the requesting object would then be responsible for transforming it into the required form. Or the GraphQL data provider could allow data to be retrieved in many different forms, with the requester specifying, via a GraphQL query, the form in which the data should be obtained. The GraphQL data provider would do the necessary transformations behind its facade and return data formatted according to the needs of the client. I think the latter approach would be more in the spirit of GraphQL, and we should probably use it whenever possible. What do you think?

For the WordListController architectural change, I think it would make sense to create a GraphQL-enabled object that would sit between the WordListController and the UserDataManager. We could call it WordListDataManager or something similar. Instead of keeping all WordItems in the WordListController, they could be stored in the cache of the WordListDataManager. The WordListController would then issue GraphQL requests to retrieve or update individual word items from the WordListDataManager. The WordListController would receive word items as JSON objects and convert them to WordItem JS objects as necessary (hydrate the WordItem object with the word item data). Would an approach like that make sense?
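A minimal sketch of how this intermediary could look. All class and method names here (WordListDataManager, getWordItem, etc.) are hypothetical, not existing Alpheios APIs, and the GraphQL layer is stubbed out with a plain in-memory map:

```javascript
// Hypothetical sketch: a WordListDataManager caches lean JSON word items
// and hands them to the WordListController, which hydrates them on demand.

class WordItem {
  constructor ({ targetWord, languageCode }) {
    this.targetWord = targetWord
    this.languageCode = languageCode
  }

  // A "rich" method that a lean JSON object would not have
  get key () {
    return `${this.languageCode}:${this.targetWord}`
  }
}

class WordListDataManager {
  constructor () {
    this.cache = new Map() // stands in for the Apollo in-memory cache
  }

  // In a real implementation this would resolve a GraphQL query;
  // here it just reads the local cache.
  getWordItem (languageCode, targetWord) {
    return this.cache.get(`${languageCode}:${targetWord}`) || null
  }

  updateWordItem (json) {
    this.cache.set(`${json.languageCode}:${json.targetWord}`, json)
  }
}

class WordListController {
  constructor (dataManager) {
    this.dataManager = dataManager
  }

  // Retrieve lean JSON and hydrate it into a rich WordItem only when needed
  getWordItem (languageCode, targetWord) {
    const json = this.dataManager.getWordItem(languageCode, targetWord)
    return json ? new WordItem(json) : null
  }
}

const manager = new WordListDataManager()
const controller = new WordListController(manager)
manager.updateWordItem({ targetWord: 'lupus', languageCode: 'lat' })
const item = controller.getWordItem('lat', 'lupus')
console.log(item.key) // → "lat:lupus"
```

The point of the sketch is only the division of responsibilities: the data manager owns lean JSON and caching, the controller owns hydration into rich objects.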

@balmas, @irina060981, please let me know what you think about all this. Thanks!

balmas commented 3 years ago

I'm wondering: what if we were lazy with such conversions? Given a JSON word item object, we could always convert it to a WordItem JS object whenever necessary. Having a plain JSON object has several advantages, in my opinion:

I think this makes a lot of sense for the data model objects

The question with GraphQL is where to put the business logic related to data management (i.e. data merges and transformations). It can be that a GraphQL data provider would return a "raw" data and then the requesting object would be responsible to transforming it into the form that is required. Or it could be the GraphQL data provider that would allow to retrieve data in many different forms, and the requester would specify in what form the data should be obtained via a GraphQL query. The GraphQL data provider would do necessary transformations behind its facade and will return data formatted according to the needs of the client. I think the latter approach would be more in the spirit of GraphQL and we probably should use it whenever possible. What do you think?

Yes, I agree and it's one of the reasons motivating the refactoring, so that the client code can easily specify data sources and priorities and let the GraphQL api manage the details. However, we have to think carefully about the API -- i.e. it will need to be flexible enough that user preferences around which data sources to use and which take precedence in a merge/disambiguation scenario can be specified as inputs to the GraphQL query.

For the WordListController architectural change....

Let's talk through this at our check-in on Monday. I would like to understand how the session history fits in here.

irina060981 commented 3 years ago

I'm wondering: what if we were lazy with such conversions? Given a JSON word item object, we could always convert it to a WordItem JS object whenever necessary. Having a plain JSON object has several advantages, in my opinion:

In our current WordList actions we use both lean and rich data. We store data inside local and remote storage in a JSON-like way and convert it to an object with methods while working with it. The reason we use an object-oriented model for the data is to reduce duplicated code for object actions. If we don't have a large amount of duplicated code, we could divide the code: use plain objects for data and static methods on classes.

The main problem with lazy conversions is having to convert both ways each time an object is updated. Currently (for word items) we convert each time we save to local/remote storage.

From my point of view, an object-oriented model with inheritance and methods allows for more abstract code, but it requires two-way conversions. JSON objects would lead to the need to store a reference to a class name and conversion methods.

So either way, conversions would be used often. From my point of view, the following advantages depend more on the way it would be coded than on the technology used to store the data:

* JSON objects take less memory. It's often not that much less, but having many objects in memory may make a difference.
* JSON objects can be serialized and de-serialized easily. They can be used for GraphQL mutations without a need for extra conversion. That might be important if we're planning to rely on GraphQL as our main data API.
* It's much easier to attach reactivity to JSON objects than to more complex ones.
* It's easier to merge two simple JSON objects together. We can use third-party libraries for this (such as [Automerge](https://github.com/automerge/automerge)).
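As an illustration of the last point, merging two lean word-item objects (say, the same word looked up in two different sessions) can be done with plain object and array operations, with no class-specific merge logic. The object shape here is hypothetical:

```javascript
// Hypothetical lean word-item objects for the same word from two sessions.
const local = {
  targetWord: 'placet',
  languageCode: 'lat',
  important: false,
  contexts: ['sic placet', 'mihi placet']
}
const remote = {
  targetWord: 'placet',
  languageCode: 'lat',
  important: true,
  contexts: ['si placet']
}

// A simple merge policy: remote scalar fields win, context lists are unioned.
const merged = {
  ...local,
  ...remote,
  contexts: [...new Set([...local.contexts, ...remote.contexts])]
}

console.log(merged.important)       // → true
console.log(merged.contexts.length) // → 3
```

With rich objects, the same merge would need to know about the class and re-instantiate it afterwards.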

But it is interesting to try a new technology, and GraphQL looks like the next step in abstracting data retrieval. It effectively moves the data retrieval workflow out of the client and onto the server. It could be beneficial for us.

balmas commented 3 years ago

Some links for reference:

Original requirements for the wordlist: https://github.com/alpheios-project/components/issues/304

Architecture design discussion for the wordlist: https://github.com/alpheios-project/documentation/issues/9

balmas commented 3 years ago

As we proceed with this, I don't think we should assume that our current approach to local/remote database storage for the wordlist is something that we want to retain.

Due to the syncing requirements across applications and devices, at the moment the only scenario in which the local indexedDb provides any value is the one where storage to the remote database stops working somehow mid-session without the user knowing about it (but the lookups otherwise continue working -- so not a working-offline scenario). This is an edge case, and it doesn't really justify the use of indexedDb for the wordlist as it currently stands.

That isn't to say we don't want to make better use of indexedDb generally across the application, to improve performance and reduce remote lookups, because I think we do, but that should probably be tackled separately.

balmas commented 3 years ago

Stepping back to look at the overall architecture framed according to the philosophy outlined at https://khalilstemmler.com/articles/client-side-architecture/introduction/ here's what I think it looks like (the core library -- I'm not trying to represent the client applications of alpheios core here)

[Diagram: architecture_3_3_0]

I think that, at least according to this architecture approach, our model and presenter layers are not separate enough, and within the model there isn't much clean separation of concerns between interaction and infrastructure.

balmas commented 3 years ago

Discussion from Slack:

@balmas

I do wonder however if our decision to focus on the wordlist in its current state as the test case for adding GraphQL to our infrastructure might not have been the best choice. I think we have clear need for it in the lexical query, which combines the results of many different microish services (both back-end and front-end) to produce a single composite object. I'm having a harder time understanding its role in the wordlist

@kirlat

I agree with what you're saying about using GraphQL to store results of lexical queries. For me, the lexical query is somewhat connected with the word list. Looking at it from a very high level and abstracting things away, I think we need the following:

  • Some place to store results of lexical queries. We can use GraphQL to store this lexical data.
    • Lexical query retrieves lexical data and puts it into this storage.
  • WordList and SessionHistory keep references to the lexical objects placed in the storage (we can use the same unique IDs that the Apollo cache uses) and, if necessary (in response to user actions), can retrieve lexical data from the storage.

So the lexical query and the word list work with the same storage; the lexical query mutates it and the word list queries it. Also, if a UI component like a popup wants to display a homonym, we can pass it a reference to the homonym object in the GraphQL storage, and the popup could use GraphQL to pull the data out. That would make data flows between app components and modules much simpler, and components would become much more independent. What do you think about this concept?
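A simplified model of the idea (not Apollo's actual API): lexical objects live in one normalized store keyed by unique IDs, the lexical query writes to it, and the word list or a popup only holds references. The ID scheme shown is hypothetical:

```javascript
// Simplified stand-in for a normalized GraphQL cache keyed by unique IDs.
// A real Apollo cache normalizes objects in a similar spirit.
const lexicalStore = new Map()

// The lexical query "mutates" the store...
function storeHomonym (homonym) {
  const id = `Homonym:${homonym.languageCode}:${homonym.targetWord}`
  lexicalStore.set(id, homonym)
  return id
}

// ...while the word list, session history, or a popup only keep the ID
// and query the store when they actually need the data.
function readHomonym (id) {
  return lexicalStore.get(id) || null
}

const ref = storeHomonym({ targetWord: 'lupus', languageCode: 'lat', lexemes: [] })
const wordList = [ref] // the word list stores a reference, not the object
console.log(readHomonym(wordList[0]).targetWord) // → "lupus"
```
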
balmas commented 3 years ago

I agree (1) that the Lexical Query, Wordlist (and the future Session History) are all related in that they make use of domain data (which is retrieved from multiple sources and currently aggregated in the Homonym data model object) and (2) that it does make sense for Apollo Client to provide a Facade API (using GraphQL) to the shared domain data objects

I would like to be a bit more cautious about the jump to Apollo Client for State Management and direct access from the Vue components to the GraphQL storage. Ultimately that may make sense, but I think we have a fair amount of detangling to do first.

balmas commented 3 years ago

WordList and SessionHistory keep references to the lexical objects (we can use the same unique IDs that are used by Apollo cache for that) placed to the storage and, if necessary (as a response to user actions), can retrieve lexical data from the storage.

I think one thing that worries me about this is that we don't really have unique IDs for some of the data we're talking about.

kirlat commented 3 years ago

I think one thing that worries me about this is that we don't really have unique IDs for some of the data we're talking about.

We have them for Lemma, Definition, and TextQuoteSelector, and I think we can add them to more classes using UUIDs. The IDs won't necessarily need to be meaningful; we can use just random strings and be fine with it. What do you think?

balmas commented 3 years ago

We can of course create a UUID for anything, but in some cases, they may not be helpful for reuse of data across sessions or even across word lookups.

For the user wordlist, where we aggregate results for a single word form from whatever context it was looked up in, we are able to use a combination of the language code and word form as a unique identifier for the word.

But the specific lexemes that result from a word lookup can depend upon the context of the word, so we will need a more complex solution.

On the server side, we cache requests based upon http request parameters.

We will need to look at each data object we store and decide when and how it will be reused and develop the identification key accordingly.

kirlat commented 3 years ago

But the specific lexemes that result from a word lookup can depend upon the context of the word, so we will need a more complex solution.

For the lexemes that depend on the context, can we use language + word + hashes of texts of pre and post text selectors?

We will need to look at each data object we store and decide when and how it will be reused and develop the identification key accordingly.

I have a feeling that this is an essential part of a solution. Once we decide it, other components may fall into their places much more easily.

Can we approach this using Domain-Driven Design principles? First we could define events, then commands, and data objects may arise from the intersection of those two (i.e. from the aggregates). Of course, that would require defining domain boundaries beforehand, which may be the hardest part, in my opinion.

balmas commented 3 years ago

For the lexemes that depend on the context, can we use language + word + hashes of texts of pre and post text selectors?

Possibly

Can we approach this using Domain-Driven Design principles? First we could define events, then commands, and data objects may arise from the intersection of those two (i.e. from the aggregates). Of course, that would require defining domain boundaries beforehand, which may be the hardest part, in my opinion.

Yes, I'm working on this now. And you're right, defining the domain boundaries is very difficult.

balmas commented 3 years ago

Ok, the results of my first pass on the domain design for this is at

https://github.com/alpheios-project/documentation/blob/master/development/lex-domain-design.tsv

I've taken some liberties with the approach, by including the business processes as actors, supplying preconditions where appropriate and distinguishing between originating and affected views.

Also, our use of data is a little different from a traditional scenario for this sort of design approach, in that the queries are not just producing views on user-created data. Up until now, except for the user word list, we have been a query-only system when it comes to domain data.

With the introduction of the features that allow users to annotate the results of queries and create their own data, we now have queries populating data objects that may then be "corrected" and saved by users as their own. So what I have done is to model both the Query and the User as potential actors on commands which create data. This may not be a proper application of the domain-driven design approach, but I couldn't find any examples that showed how to apply it to our scenario, and I wanted to reflect the "creation" of the most granular level of the domain model elements from the query data, because that's what's happening in our system -- i.e. we aren't just creating views on already existing data; we're aggregating data objects from the results of queries and then doing further operations on those data objects.

@kirlat and @irina060981 let me know what you think.

kirlat commented 3 years ago

Thank you for the model! It's very helpful in understanding what we do on different levels. I will study it carefully but it already helps to see some things better.

As you've noted (I had not really thought about it before), our queries do not always produce views. What we do is, I think, create new data out of the existing pieces. Once that's done, we display some pieces of the data to the user. So maybe we can make a distinction between these two groups of operations (data synthesis and data display)?

Data synthesis:

  1. Retrieve data pieces from various sources, including remote storages and user inputs (if the user suggested some corrections).
  2. Combine and transform the data obtained to create new data.
  3. Store the new data in memory and, optionally, in local and remote long-term storage.
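The three steps above might be sketched as one pipeline. The source names and object shapes are made up for illustration:

```javascript
// Illustrative data-synthesis pipeline: retrieve, combine, store.

// 1. Retrieve data pieces from various sources (stubbed as sync functions).
const sources = [
  () => ({ lemmas: ['placeo'] }),            // e.g. a morphology service
  () => ({ shortDefs: ['to please'] }),      // e.g. a lexicon service
  () => ({ shortDefs: ['to be agreeable'] }) // e.g. a user correction
]

// 2. Combine and transform the pieces into a new data object.
function synthesize (word, pieces) {
  return pieces.reduce(
    (acc, piece) => ({
      ...acc,
      lemmas: [...acc.lemmas, ...(piece.lemmas || [])],
      shortDefs: [...acc.shortDefs, ...(piece.shortDefs || [])]
    }),
    { targetWord: word, lemmas: [], shortDefs: [] }
  )
}

// 3. Store the result (an in-memory map standing in for the
//    cache / IndexedDB / remote storage options).
const storage = new Map()
const result = synthesize('placet', sources.map((fetch) => fetch()))
storage.set(result.targetWord, result)

console.log(storage.get('placet').shortDefs.length) // → 2
```

The display part of the workflow would then only ever query `storage`, which is what makes the synthesis/display split clean.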

Once the new data is constructed and stored, we can query it and display it to the user as in the traditional scenarios. There is probably nothing special about this part of the workflow.

If the user decides to correct the displayed data, we go through data synthesis again. Once the data modification is done, we use the presentational part of the workflow to display it.

Would a separation like that make sense? Can we consider those two areas as separate domains (we can subdivide them further)? Can we separate those two areas clearly? I think that could make things simpler. What do you think?

balmas commented 3 years ago

I like the "data synthesis" description.

I think we won't really know until we get into the details of the user annotations whether this separation between synthesis and display will be sufficient but I think it is a good place to start.

kirlat commented 3 years ago

The more I read about DDD and other related concepts, the more I feel that the right solution would be to put the lexical query behind a GraphQL facade. What does the user, in most cases, want from our system? A WordItem: an object representing a word within its context, containing the word itself, the context, and several Lexemes with Lemmas, inflections, definitions, translations, etc.

To get a word, the user makes a (conceptual) getWord GraphQL query to the facade, supplying the word itself, the language, the context, and whatever else is necessary. The system behind the GraphQL facade should return the word that was requested. If the word was already looked up within the same context and is stored, the system will return it from the storage, whatever that storage may be: an in-memory cache, an IndexedDB, or a remote service. If the word is not stored yet (which will happen most often), the system may return something like an empty Word object with placeholders and loading/error status props for the items requested (lemmas, inflections, definitions, etc.), in the spirit of Apollo queries:

status: {
  lemmas: {
    loading,
    error
  },
  shortDefs: {
    loading,
    error
  },
  fullDefs: {
    loading,
    error
  },
  ...
}

Those status props would be reactive. When data is loaded, the loading fields would switch from true to false. The client modules interested in the corresponding data would monitor the loading props. Once loading is complete, they would check the error field first. If no errors occurred, they would access the loaded data and display it to the user. If there was an error, they would display error info or handle the error in some other way.

That would effectively make a live WordItem object. A lot of details are missing or might be done differently, but this is a concept of how we could do it in the simplest way possible.
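A bare-bones model of such reactive status props, using a hand-rolled subscription in place of a real reactivity system (Vue or Apollo would provide this in practice; all names are illustrative):

```javascript
// Minimal stand-in for a "live" WordItem: each section carries loading/error
// status plus a subscription hook, mimicking the reactive props described above.
function makeSection () {
  const listeners = []
  return {
    loading: true,
    error: null,
    data: null,
    onChange (fn) { listeners.push(fn) },
    resolve (data, error = null) {
      this.loading = false
      this.error = error
      this.data = data
      listeners.forEach((fn) => fn(this))
    }
  }
}

const word = {
  targetWord: 'placet',
  status: { lemmas: makeSection(), shortDefs: makeSection() }
}

// A client module monitors the status and reacts when loading completes,
// checking the error field first.
let rendered = null
word.status.shortDefs.onChange((section) => {
  rendered = section.error ? 'error' : section.data.join('; ')
})

// Later, when the query behind the facade completes:
word.status.shortDefs.resolve(['to please', 'to seem good'])
console.log(rendered) // → "to please; to seem good"
```
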

Having the word query behind a GraphQL facade would allow the lexical data retrieval to happen client side or server side, fully or partially. Switching between them would require no client code changes at all.

What do you think about it as a concept?

balmas commented 3 years ago

This is generally what I was thinking as well.

kirlat commented 3 years ago

Working on implementing GraphQL functionality made me feel that our data and business logic are too entangled. Adding GraphQL to our current architecture of modules would, instead of making things simpler as GraphQL could, make them even more complex. To avoid this, I think we need to work on optimizing the structure of our application in parallel with the implementation of the GraphQL API.

I came up with the following architecture that, I think, would allow us to separate data from logic and to divide different kinds of logic from each other:

[Diagram: proposed architecture]

It resembles a kind of MVP architecture, with the presentation layer shown at the bottom and the model displayed on top.

I think we would need to separate the application logic, which defines how the application responds to user interactions and lifecycle events, from the domain (business) logic, which defines rules related to the assembly and usage of lexical data. I think we should make the separation not only between the kinds of logic, but between different types of data (state) as well. For example, data about the words queried should belong to the data model layer, but information about user preferences (e.g. how long the mouseover delay is) should belong to the application layer.

The application and data model layers would communicate via a GraphQL API (the data model layer would expose one or several GraphQL endpoints) or via a JS method API (which is simpler to implement and which we could use as a temporary solution; we have something like this already). I think it would be beneficial to switch to the GraphQL API completely, as it would allow us to have some, or, if necessary, all parts of the data model located on remote servers.

Please let me know if this model makes sense. If it does, we can gradually change our modules to fit the model until our architecture matches it completely.

balmas commented 3 years ago

As discussed in today's check-in, we agree generally on this model, although we need to move incrementally towards it, reusing existing code sensibly.