CDICT Service Architecture

kirlat commented 5 years ago

As the time to provide a solution is very limited, I think we should stick to the MVP approach. The simplest solution would be to provide a client-side service that downloads dictionaries and serves them from memory. Later we can add a server-side service to that.

The most important architectural aspects, in my opinion, are:

1. How to download dictionary data effectively. It must work well over a low-quality connection. It must not re-download large amounts of data if the connection is interrupted.
2. Where to place the data that will be used to serve incoming data requests from business objects. We should try to minimize the response time while keeping the memory footprint within the limits of the device.
3. To avoid re-downloading data across browser sessions, we should store it permanently on the user's device, if space allows. What would be the most efficient way to do so?

(1) To download the dictionary file(s) effectively we can:

- Provide the ability to resume a download if the connection is interrupted. If that is not possible, we can split the file(s) into several chunks so that only the chunk that suffered an interruption is re-downloaded.
- Possibly compress the dictionary files to minimize download size.
- The ideal solution would be to use the Background Fetch API, but alas, it is available only in Chrome at the moment: https://medium.com/google-developer-experts/background-fetch-api-get-ready-to-use-it-69cca522cd8f
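
The chunked fallback could be sketched roughly as follows. This is only an illustration under assumptions: the dictionary is pre-split into separate chunk files, and `downloadChunks`, `fetchChunk`, and `maxRetries` are hypothetical names, not existing Alpheios code. The transport is injected so that the sketch does not depend on any particular download API.

```javascript
// Hypothetical sketch: download a dictionary that has been split into chunks,
// retrying only the chunk whose download was interrupted.
// `fetchChunk(url)` is an injected async function (an assumption); in a
// browser it could wrap fetch().
async function downloadChunks (chunkUrls, fetchChunk, { maxRetries = 3, done = new Map() } = {}) {
  for (const url of chunkUrls) {
    if (done.has(url)) continue; // already downloaded in an earlier attempt
    let lastError;
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        done.set(url, await fetchChunk(url));
        lastError = undefined;
        break;
      } catch (err) {
        lastError = err; // connection interrupted: retry this chunk only
      }
    }
    if (lastError) {
      // Give up for now; `partial` keeps the finished chunks so that a later
      // call can pass it back in as `done` and resume where this one stopped.
      const error = new Error(`Failed to download ${url}`);
      error.partial = done;
      throw error;
    }
  }
  return chunkUrls.map(url => done.get(url));
}
```

In a browser, `fetchChunk` might be something like `url => fetch(url).then(r => { if (!r.ok) throw new Error(r.statusText); return r.text(); })`.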

(2) The simplest solution, and probably the most suitable one for an MVP, would be to load the downloaded data into memory. That will take several megabytes of RAM per page, but it would be the fastest way to serve data. To minimize the memory footprint we can store the data in local permanent storage (e.g. IndexedDB) and request it from there. Or we can use a mixed approach: store the dictionary files in IndexedDB but keep the indexes in memory.

For MVP, I think, it will be enough to go with an in-memory approach. We shall download data files, convert them to JS structures, and store them in memory.
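
As a rough illustration of "convert them to JS structures", here is a minimal sketch that assumes the data files use the plain-text CC-CEDICT layout (`Traditional Simplified [pinyin] /gloss/gloss/`, with `#` comment lines). The function name `buildIndex` and the shape of the entry objects are hypothetical.

```javascript
// Assumes the CC-CEDICT text format, one entry per line:
//   Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
const ENTRY_PATTERN = /^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/;

// Builds an in-memory Map from a headword (traditional or simplified form)
// to the list of entries for that headword.
function buildIndex (text) {
  const index = new Map();
  for (const line of text.split(/\r?\n/)) {
    if (line === '' || line.startsWith('#')) continue; // skip blanks and comments
    const match = ENTRY_PATTERN.exec(line);
    if (!match) continue; // ignore lines that do not look like entries
    const [, traditional, simplified, pinyin, glosses] = match;
    const entry = { traditional, simplified, pinyin, definitions: glosses.split('/') };
    // Index under both forms; the Set avoids a duplicate when they are equal.
    for (const key of new Set([traditional, simplified])) {
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(entry);
    }
  }
  return index;
}
```

A lookup is then a single `index.get(word)` call, which is what makes the in-memory approach the fastest option.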

(3) The only viable option for storing data permanently is IndexedDB (which should work just fine for us). To avoid storing several instances of the data (i.e. one per domain) we can use a technique similar to the one employed by Safari authentication:

1. Create a subdomain on the Alpheios website, something like cdict.alpheios.net
2. Add an iframe that opens that Alpheios page within each web page where the webextension or content script is activated.
3. Make the iframe document load a script that downloads a dictionary and stores it in IndexedDB.
4. A script from the main document can request lexical data from the iframe using postMessage(): https://developer.mozilla.org/en-US/docs/Web/API/Window/postMessage

As a result, we will have a single copy of the data in IndexedDB that will serve both the webextension and the embed lib on all pages.

@balmas, @irina060981: what do you think?

irina060981 commented 5 years ago

I didn't quite understand how you suggest retrieving data from the IndexedDB in another domain.

We have the following parts according to the ClientAdapters architecture:

- Lexical Query asks ClientAdapters for data, using the name of the desired adapter and the desired method
- ClientAdapters initiates a request to the adapter using its name and method
- Current Adapter creates the request URL and queries the Service with that URL
- Current Adapter retrieves data from the Service and parses the result into data-model objects
- CurrentAdapter -> ClientAdapters -> Lexical Query returns DataModel objects

If we place all the data in IndexedDB in another domain:

- then all operations that extract data from the IndexedDB would be on the Service side too?
- what would Current Adapter make its calls to, then? Do you suggest making them platform-specific using postMessage? But LexicalQuery doesn't know about the platform (I believe)

I think we should decide

- what data the Service would return,
- then where the operations with IndexedDB would be,
- then what the input/output of the service would be

I think that placing all the operations with IndexedDB on the Service side is a good way to go (from several aspects: it looks logical there, it allows us to work with one IndexedDB instance, as you described, later migration would be much easier, and the client's code would be much lighter and simpler). But I have no idea how it could later be converted to a non-background strategy for the embed lib.

kirlat commented 5 years ago

I decided to create a little diagram to explain the idea better:

There is a target web page (A) where either a webextension's content script (C) is injected or an embed lib code (D) is loaded. In addition to that a content script (C) or an embed lib (D) adds to the target page (A) an iframe (B) that loads a script for handling CEDICT requests (E). Ideally, the script (E) would be injected by either the content script (C) or the embed lib (D) but I'm not sure they'll be able to do it cross-domain. So the realistic option is probably to have the CEDICT script (E) be loaded by the alpheios web page (cedict.alpheios.net, B). Thus, an indexedDB will be created by the CEDICT script (E) for the cedict.alpheios.net domain.

Since every page where webextension or embed lib is activated will have the same iframe (B), the IndexedDB will be shared across pages.

I was not thinking about the exact architectural implementation because I was trying to offer for discussion just a general concept. But I think the communication with the CEDICT script can be done by the specific CEDICT client adapter. That adapter will not be platform specific as postMessage() is supported across all major browsers: https://caniuse.com/#search=postMessage.

Here is how it might look:

1. Lexical Query asks ClientAdapters for lexical data using the CEDICT adapter name
2. The CEDICT client adapter from a web page (A) sends a postMessage() cross-domain request to the CEDICT script (E) that resides in an iframe (B)
3. The CEDICT script (E) retrieves data from memory or from IndexedDB (F)
4. The CEDICT script (E) responds to postMessage() by sending a response across domains to the CEDICT client adapter on the web page (A)
5. Lexical data is passed back to the requester following the standard CurrentAdapter -> ClientAdapters -> Lexical Query path.

Everything works client-side; serving a request involves no interactions across the network. CEDICT data is downloaded by the CEDICT script (E) upon its first initialization and is stored permanently in the IndexedDB (F). If the CEDICT script (E) notices that the CEDICT data is missing or incomplete, it will download the missing parts and store them in the IndexedDB (F).
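
The request handling inside the CEDICT script (E) could be sketched roughly like this. The message shapes (`{ id, type: 'lookup', word }` in, `{ id, ok, entries }` or `{ id, ok, error }` out) and the function names are assumptions made for illustration only; `index` stands for whatever in-memory structure (for example a `Map`) holds the dictionary.

```javascript
// Hypothetical request handler for the script running inside the iframe (B).
// It is a pure function of the request so it can be exercised without a browser.
function createRequestHandler (index) {
  return function handleRequest (request) {
    if (!request || request.type !== 'lookup' || typeof request.word !== 'string') {
      return { id: request && request.id, ok: false, error: 'Unsupported request' };
    }
    // An unknown word yields an empty list; how the client reports that
    // to the user is left open here.
    const entries = index.get(request.word) || [];
    return { id: request.id, ok: true, entries };
  };
}

// Wiring to postMessage: `target` is the iframe's `window` in a browser.
// The reply goes back to the window that sent the request, at its own origin.
function listen (handleRequest, target) {
  target.addEventListener('message', (event) => {
    event.source.postMessage(handleRequest(event.data), event.origin);
  });
}
```

Inside the iframe this would be started with something like `listen(createRequestHandler(index), window)`.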

I don't have proof-of-concept code yet, but I think this might work. A similar scheme (without the IndexedDB, though) is used in the new Auth0 SPA workflow. So if we decide the architecture is worth implementing, I can build a quick proof of concept and then, if it is successful, turn it into a full-scale implementation.

Please let me know what you think. Is the concept clear enough? Do I need to provide any additional explanations?

irina060981 commented 5 years ago

@kirlat, I haven't looked at the Auth0 SPA code, so maybe I don't have enough knowledge about such an approach, but I don't understand how exactly it would work.

postMessage is event-based, while the current clientAdapter implementation is promise-based (lexicalQuery is promise-based too).

That is why we can easily use clientAdapter outside a window environment.

Do you suggest creating a promise-based implementation over postMessage, or changing clientAdapters to be event-based for a specific adapter, and also updating LexicalQuery?

balmas commented 5 years ago

I think the use of iframes and postMessage is very interesting. @irina060981 is correct that it will require some changes/additions to the Client Adapters library to support event based messaging since it is promise based right now, but I don't necessarily think that is a bad thing.

kirlat commented 5 years ago

Since client adapters use an asynchronous, promise-based architecture, this will not require any changes at all, I think. Here is how it might work with the help of a messaging service similar to the one used for communication between content and background scripts in the webextension.

1. A requester (LexicalQuery on behalf of a business component) requests data from the CEDICT Client Adapter.
2. The CEDICT Client Adapter creates a new promise and returns it to the requester.
3. The CEDICT Client Adapter asks the messaging service to send a postMessage() request.
4. The messaging service generates a message ID, records it, and sends it along with the request data to the CEDICT service script via postMessage(). It returns an unresolved promise to the CEDICT Client Adapter. The promise object is stored with the corresponding message ID as its key (e.g. in a map).
5. The CEDICT service script receives the message, retrieves lexical data, and sends it back to the messaging service along with the message ID via postMessage().
6. The messaging service receives the response, finds the message ID, and resolves the corresponding promise with the lexical data.
7. The CEDICT Client Adapter receives the lexical data via the resolved promise.
8. The CEDICT Client Adapter resolves the requester's promise by sending the lexical data to the requester.

This does not require any changes to the Client Adapters or any other architectures (but we might make them if we want to). All we need to do is to:

- Create a new CEDICT Client Adapter, which can be really lightweight.
- Create a messaging service wrapper around the postMessage() communications. We can probably use the existing one that is used for communications with the background script, with minimal changes.

The key here is a messaging service that is an asynchronous, promise-based wrapper around the postMessage communication layer. It isolates postMessage events from the rest of the Client Adapters library.
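
A minimal sketch of such a wrapper, to make the message-ID bookkeeping concrete. The class name `MessagingService`, the response shape (`{ id, ok, entries | error }`), and the timeout are assumptions for illustration and do not describe the existing webextension messaging code. The transport is injected as a `send` function so the same class could sit on top of postMessage() or anything similar.

```javascript
// Hypothetical promise-based wrapper around a postMessage-style transport.
// `send(message)` posts a message; incoming responses are passed to receive().
class MessagingService {
  constructor (send, { timeout = 10000 } = {}) {
    this.send = send;
    this.timeout = timeout;
    this.nextId = 0;
    this.pending = new Map(); // message ID -> { resolve, reject, timer }
  }

  // Sends a request and returns a promise that settles when the matching
  // response arrives (or rejects if none arrives within the timeout).
  request (body) {
    const id = ++this.nextId;
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.pending.delete(id);
        reject(new Error(`Request ${id} timed out`));
      }, this.timeout);
      this.pending.set(id, { resolve, reject, timer });
      this.send({ id, ...body });
    });
  }

  // Finds the pending promise by message ID and settles it.
  receive (response) {
    const entry = response && this.pending.get(response.id);
    if (!entry) return; // not one of ours, or already timed out
    clearTimeout(entry.timer);
    this.pending.delete(response.id);
    if (response.ok) entry.resolve(response.entries);
    else entry.reject(new Error(response.error));
  }
}
```

In a page, `send` could be `msg => iframe.contentWindow.postMessage(msg, iframeOrigin)`, and a `window.addEventListener('message', ...)` handler could call `service.receive(event.data)` after checking that `event.origin` matches the iframe's origin; `iframe` and `iframeOrigin` are placeholders here. The CEDICT Client Adapter would then simply return `service.request({ type: 'lookup', word })`.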

Please let me know if you see any issues in the schema described above.

balmas commented 5 years ago

I think this is worth experimenting with a bit. If it proves to work, it might be a good way to deal with other files, such as the dictionary short definitions and index files, and maybe also the grammars.

irina060981 commented 5 years ago

@kirlat, you have described one half of the promise-based implementation over postMessage. The other half should have rejection conditions. In our case it should return some rejection, I believe, if the looked-up Chinese words are not found in the source index data. Other than that, I don't have any issues with your suggestion.

balmas commented 5 years ago

good point. yes we have to have rejection conditions handled too.

kirlat commented 5 years ago

> In our case it should return some rejection, I believe, if the looked-up Chinese words are not found in the source index data.

It's all very subjective and a matter of agreement, but in my opinion a rejection should mean that we have an error condition that disrupts the normal app flow: the service is down, the DB is not available, and so on.

If there is no data for a given word, that's nothing out of the ordinary, and in that case we should probably just resolve the promise with an empty object. Of course, if we expect to have all the words in our dictionary and then do not find one, that could be treated as an error too. But I'm not sure we should expect every word to be found. @balmas, @irina060981, what's your opinion on that?

balmas commented 5 years ago

We do need to be able to differentiate these scenarios:

- Error because the service is down or the DB is not available
- Word not found

I think right now in the UI we treat both of these the same but they are distinct error conditions.

kirlat commented 5 years ago

So word-not-found is more like an error condition for us? In that case we can create several different error classes that extend the core Error and reject with the appropriate class depending on the type of error (service down, DB not available, word not found). A client can use instanceof to check which type of error it is.
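
Something along these lines, as a sketch only; the class names and the `describeFailure` helper are hypothetical and just show the instanceof check:

```javascript
// Hypothetical error hierarchy; a shared base class lets a client tell
// "any CEDICT service failure" apart from unrelated errors.
class CedictError extends Error {
  constructor (message) {
    super(message);
    this.name = this.constructor.name;
  }
}

class ServiceUnavailableError extends CedictError {}
class DatabaseUnavailableError extends CedictError {}

class WordNotFoundError extends CedictError {
  constructor (word) {
    super(`Word not found: ${word}`);
    this.word = word;
  }
}

// Example of client-side handling: the UI can treat "word not found"
// differently from a failure of the service itself.
function describeFailure (error) {
  if (error instanceof WordNotFoundError) return 'not-found';
  if (error instanceof CedictError) return 'service-error';
  return 'unexpected';
}
```

A client adapter could then use `promise.catch(error => describeFailure(error))` to decide what to show to the user.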

balmas commented 4 years ago

implemented in release 3.3.0

alpheios-project / documentation

CDICT Service Architecture #16