alpheios-project / documentation

Alpheios Developer Documentation

Porting support for Chinese #15

Closed balmas closed 4 years ago

balmas commented 4 years ago

We would like to port the prototype Alpheios 1.0 support for Chinese using the CDICT dictionary to the Alpheios 3.0 code. This issue is to discuss the design choices.

The prototype 1.0 code is at https://github.com/alpheios-project/ff-extension-chinese

https://github.com/alpheios-project/ff-extension-chinese/blob/master/content/alpheios-chinese-langtool.js may be useful in constructing the new Chinese language model class.

https://github.com/alpheios-project/ff-extension-chinese/blob/master/content/alpheios-chinese-dict.js shows how we handled the lookup in the Chinese language resources.

We need to design:

  1. Chinese Language Model class in data-models
  2. Client adapter(s) for the language resources
  3. Possibly a service to serve the language resources
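For item 1, a minimal sketch of what a Chinese language model class might look like. This is purely illustrative: the base class shape and every name here (`LanguageModel`, `baseUnit`, `scripts`, the language codes) are assumptions, not the actual data-models API.

```javascript
// Hypothetical stand-in for the base class in the data-models library
class LanguageModel {
  static get direction () { return 'ltr' }
  static get codes () { return [] }
  static hasCode (code) { return this.codes.includes(code) }
}

class ChineseLanguageModel extends LanguageModel {
  static get languageCode () { return 'zho' }
  static get codes () { return ['zho', 'zh', 'chi'] }
  // Chinese has no inter-word spaces, so selection must be character-based
  static get baseUnit () { return 'character' }
  // CDICT entries carry both scripts, so both must be matchable
  static get scripts () { return ['traditional', 'simplified'] }
}

console.log(ChineseLanguageModel.hasCode('zh')) // true
```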

The biggest design decision we need to make is how to serve the language resources.

The CDICT data provides the following:

In the 1.0 architecture, we searched the data files and constructed our "Alpheios" data object all client-side, in the language model classes.

In the 3.0 architecture, we have a stricter separation:

  * the language class in the data-models library describes the features and capabilities of the language, including text direction, selection context, encoding conversions, etc.

  * the morphological information, including some short definitions, is retrieved from services and converted to Alpheios data model objects via the morphology client adapter

  * definitions are retrieved from services via the lexicon client adapter and added to the Alpheios data model object

Our Chinese data is a little different from that of the other languages in that we don't currently have a server-side service that takes a word and produces a response. So we could either build a server-side service to do this (implementing the lookup business logic from https://github.com/alpheios-project/ff-extension-chinese/blob/master/content/alpheios-chinese-dict.js), or we could keep the data lookup client-side, loading the data files into the browser as we did in 1.0, and build a client adapter that retrieves from memory rather than from a remote service.
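The appeal of the client-side option is that an in-memory adapter can expose the same call signature a remote adapter would, so the rest of the pipeline doesn't care where the data lives. A rough sketch, with the class name and entry shape invented for illustration:

```javascript
// Hypothetical in-memory adapter; the real client-adapter interface
// in the 3.0 code may differ.
class InMemoryChineseAdapter {
  constructor (entries) {
    // entries: Map of headword -> { pinyin, definition }
    this.entries = entries
  }

  // Async, like a remote adapter would be, so callers are agnostic
  // about whether the lookup hits memory or the network.
  async getEntry (word) {
    return this.entries.get(word) || null
  }
}

const dict = new Map([
  ['中国', { pinyin: 'Zhōngguó', definition: 'China' }]
])
const adapter = new InMemoryChineseAdapter(dict)
adapter.getEntry('中国').then(e => console.log(e.definition)) // "China"
```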

Additional differences from the other languages worth noting:

  1. the lookup algorithm must take context (i.e. surrounding words) into account because we may need to combine the selected character with the one before or after it to get the correct meaning

  2. the lookup action may need a different trigger action -- in 1.0 we used mouseover rather than double-click for Chinese, because it was more ergonomic

  3. the lookup action needs to work with both traditional and simplified character sets

  4. the selection action must use character-based word separation rather than space-separated
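Point 1 above can be sketched as a longest-match-first lookup: try the selected character combined with its neighbors before falling back to the character alone. This is only an illustration of the idea, not the actual CDICT lookup logic; the function name and dictionary shape are assumptions.

```javascript
// text: the surrounding text; i: index of the selected character
function lookupWithContext (dict, text, i) {
  const candidates = [
    text.slice(i, i + 2),      // selected + following character
    text.slice(i - 1, i + 1),  // preceding + selected character
    text[i]                    // the selected character alone
  ]
  for (const c of candidates) {
    if (c && dict.has(c)) return { word: c, entry: dict.get(c) }
  }
  return null
}

const dict = new Map([
  ['中国', 'China'],
  ['中', 'middle']
])
// Prefers the two-character word 中国 over the single character 中
console.log(lookupWithContext(dict, '我爱中国', 2).word) // "中国"
```

A fuller version would also normalize between traditional and simplified forms (point 3) before matching.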

balmas commented 4 years ago

@kirlat and @irina060981 I would like you both to take a look at the 1.0 code and design questions referenced above and let me know your thoughts on the data lookup design. Thanks!

kirlat commented 4 years ago

There are different ways we can handle language resources. We probably do need to select the one we will use for Chinese; however, a flexible architecture allows us to provide several non-conflicting ways to do that. We can use different configurations for different apps (webextension and embed-lib) and for different target audiences (i.e. we can have a build targeting those interested particularly in Chinese). A flexible architecture will allow us to avoid code fragmentation: setting up a different build configuration might be a matter of changing just a few settings.

Here are some different thoughts on what we will need to do.

Bundling the Chinese dictionary into the webextension is the simplest approach. It will also work at the speed of light because all resources will be in memory. However, this will increase the size of the webextension bundle significantly. In Alpheios V1 the Chinese language resources take 10+ MB. Adding them to the current build will increase its size at least fivefold. Those who are not interested in Chinese would still need to pull the whole data volume.

Another approach could be to build the webextension without Chinese language data and provide this data as a separate extension (i.e. have a Chinese language resources webextension). Only those who need Chinese will install it. When both extensions are activated, the Alpheios webextension will recognize the presence of the Chinese data and will use it locally. This is flexible enough, at the cost of increased complexity (we'll have to manage two webextensions instead of one, and both webextensions must be compatible with each other).

One more way to go is to load on demand. Once Chinese is selected in options, the webextension will start loading the Chinese data files. Those will be placed into IndexedDB. This approach is very flexible, with two obvious drawbacks. The first is that a user initially has to wait until data loading is complete. The second is that IndexedDB storage can be purged if there is not enough space on the device, and the data then has to be downloaded again. We will also have to maintain a service that serves the data file (this should not present any real challenge though).
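The load-on-demand flow described above is essentially cache-or-fetch. A minimal sketch of that logic, with a `Map` standing in for IndexedDB so the sketch stays self-contained; the function names and record shape are assumptions:

```javascript
// Stand-in for IndexedDB; the real code would use the IndexedDB API
// and could be purged by the browser, triggering a re-download.
const localStore = new Map()

async function fetchDataFile () {
  // In the real extension this would fetch the data file from our
  // service; stubbed here so the sketch is runnable anywhere.
  return { '中国': 'China' }
}

async function getChineseData () {
  if (!localStore.has('cdict')) {
    // First use: the user waits for the download, then it is cached
    localStore.set('cdict', await fetchDataFile())
  }
  return localStore.get('cdict')
}

getChineseData().then(d => console.log(d['中国'])) // "China"
```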

The last approach I can think of is the one we're using for Latin and Greek, where data is pulled from several remote services (a microservices architecture). This would be the most complex approach, as it will require us to create and maintain fairly complex remote language services for Chinese. On the plus side, it is probably the most flexible approach.

Our current architecture, maybe with slight modifications, can accommodate all of these approaches, I think.

This is how, in my opinion, it all could look (the text might be too small to read when embedded in the post; clicking on the image to open it can make it better): [image]

The language layer contains the knowledge about the language, both by having it within the object (i.e. knowing what features the language has) AND by knowing how to obtain full language data (that's what we don't have there now). To achieve the latter we can pair a language data model with the proper language adapter that will know where to obtain the data, whether from a local data file or a remote service.

The next layer can be a language data layer. It will contain the knowledge of where the actual data is located (locally or remotely), how it is structured (in one or several data files, or as a remote service), what format it is in, and how to convert it to the format that the language layer and business logic components will understand (i.e. convert it into the form of objects like Lexeme, Inflection, etc.)
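The pairing of a language model with a configured data adapter could be sketched roughly as follows. All names here (`createDataAdapter`, the adapter classes, the `dataSource` option) are made up for illustration; the point is only that the caller asks for a Lexeme and never learns where the data came from.

```javascript
// Two interchangeable data adapters behind one interface
class LocalDataAdapter {
  async getLexeme (word) { return { word, source: 'local file' } }
}
class RemoteServiceAdapter {
  async getLexeme (word) { return { word, source: 'remote service' } }
}

// The language model would read this from its config options
function createDataAdapter (config) {
  switch (config.dataSource) {
    case 'local': return new LocalDataAdapter()
    case 'remote': return new RemoteServiceAdapter()
    default: throw new Error(`Unknown data source: ${config.dataSource}`)
  }
}

createDataAdapter({ dataSource: 'local' })
  .getLexeme('中国')
  .then(l => console.log(l.source)) // "local file"
```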

I think it will require the following changes (not major, in my opinion):

This way all info about how we store and obtain language data will be isolated within the language model and (mostly) within the data adapters: language datasets (these could be called local data adapters in this approach) and client adapters (i.e. the remote data adapters).

This is a very rough description, but I hope it will be enough to convey the idea. If we decide it might be workable, we can develop it further.

Please let me know what you think.

balmas commented 4 years ago

thank you @kirlat for your thoughtful response!

> Another approach could be to build the webextension without Chinese language data and provide this data as a separate extension (i.e. have a Chinese language resources webextension). Only those who need Chinese will install it. When both extensions are activated, the Alpheios webextension will recognize the presence of the Chinese data and will use it locally. This is flexible enough, at the cost of increased complexity (we'll have to manage two webextensions instead of one, and both webextensions must be compatible with each other).

Interestingly, this is the approach we took with Alpheios 1.0, where we had a base extension that contained the common/shared functionality, and separate extensions for each language. We then created collection bundles to facilitate installation. We had this grand vision that lots of people would want to create separate language extensions that could plug in to Alpheios :-)

balmas commented 4 years ago

I would like to think a bit about your suggestions. More soon.

irina060981 commented 4 years ago

I think that @kirlat suggested an interesting approach that would solve a lot of tasks and add several new features! From my point of view, the main priorities should be defined first:

  1. Increase the webextension/embed-lib speed and the ability to handle more one-time users. In that case we should go the "API way": move services to the server side, optimize them, and make the extension lighter.

  2. Make the extension more flexible and more easily customizable by developers. In that case we should create different ways of rearranging and adding new features.

About Chinese: as I understand it, for now it is only one part of the services and it could be developed further in the future. So from my point of view it would be easier to develop and optimize it as a server-side service. About the lookup: if I understood correctly, a word couldn't be looked up from the lookup panel, because it needs context. So we would need to adjust the lookup service. We should also update the creation of HTMLSelector and TextSelector, since Chinese is not space-based and common browser techniques couldn't be used (but I am not sure).

About uploading to IndexedDB: for now we continue to use IndexedDB from the content part (not the background), so it is domain-dependent. If we want to store this data in IndexedDB, we should first move IndexedDB handling to the background script for the extension (though not for the embed-lib).

I think that @kirlat's approach is a good one to start with if we want to give users and developers different ways to handle data. And there are different practices regarding where to put calculations: on the client or on the server side. I like client-server architecture, but I think we should proceed from requirements and purposes.

balmas commented 4 years ago

> A lexical query shall be initiated by the language model. I.e. we will ask the language model to give us a lexeme, or other things such as inflections.
>
> A language model will have config options that will specify what data source to use. According to those options, a language model will load a proper language data adapter that, in turn, will retrieve the data.

Regardless of how we decide to implement Chinese, both of these are desirable architectural improvements. There are a few open issues for the other languages that would benefit from this design. See for example alpheios-project/components#835, alpheios-project/components#834, and alpheios-project/components#269

balmas commented 4 years ago

> About the lookup - if I understood right, it couldn't be looked up from the lookup panel, because it should have context. So we would need to optimize the lookup service.

The lookup panel can still be used, and in this case we would assume that whatever the user entered was the entire word. The issue with context is that in Chinese some words are made up of more than one character (usually not more than two), and when a user selects a character on the page, we don't know immediately whether it is a single-character or a two-character word. But if it is entered in the lookup panel, it is probably an entire word, regardless of how many characters it has.
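The distinction above could be expressed as two entry points over the same dictionary: panel input is taken as a complete word, while a page selection also considers the neighboring character. A hypothetical sketch (function names and dictionary shape are assumptions):

```javascript
// Panel entry point: the typed input is assumed to be the whole word,
// however many characters it has.
function lookupFromPanel (dict, input) {
  return dict.get(input) || null
}

// Page-selection entry point: the selected character may be half of a
// two-character word, so try the combined form first.
function lookupFromSelection (dict, text, i) {
  return dict.get(text.slice(i, i + 2)) || dict.get(text[i]) || null
}

const dict = new Map([['中国', 'China'], ['中', 'middle']])
console.log(lookupFromPanel(dict, '中国'))            // "China"
console.log(lookupFromSelection(dict, '我爱中国', 2)) // "China"
```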

balmas commented 4 years ago

> About uploading to IndexedDB: for now we continue to use IndexedDB from the content part (not the background), so it is domain-dependent. If we want to store this data in IndexedDB, we should first move IndexedDB handling to the background script for the extension (though not for the embed-lib).

Per discussion in today's check-in, we probably need a general design for moving things that can be shared across tabs into the background script, and this needs to take into account what the alternative would be for the embedded library. See relevant issue alpheios-project/webextension#117

balmas commented 4 years ago

Per discussion in today's check-in, we want to proceed to try to have a prototype ready by the week of November 11.

The approach is to start from the client side, using a mock for the service response, while we think a bit more about the approach to the actual serving of the dictionary data.

I think we should proceed as follows:

I think a good division of responsibilities might be for @irina060981 to get started on the above tasks, while @kirlat takes a look at the business logic behind the use of the CDICT dictionary in the Alpheios 1.0 code (https://github.com/alpheios-project/ff-extension-chinese/blob/master/content/alpheios-chinese-dict.js) and begins to work on turning that into a service that can serve the dictionary data. In working through that, it would be good to think about whether it would be possible to code it in such a way that it could be served either as a "local service" packaged in the webextension or as a remote service.

Let me know what you think, @kirlat and @irina060981. Thanks!

balmas commented 4 years ago

one other note -- we will prioritize prototyping in the webextension over the embed-lib for now.

irina060981 commented 4 years ago

I have started to work on it, beginning with ChineseLanguageModel.

kirlat commented 4 years ago

Will start working on the business logic behind the use of the CDICT. Please ignore my previous edit as I was somehow looking at the stale version of the issue thread 🙂

balmas commented 4 years ago

implemented in release 3.3.0