foss-ag / lib_unwort

A rule set-based language correction library.

Dictionaries and Modularity #3

Open Scosh opened 6 years ago

Scosh commented 6 years ago

@LordSentox and I have been discussing the basic design of the dictionary files a bit.

We need to find good base dictionaries to use as a starting off point for all the languages we eventually want to spellcheck.

These dictionaries should be consistent across said languages with regard to their 'scope', or rather their linguistic 'register'.

They may later be extended with extra or alternative dictionaries that expand or change the scope of the language(s) currently being checked. The examples we came up with for this were:

  1. A dictionary that has some extra entries for terminology from a scientific discipline. E.g. linguistics or computer science.
  2. A dictionary that contains character names and other invented words for a specific fictional universe. E.g. Star Trek.

It needs to be decided how exactly the dictionaries will be segmented into different, modular pieces and how those pieces will be used to perform the spellchecks. That's what can be discussed here, along with other related difficulties pertaining to the dictionary files.

lxndio commented 6 years ago

I find it very important to have dictionary files that can be constructed in a modular manner, so that they can be extended if more specific parts of a language need to be checked.

However, as I stated earlier, it is very important to keep the files as abstract as possible so that support can be extended to languages that are very different from the ones we are used to (e.g. Chinese).

I agree with your proposal to allow adding different "packages" of words that cover different areas of a language.

I suggest having a main file for each language that defines the general rules Unwort will use to do its work. The additional packages may then include different rules for specific words that do not follow the general ones defined in the main file.
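A minimal Python sketch of how layered packages might be looked up, assuming each package is simply a mapping from word to word type. The package contents and the `build_lexicon` helper are hypothetical and only illustrate the idea that later packages extend or override the base dictionary:

```python
from collections import ChainMap

# Hypothetical in-memory packages: word -> word type.
basic = {"Mensch": "noun", "gehen": "verb", "schnell": "adjective"}
star_trek = {"Tribble": "noun", "Klingone": "noun"}


def build_lexicon(*packages):
    """Layer packages so that later ones take precedence over earlier ones."""
    # ChainMap searches its maps from left to right, so the order is reversed.
    return ChainMap(*reversed(packages))


lexicon = build_lexicon(basic, star_trek)
print("Tribble" in lexicon)   # word known only through the extra package
print(lexicon["gehen"])       # word from the base dictionary
```

Whether packages should be able to override base entries, or only add to them, is a design decision this sketch does not settle.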

lxndio commented 6 years ago

After taking some time to think about the configuration files, I believe the best approach may be to have a main configuration file which then references the other files Unwort requires to work with a given language.

This main configuration file could look like this (although we haven't yet settled on using YAML):

# Main Unwort language definition file for the German language

lang: German
lang-code: de_DE
version: 0.1
alphabet:
  - latin
  - german_specialchars
definitions:
  - capitalization
  - word_composition
dictionaries:
  - basic
  - science_computer-sciences
  - science_linguistics
  - television_star-trek
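To illustrate how such a file could be resolved into concrete dictionary files, here is a rough Python sketch. It hand-parses only the flat `key: value` and `- item` subset used in the example (a real implementation would more likely use a proper YAML library), and the directory layout and the `.dict` extension are pure assumptions:

```python
from pathlib import PurePosixPath

EXAMPLE = """\
# Main Unwort language definition file for the German language
lang: German
lang-code: de_DE
version: 0.1
alphabet:
  - latin
  - german_specialchars
dictionaries:
  - basic
  - television_star-trek
"""


def parse_language_file(text):
    """Parse the flat 'key: value' / '- item' subset used in the example."""
    config = {}
    current_list = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("- "):
            if current_list is None:
                raise ValueError(f"list item outside of a list: {raw!r}")
            current_list.append(line[2:].strip())
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if value:
            config[key] = value  # scalar values are kept as strings
            current_list = None
        else:
            current_list = config.setdefault(key, [])  # start of a list
    return config


def dictionary_paths(config, root="lang"):
    """Map dictionary names to files under a hypothetical directory layout."""
    base = PurePosixPath(root) / config["lang-code"] / "dictionaries"
    return [base / f"{name}.dict" for name in config.get("dictionaries", [])]


config = parse_language_file(EXAMPLE)
for path in dictionary_paths(config):
    print(path)
```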

As one can see, I propose that we separate the required files into what I call definitions and dictionaries.

A dictionary file could look like this:

word_mensch:
  word: 'Mensch'
  type: noun
word_gehen:
  word: 'gehen'
  type: verb
word_schnell:
  word: 'schnell'
  type: adjective

Because the type of each word is defined, we could, for example, define that in German each word of the type noun is written with a capitalized first letter. This way we would also be able to define other parameters for each word, for example whether a word can be composed into a new one, or whether and where it can be hyphenated (by defining the syllables of the word). This information can then be used in a definitions file.
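As a rough illustration of how a definition could use the type information, here is a small Python sketch. The in-memory structure mirrors the example dictionary file, while `build_index` and `check_capitalization` are hypothetical names. It deliberately ignores sentence-initial capitalization and nominalized verbs:

```python
# Entries mirror the example dictionary file: key -> {word, type}.
DICTIONARY = {
    "word_mensch": {"word": "Mensch", "type": "noun"},
    "word_gehen": {"word": "gehen", "type": "verb"},
    "word_schnell": {"word": "schnell", "type": "adjective"},
}


def build_index(dictionary):
    """Index entries by lowercased word for case-insensitive lookup."""
    return {entry["word"].lower(): entry for entry in dictionary.values()}


def check_capitalization(token, index):
    """Return True/False for known words and None for unknown ones."""
    entry = index.get(token.lower())
    if entry is None:
        return None
    if entry["type"] == "noun":
        return token[:1].isupper()  # nouns must start with a capital letter
    return token[:1].islower()      # simplification: everything else lowercase


INDEX = build_index(DICTIONARY)
print(check_capitalization("mensch", INDEX))
```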

Let me know what you think.

lxndio commented 6 years ago

The following is especially interesting for @Scosh:

As I discussed with @LordSentox today, it is necessary to make the dictionary files a bit less bloated. That means the example shown above will be shortened to:

Mensch,n,m;
gehen,v,#;
schnell,a,#;

In this case, the syntax for a word is as follows:

word,type,genus;

If the genus does not need to be specified (e.g. for verbs or adjectives) or if it is unknown, the hash sign is used. The same goes for the word type: if it is unknown for any reason, this is signaled with the hash sign.

Edit: The use of the hash sign as a placeholder is not yet settled and may change if desired.
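A minimal Python sketch of a parser for this compact format, assuming records are terminated by `;`, fields are separated by `,`, and `#` marks an unspecified or unknown value. The `Entry` type and `parse_entries` function are hypothetical names, not part of Unwort:

```python
from typing import NamedTuple, Optional

PLACEHOLDER = "#"  # not yet settled, may change


class Entry(NamedTuple):
    word: str
    word_type: Optional[str]  # None if unknown
    genus: Optional[str]      # None if unknown or not applicable


def parse_entries(text):
    """Parse 'word,type,genus;' records into Entry tuples."""
    entries = []
    for record in text.split(";"):
        record = record.strip()
        if not record:
            continue  # skip whitespace after the final ';'
        fields = [field.strip() for field in record.split(",")]
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {record!r}")
        word, word_type, genus = fields
        entries.append(Entry(
            word,
            None if word_type == PLACEHOLDER else word_type,
            None if genus == PLACEHOLDER else genus,
        ))
    return entries


SAMPLE = "Mensch,n,m;\ngehen,v,#;\nschnell,a,#;\n"
for entry in parse_entries(SAMPLE):
    print(entry)
```

Note that this format cannot represent words containing `,` or `;` without some escaping rule, which would have to be decided separately.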

Scosh commented 6 years ago

> As I discussed with @LordSentox today, it is necessary to make the dictionary files a bit less bloated. That means the example shown above will be shortened to:

Alright, that seems reasonable. Are you guys still thinking about making a configuration file similar to the one @lxndio proposed in the comment before this one?

> word,type,genus;

We probably need to be clear from the get-go about what we mean by "word type", how many types we want or need, and how to abbreviate them efficiently. In linguistics, the "type" of a word is commonly referred to as the word class or part of speech (POS). Python's NLTK, for example, lets you assign POS tags according to the Penn Treebank definition. [1]

This is a notoriously convoluted but, as a result, quite exhaustive and precise set of tags. There are many much simpler tag sets that are usually sufficient, though.

Plus, Penn is optimised for English, and as you can see in the list, there's a lot of extra grammatical information (like plural/singular) embedded into the tags, which is actually what we want to avoid.

So what I'm saying is: let's not use Penn specifically, but let's use a suitable and well-defined list of German POS tags to imprint "type" information into these dictionary files. I might be able to find a good existing tag-set for us to use verbatim or modify. It could otherwise prove surprisingly difficult to come up with every conceivable word-type in German by ourselves.

> If the genus does not need to be specified (e.g. for verbs or adjectives) or if it is unknown, the hash sign is used. The same goes for the word type: if it is unknown for any reason, this is signaled with the hash sign.

Yes, I think I like this idea and its simplicity.

References Overview

  1. Alphabetical list of part-of-speech tags used in the Penn Treebank Project

Scosh commented 6 years ago

> Because the type of each word is defined, we could for example define that in the German language each word of the type noun is written with a capitalized first letter.

We need to be careful not to oversimplify German capitalisation with a rule like this, because it isn't actually binary on the word-level.

All nouns in German must be capitalised. This is always true and seems simple enough. But what about when specific words are used as capitalised pronouns to refer to a person or a group previously established in a text? That's only one example (namely, D 76) of many cited in this Duden entry on the topic. [1]

That rule, for instance, would imply that both jemand and Jemand (and all their grammatical derivations) are acceptable.

Technically, each version is only acceptable in its own specific context. But since we understandably don't want to complicate things with sentence-level analyses just yet, our rule needs to, for now at least, allow for a group of words that are deemed correct in their capitalised as well as their lowercase forms.
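A small Python sketch of what such an exception group could look like, assuming a hypothetical `EITHER_CASE` set and the one-letter type `n` for nouns. Only the base form jemand is listed here; inflected forms would have to be added or derived separately:

```python
# Hypothetical group of words accepted in both capitalised and lowercase form.
EITHER_CASE = frozenset({"jemand"})


def capitalization_ok(token, word_type):
    """Word-level check only; context is deliberately ignored for now."""
    if token.lower() in EITHER_CASE:
        return True                 # both forms are accepted
    if word_type == "n":
        return token[:1].isupper()  # nouns must be capitalised
    return token[:1].islower()      # simplification for everything else


print(capitalization_ok("Jemand", "#"))
print(capitalization_ok("mensch", "n"))
```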

As clever as you are, I don't know if you've both already thought of this and I am just needlessly blabbering on about nothing here … But I definitely wanted to mention it so it isn't generally forgotten.

References Overview

  1. Duden: Groß- und Kleinschreibung

lxndio commented 6 years ago

> Alright, that seems reasonable. Are you guys still thinking about making a configuration file similar to the one @lxndio proposed in the comment before this one?

I guess you mean the following file.

# Main Unwort language definition file for the German language

lang: German
lang-code: de_DE
version: 0.1
alphabet:
...
definitions:
...
dictionaries:
...

Although I think it would be really nice to have a file like this to define a language and link to the other files that are important for working with it, we haven't yet discussed whether such a file will be used. Perhaps @LordSentox has an opinion on this topic that he wants to share with us.

> So what I'm saying is: let's not use Penn specifically, but let's use a suitable and well-defined list of German POS tags to imprint "type" information into these dictionary files. I might be able to find a good existing tag-set for us to use verbatim or modify. It could otherwise prove surprisingly difficult to come up with every conceivable word-type in German by ourselves.

If you can find such a list, we could save ourselves a lot of hassle. I did a quick search and found the STTS [1] online (also take a look at the other sources linked below). It seems pretty complete but also very complex. I don't know how many of its features we would actually use, but we could probably just take the tags we need from it. I'm open to any suggestions, as I am not a linguistics expert.
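Taking only the tags we need could look roughly like the following Python sketch. The STTS tags listed are real, but the selection and their reduction to the one-letter types `n`, `a` and `v` from the compact dictionary format are assumptions for illustration:

```python
# Subset of STTS tags mapped to coarse one-letter types (mapping is an assumption).
STTS_TO_TYPE = {
    "NN": "n",     # common noun
    "NE": "n",     # proper noun
    "ADJA": "a",   # attributive adjective
    "ADJD": "a",   # adverbial or predicative adjective
    "VVFIN": "v",  # finite full verb
    "VVINF": "v",  # infinitive of a full verb
    "VVPP": "v",   # past participle of a full verb
}

UNKNOWN = "#"  # placeholder for unknown word types


def coarse_type(stts_tag):
    """Reduce an STTS tag to a coarse type, or '#' if it is not covered."""
    return STTS_TO_TYPE.get(stts_tag, UNKNOWN)


print(coarse_type("NN"))
print(coarse_type("VVINF"))
```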

> We need to be careful not to oversimplify German capitalisation with a rule like this, because it isn't actually binary on the word-level.

> As clever as you are, I don't know if you've both already thought of this and I am just needlessly blabbering on about nothing here … But I definitely wanted to mention it so it isn't generally forgotten.

Of course, I thought about all German capitalization rules, but I wanted to settle on one rule as an example of how the configuration files could work. Implementing each and every rule listed in the Duden [2] will probably take more time than it is worth.

References Overview

  1. Universität Stuttgart: Stuttgart-Tübingen-TagSet Table (1995/1999)
  2. Duden: Groß- und Kleinschreibung

More about the STTS.