domschrei / krunner-symbols

A lightweight KRunner plugin (Plasma 5) to retrieve unicode symbols, or any other string, based on a corresponding keyword.

GNU General Public License v3.0

full unicode support #4

Closed Thomqa closed 6 years ago

Thomqa commented 7 years ago

With aliases; search patterns in the middle of words should also match.

domschrei commented 7 years ago

I like the idea of supporting unicode, and I see the following challenges:

The complete set of Unicode characters is very large (the current version includes 128,172 characters). Does anyone know of any databases, structured files, or applications that provide a complete list including the corresponding descriptions?
The krunner plugin should not search the entire Unicode domain every time a character is entered into krunner: I consider this far too heavy, and I think it would produce a large number of "false positive" results for many krunner use cases. A keyword or character in front of the actual search word or pattern could work, to explicitly tell krunner that you are looking for a Unicode symbol. (Just an example: enter unic:box and you would get all Unicode symbols with "box" in their description.)

Any thoughts?
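
To make the keyword idea concrete, here is a minimal Python sketch (not the plugin's actual code) of how a query could be checked for an opt-in prefix. The prefix "unic:" comes from the example above; the function name extract_pattern is made up for illustration.

```python
from typing import Optional

# Hypothetical trigger keyword, taken from the "unic:box" example.
PREFIX = "unic:"


def extract_pattern(query: str, prefix: str = PREFIX) -> Optional[str]:
    """Return the search pattern if the query opts in to Unicode search, else None."""
    stripped = query.strip()
    # Compare case-insensitively so "UNIC:box" also triggers the search.
    if not stripped.lower().startswith(prefix):
        return None
    pattern = stripped[len(prefix):].strip()
    # An empty pattern ("unic:" alone) would match everything, so reject it.
    return pattern or None
```

With this check, ordinary krunner queries return None immediately and never touch the Unicode data.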

domschrei commented 7 years ago

I found the official files in a CSV-like structure. There is also a convenient, dictionary-like index file which might be really useful. I'll look into it.

domschrei commented 7 years ago

I created a branch with initial unicode support. It works with a glossary-like text file from unicode.org. These are not all unicode symbols, but it seems to be a reasonable collection of symbols with meaningful short descriptions.

The plugin loads the entire glossary into memory when launched (< 1 MB), then searches for the entered term (non-fuzzy, case-insensitive, also matching substrings) and returns the symbols found. The priority of each result still needs to be implemented (a floating-point number between 0.0 and 1.0, which should be higher the more likely it is that this is the result you were looking for). Also, the symbol definitions in krunner-symbolsrc need to be trimmed accordingly, so that there are no duplicates.
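
As a rough illustration of the search described above, the following Python sketch does a case-insensitive substring match over an in-memory glossary and assigns each hit a score between 0.0 and 1.0. The scoring rule (coverage of the description plus a bonus for matching at the start of a word) is an assumption made up for this example, not the heuristic used in the plugin; the names relevance, search and SAMPLE_GLOSSARY are hypothetical as well.

```python
from typing import Dict, List, Tuple

# Tiny stand-in for the glossary: symbol -> description.
SAMPLE_GLOSSARY: Dict[str, str] = {
    "\u2610": "BALLOT BOX",
    "\u2611": "BALLOT BOX WITH CHECK",
    "\u2192": "RIGHTWARDS ARROW",
}


def relevance(pattern: str, description: str) -> float:
    """Toy score in [0.0, 1.0]; 0.0 means the pattern does not occur at all."""
    p, d = pattern.lower(), description.lower()
    if not p or p not in d:
        return 0.0
    if p == d:
        return 1.0
    coverage = len(p) / len(d)  # strictly below 1.0 here
    at_word_start = any(word.startswith(p) for word in d.split())
    # Assumed weighting: at most 0.5 from coverage, 0.4 bonus for a word-start match.
    return 0.5 * coverage + (0.4 if at_word_start else 0.0)


def search(pattern: str, glossary: Dict[str, str]) -> List[Tuple[str, str, float]]:
    """Return (symbol, description, score) for all matches, best match first."""
    hits = [(sym, desc, relevance(pattern, desc)) for sym, desc in glossary.items()]
    hits = [hit for hit in hits if hit[2] > 0.0]
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits
```

Calling search("box", SAMPLE_GLOSSARY) ranks BALLOT BOX ahead of BALLOT BOX WITH CHECK, because the pattern covers a larger share of the shorter description. A pattern such as "allo" still matches in the middle of a word, but without the word-start bonus.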

Thomqa commented 7 years ago

Nice! I will install it this weekend to test it.

domschrei commented 7 years ago

As of now, the plugin (inside the unicode branch) actually supports the entire Unicode database (i.e. it knows all definitions inside the official UnicodeData.txt file). The performance seems okay to me. I also implemented an advanced heuristic to sort the results from most to least relevant (though it might need some additional tweaking).
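
For readers unfamiliar with UnicodeData.txt: each line is a semicolon-separated record whose first field is the code point in hexadecimal and whose second field is the character name. The Python sketch below shows one way such a file could be turned into a symbol-to-name table; it is an illustration only, not the plugin's loader, and the function name parse_unicode_data is hypothetical. Skipping names in angle brackets (control characters and range markers) is a simplification chosen for this example.

```python
from typing import Dict, Iterable

# Sample lines in the UnicodeData.txt format (code point;name;further fields...).
SAMPLE_LINES = [
    "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;",
    "2192;RIGHTWARDS ARROW;Sm;0;ON;;;;;N;RIGHT ARROW;;;;",
    "2610;BALLOT BOX;So;0;ON;;;;;N;;;;;",
]


def parse_unicode_data(lines: Iterable[str]) -> Dict[str, str]:
    """Map each character to its name, skipping placeholder names like <control>."""
    names: Dict[str, str] = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue  # blank or malformed line
        code_point, name = fields[0], fields[1]
        if name.startswith("<"):
            continue  # control characters and range markers carry no usable name
        names[chr(int(code_point, 16))] = name
    return names
```

To use it on the real file, pass an open file object, for example parse_unicode_data(open("UnicodeData.txt", encoding="utf-8")).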

domschrei commented 7 years ago

The features have been merged into the master branch. Unicode support is disabled by default for now, but it can be enabled by a config setting (see the updated README). While doing so, I also implemented a proper "cascading" configuration, where local definitions / settings override global ones.

I'm not happy with the heuristic of relevance for the unicode symbols yet; I hope I can improve this soon.
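
The "cascading" configuration mentioned above can be pictured with this small Python sketch, which reads a global config first and a local one second, so that local values win and untouched global values remain. It uses Python's configparser purely for illustration; the section names, keys and values are invented and do not reflect the plugin's real config file.

```python
from configparser import ConfigParser

# Hypothetical global and local config contents (names are made up).
GLOBAL_TEXT = "[Settings]\nunicode_enabled = false\n[Definitions]\narrow = ->\nheart = \u2665\n"
LOCAL_TEXT = "[Settings]\nunicode_enabled = true\n[Definitions]\narrow = \u2192\n"


def load_cascading(*sources: str) -> ConfigParser:
    """Read the sources in order; later sources override earlier ones key by key."""
    parser = ConfigParser()
    parser.optionxform = str  # keep keys case-sensitive
    for text in sources:
        parser.read_string(text)
    return parser


config = load_cascading(GLOBAL_TEXT, LOCAL_TEXT)
```

Here the local file switches unicode_enabled to true and redefines arrow, while heart is still taken from the global file.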

domschrei commented 6 years ago

As mentioned in the change notes, the current release v1.0.4 now features a much better search and rank algorithm.