WeblateOrg / weblate

Web based localization tool with tight version control integration.
https://weblate.org/
GNU General Public License v3.0
4.49k stars 993 forks

Add morphology support to glossary and search #3023

Open tariver opened 5 years ago

tariver commented 5 years ago

Is your feature request related to a problem? Please describe.

The base language of my project is Russian, so my glossary is Russian > other language. But some terms are not shown if they appear in a different grammatical case. As an example: our project frequently uses the word "карта" (map), so I made a glossary entry for the term "карта". If the source string contains this term in the nominative case (карта), the glossary shows it without a problem. But if the source contains the word in the genitive case (карты), the term is not shown in the glossary tab.

Sometimes it does work, though. For example, the glossary term "справочник" (reference) shows up if the source contains the plural form "справочники". The same goes for the search function: it doesn't search for all the forms of a word.

Describe the solution you'd like

There is an open source spell checker, Hunspell, which has a lot of dictionaries created for it (for example here), and it would be nice to integrate its morphology dictionaries into Weblate's glossary and search functions.

Describe alternatives you've considered

An alternative is to enter all of a word's forms into the glossary, but that is a cumbersome task.

nijel commented 5 years ago

Right now it might be doable with Whoosh stemmers; in the long term, a better approach will be to rely on the full-text capabilities in PostgreSQL, see https://github.com/WeblateOrg/weblate/issues/2825.
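To illustrate what stemming would buy here, below is a minimal sketch in plain Python. It is not Whoosh's stemmer and not Weblate code: `toy_stem`, `glossary_hits`, and the suffix list are hypothetical names and data made up for this example, and a real stemmer uses far more careful language-specific rules. The idea is only that the glossary term and the source words are reduced to a common stem before they are compared.

```python
import re

# Hypothetical, deliberately naive suffix list for illustration only.
# Real stemmers apply much more careful, language-specific rules.
RUSSIAN_SUFFIXES = ("ами", "ов", "ам", "ах", "ой", "а", "ы", "и", "у", "е")


def toy_stem(word: str) -> str:
    """Strip the longest matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in sorted(RUSSIAN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def glossary_hits(source: str, terms: list[str]) -> list[str]:
    """Return the glossary terms whose stem equals the stem of a source word."""
    source_stems = {toy_stem(w) for w in re.findall(r"\w+", source)}
    return [t for t in terms if toy_stem(t) in source_stems]


# The genitive form "карты" now finds the nominative glossary term "карта".
print(glossary_hits("Размер карты", ["карта", "справочник"]))
```

With exact matching, "карты" would not find the entry "карта"; with both sides stemmed, it does. This sketch only handles single-word terms.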

tariver commented 4 years ago

Just to clarify: is there a way to implement this right now, or does Weblate's code need to be changed? I had a look at the Weblate files, and these stemmer files are present in the whoosh folder.

nijel commented 4 years ago

There is probably a way to implement it in the current code base using Whoosh (though it's currently not used for glossary lookup). However, it's probably not such a good idea to spend effort on this right now, given that this code is going to be replaced by PostgreSQL features (see #2825).

northantech commented 3 years ago

The same issue applies to the Turkish language as well. As just a user, I'm not very familiar with the source code, but I believe agglutinative language support would be awesome in the glossary.

Cerno-b commented 11 months ago

It is even a problem in English. If you have entered a singular term into the glossary, then plural forms of the word in the source do not match the glossary entry.

Cerno-b commented 11 months ago

@nijel Do you think this is going to be addressed?

I am currently cleaning up our glossary and there are a lot of plural forms that I would like to transform to singular in order to have a cleaner glossary overall. If I understand this correctly, then in order to find a glossary term in the sidebar, I would have to add two glossary entries, one for the singular and one for the plural if I want to make sure that the word appears in the sidebar in all cases. Is that correct?

One solution that will probably not help OP, but could lead to a lot more hits, would be to show a glossary term in the sidebar whenever it is a substring of a word in the source. That would mean that a glossary entry for "asset" would show up in the sidebar for a source text that contains "asset" or "assets". It would probably lead to some false positives, but I think the benefits outweigh the disadvantages.

Having an extra field where the user can add morphological alternatives to a word would probably be a better solution, but I guess that would be a lot more effort as well. :/
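A rough sketch of what I mean by the extra field, in plain Python. This is not how Weblate stores glossary entries: `GlossaryEntry`, its `variants` field, and `find_entries` are hypothetical names invented for this example. Each entry lists its alternative forms by hand, and a source word matches if it equals the term or any listed variant.

```python
import re
from dataclasses import dataclass, field


@dataclass
class GlossaryEntry:
    term: str
    translation: str
    # Hypothetical extra field: manually entered morphological alternatives.
    variants: list[str] = field(default_factory=list)

    def forms(self) -> set[str]:
        """All lowercase forms that should trigger this entry."""
        return {form.lower() for form in [self.term, *self.variants]}


def find_entries(source: str, glossary: list[GlossaryEntry]) -> list[GlossaryEntry]:
    """Return entries for which any form appears as a whole word in the source."""
    words = {w.lower() for w in re.findall(r"\w+", source)}
    return [entry for entry in glossary if entry.forms() & words]


glossary = [
    GlossaryEntry("asset", "Ressource", variants=["assets"]),
    GlossaryEntry("map", "Karte", variants=["maps"]),
]
# Only the "asset" entry is found; "maple" does not trigger "map".
print([e.term for e in find_entries("Load all assets near the maple tree", glossary)])
```

The glossary stays at one entry per term, and whole-word comparison avoids substring false positives, at the cost of typing in the variants. The sketch only covers single-word terms.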

nijel commented 11 months ago

I don't think substring matching is a way to go, as that would lead to too many false matches.

You would rather not see “map” glossary term when there is “maple” in the source.

Cerno-b commented 11 months ago

@nijel I agree that it's not ideal, but as a quick fix, I would probably prefer some false positives to missing out on half of the glossary entries if all plurals are hidden from the sidebar.

Of course proper morphology would be better, that's why I wondered if this is on the roadmap in the near future.

nijel commented 10 months ago

I don't think it's a reasonable thing to do even as a quick fix, as it will produce many false matches. Actually, it would only produce false matches for most of the languages (see Russian example in the initial post).
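The trade-off can be checked with a small sketch in plain Python. This is not Weblate's matching code; `substring_match` and `whole_word_match` are hypothetical helpers written for this comparison.

```python
import re


def substring_match(term: str, source: str) -> bool:
    """True if the term occurs anywhere in the source, even inside a longer word."""
    return term.lower() in source.lower()


def whole_word_match(term: str, source: str) -> bool:
    """True only if the term occurs as a complete word in the source."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, source, flags=re.IGNORECASE) is not None


for term, source in [
    ("asset", "Load all assets"),   # the English plural from the comments above
    ("map", "maple syrup"),         # unrelated word that contains the term
    ("карта", "Размер карты"),      # Russian genitive from the initial post
]:
    print(term, "|", source, "|", substring_match(term, source), whole_word_match(term, source))
```

Substring matching finds "asset" in "assets", but it also finds "map" in "maple", and it still misses "карты" because the ending replaces the final letter instead of being appended to it.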

Cerno-b commented 10 months ago

Sure, I'm fine with that, and I'm not really a fan of half-baked solutions myself, so it was a questionable idea from the start, I guess. Since proper morphology support sounds like a lot of effort, do you think the multiple-value approach discussed in #7416 is a viable alternative that takes less effort to implement?