Language-aware word counting

yheuhtozr commented 1 year ago

Describe the problem

The current Weblate UI only shows/calculates number of words split by spaces. This is particularly inconvenient for Asian languages that do not employ spacing. Weblate reports most of my source strings (which are in Chinese) as having only one word regardless of actual length. For example, a simple word like

开

and a whole paragraph such as

小娜在2014年4月2日举行的微软Build开发者大会上正式展示并发布。2014年中旬，微软发布了“小娜”这一名字，作为Cortana在中国大陆使用的中文名。与这一中文名一起发布的是小娜在中国大陆的另一个形象。“小娜”一名源自微软旗下知名FPS游戏《光环》中的同名女角色。

are both "single word" based on the current algorithm.

Describe the solution you'd like

East Asian languages count characters instead of words for the amount of text. So, displaying character counts is more relevant for those languages. However, the "character count" in practice does not mean the simple number of characters, but hybrid counting of Asian characters and alphabetic words. I believe MS Word and major commercial CAT tools do it this way, with slight differences in detail. Applicable languages include Chinese, Japanese, Korean (technically able to count words but character counting is more prevalent), and possibly Thai (though not sure).

Possible solutions are:

Universally calculating number of symbols used in those languages one by one in word count
Use different counting method according to language specified
Add several options for users to choose per project, category, etc.

Describe alternatives you've considered

No response

Screenshots

No response

Additional context

No response

nijel commented 1 year ago

Should be pretty easy to implement, I've just factored out word counting to a helper function in https://github.com/WeblateOrg/weblate/pull/10279. There is already Language.uses_ngram which could trigger this way of counting...

github-actions[bot] commented 1 year ago

This issue seems to be a good fit for newbie contributors. You are welcome to contribute to Weblate! Don't hesitate to ask any questions you would have while implementing this.

You can learn about how to get started in our contributors documentation.

yheuhtozr commented 1 year ago

@nijel

There is already Language.uses_ngram which could trigger this way of counting...

Thanks, great to hear. Does it already have some implementation with any effect?

nijel commented 1 year ago

It is purely based on the language code:

https://github.com/WeblateOrg/weblate/blob/4904d25b1b2e4520eaa353efcfa3577d0d9624f4/weblate/lang/models.py#L640-L641

yheuhtozr commented 1 year ago

@nijel Thank you. What I wanted to ask is actually, if I should change the list of languages in uses_ngram, would it have any side effect?

nijel commented 1 year ago

It affects glossary matching as well (for the same reasons as word counting), so it should have a positive effect on such languages (but it also applies only when used as a source language).

nijel commented 12 months ago

https://stackoverflow.com/a/16528427/225718 might also be an approach worth benchmarking against regex-based solution in https://github.com/WeblateOrg/weblate/pull/10284

yheuhtozr commented 12 months ago

@nijel Thanks for the pointer. I am having some rush work recently and may come back next week or so.

nijel commented 11 months ago

I originally wanted to reply at https://github.com/WeblateOrg/weblate/pull/10284, but I think the discussion more belongs here.

The first thing we should do is to specify what we want to count as a word. It is a metric which is used for several purposes (cost estimates while translating, it will be a pricing unit for some of our services).

For non-CJK scripts, I'd not change the current definition (see https://docs.weblate.org/en/latest/devel/reporting.html#number-of-words). There is no reason to change that.
For Chinese and some other languages this current implementation is useless, and we need to have something better.

Looking around, there are several approaches:

Count only ideographs separately (https://github.com/holmesconan/vscode-wordcount-cjk)
Count all CJK chars as words (LibreOffice, https://github.com/magiclen/words-count)
Counting on spaces only (Google Docs, https://support.crowdin.com/crowdin-word-counter/)
Doing some computation based on chars (https://help.transifex.com/en/articles/6212250-how-are-hosted-words-calculated)

Taking it language by language:

Chinese uses ideographs, but in most cases one word consists of more ideographs. There is study on that.
Korean has letters separated by spaces, so the current word count should work well, while the LibreOffice approach inflates word counts to match the number of characters.
Japanese is mixed. For Hiragana pretty much same as for Chinese applies, while Katakana should be merely accounted as characters (but there is no whitespace separating these).
There are definitely more languages which do not separate words and do not use any of the CJK (for example https://stackoverflow.com/q/4861619/225718).

In the end, I don't think we should implement the own word counting, as that is a pretty complex topic. Luckily, there is an existing standard and implementation with Python bindings.

The good thing is that it does split which seems to make sense for Chinese, Japanese, Korean and Khmer (note that I don't know these languages, so I might be easily wrong here), so that the hard work is addressed.

The bad thing is that it emits more tokens than we need (whitespace) and splits things we would rather not split (%(color)s is considered as a single word right now, but would become 5).

We undoubtedly do not want to count whitespace, so these would have to be filtered out.

So the solution could be based on this:

from icu import BreakIterator, Locale

def word_count(text, language):
    boundary = BreakIterator.createWordInstance(Locale(language.code))
    boundary.setText(text)
    count = 0
    for part in boundary:
        # Skip WORD_NONE, what is typically whitespace or standalone punctuation
        if boundary.getRuleStatus() == 0: 
            continue
        count += 1
    return count

yheuhtozr commented 11 months ago

@nijel

The first thing we should do is to specify what we want to count as a word. It is a metric which is used for several purposes (cost estimates while translating, it will be a pricing unit for some of our services).

To be clear, this is the very start point why we need Asian (here I refer to Chinese, Japanese, and Korean) "word" count. Both in the translation or other writing industries, we count in, and pay by, what we call "characters", which we also conventionally (and confusingly; in the translation industry) call it in English "words", and which is what roughly I described in the first comment.

Count only ideographs separately (https://github.com/holmesconan/vscode-wordcount-cjk)

Count all CJK chars as words (LibreOffice, https://github.com/magiclen/words-count)

Counting on spaces only (Google Docs, https://support.crowdin.com/crowdin-word-counter/)

Doing some computation based on chars (https://help.transifex.com/en/articles/6212250-how-are-hosted-words-calculated)

Those links you have listed (note: your third link is actually different from the description) above are all aiming for very similar result with perhaps different technical details. You can try them on any Asian sentence examples and I guess they output barely different numbers with at most a fraction of % if not identical. I can also provide additional sources like:

While memoQ does also allow user to count Korean text in English-like word count, character counting is much more prevalent in the actual world, at least from my experience collaborating with Korean translators.

Chinese uses ideographs, but in most cases one word consists of more ideographs. There is study on that.

In the end, I don't think we should implement the own word counting, as that is a pretty complex topic. Luckily, there is an existing standard and implementation with Python bindings.

This kind of true, linguistic words are only obtainable through close analysis (or more recently non-deterministic using trained model) thus not credited as index of amount of text. Imagine how you even do this before the rise of computers. A couple of SE questions for your cultural input:

There are definitely more languages which do not separate words and do not use any of the CJK (for example https://stackoverflow.com/q/4861619/225718).

Yes, AFAIK Thai and Khmer are the only languages with no reliable/standard word counting system. I have only heard about that some use syllable count while other characters or clusters of character (one vertical chunk of base character + dependent marks of vowels etc.)

yheuhtozr commented 11 months ago

As for the comments on the other thread, I don't think an algorithm like mine is overly complicated (though need more optimization); it is just judging if the next code point is CJK or not, then branch the counting method. If it looks monstrous, it is because the nature of Unicode that does not build such information into bits. If we can assume that only either ASCII or CJK characters can appear in East Asian text, the logic might be able to be streamlined a lot, but that is risky.

I'd also like to see some standardized or popular implementation happens to exist, but sadly it is not uncommon at all that such language-specific tools not introduced in the computer world if you live outside Europe and North America.

nijel commented 11 months ago

To be clear, this is the very start point why we need Asian (here I refer to Chinese, Japanese, and Korean) "word" count. Both in the translation or other writing industries, we count in, and pay by, what we call "characters", which we also conventionally (and confusingly; in the translation industry) call it in English "words", and which is what roughly I described in the first comment.

There already is a separate character count metric, and it is shown on all listings (if there is a screen space for that). IMHO, word metric should really be what words are supposed to be so that it is roughly comparable between languages. Therefore, I think ICU-based solution is a suitable approach to that. Do you see any problem in that?

It is a widely used solution, but typically not for word counting, but for text selection by double click to select a word.

Performance wise, it would be on par with other solutions we've discussed (the actual word split is about 3x faster, but it will need additional processing to get the corner cases right).

As for the comments on the other thread, I don't think an algorithm like mine is overly complicated (though need more optimization); it is just judging if the next code point is CJK or not, then branch the counting method.

I don't think it is overly complicated, I'm just unsure if it gives numbers we want.

yheuhtozr commented 11 months ago

There already is a separate character count metric, and it is shown on all listings (if there is a screen space for that). IMHO, word metric should really be what words are supposed to be so that it is roughly comparable between languages. Therefore, I think ICU-based solution is a suitable approach to that. Do you see any problem in that?

Thank you for your input. I am not against for it to be an ambitious goal for your platform, but in reality, it contradicts the industry practice and expectation of nearly every user who uploads source string from East Asia. (Admittedly, traditional translation agencies set different rate based on language pairs in the first place, thus unified rate was not a problem.) So, the "standard character count" would be necessary in somewhere either way, in case they need to account to external clients.

IMHO, word metric should really be what words are supposed to be so that it is roughly comparable between languages.

By the way... this is unfortunately not true. Or it may be practically true among European languages due to high level of lexical homology, but does not hold across regions.

First on the technical level, if you use the ICU segmenter, space-dividing languages are just split by spaces IIRC. That means 中華民國 in Chinese is a single word, but Trung Hoa Dân quốc in Vietnamese is four, even though they are just different spelling of the same concept. This is because Vietnamese orthography splits in syllables, not words that we understand. Chinese and Vietnamese have a quite similar syntax, but you'd suddenly expect 2-3x smaller word count for Chinese for a sentence of same length.

Second, dictionary-based word segmentation relies heavily on dictionary, or the design principle behind it. For example, currently there are at least two major factions about word segmentation of Japanese, each with their own rationale and usability. So a chunk of phrase 食べられませんでしたが ("even though [I] was not able to eat") would be 1 word in one theory (similar to budoux), but would be 6 (食べ|られ|ません|でし|た|が) in another (similar to MeCab).

Also, while writing this comment, I just found the Chinese and Japanese dictionary ICU uses are probably inconsistent in granularity, as 中華人民共和国 spelled in Japanese is one word, but 中华人民共和国 in Chinese is three (both mean People's Republic of China).

On top of that, beware that the current implementation of ICU segmenter (you can test on browser) does not even recognize Japanese conjugation, so its output is totally pointless.

const jas = new Intl.Segmenter("ja", {granularity: "word"});
[...jas.segment("食べられませんでしたが")].map(s => s["segment"]).join("|")
// => '食|べら|れ|ま|せん|で|した|が' ???

Finally, don't forget to take care of Finnish, Hungarian, Turkish or Native American languages where you can put a sentence-equivalent content into single word... (and those are really single words in linguistics terms).

nijel commented 11 months ago

Thanks for sharing that, this is really an interesting topic (where my knowledge is quite limited). It seems it is nearly impossible to have a word count implementation which is always correct and handles all the corner cases. I think it is important to have a metric that works reasonably well in most situations.

Let's get back to the real world and look at the choices we have:

Count on whitespace only

Current implementation
Useless for CJK (and other languages)

Use CJK char count

As outlined in https://github.com/WeblateOrg/weblate/pull/10284
Industry standard for CJK (except for Transifex, which has own metric)
Doesn't cover some other languages, so we might end up implementing different rules for these in the future
Having own code for this will need a maintenance

Use ICU

As outlined in https://github.com/WeblateOrg/weblate/issues/10278#issuecomment-1833494243
Has support for most of the languages
We would not be maintaining this code

Anything else?

I'd really love to use an existing code for this and get reasonable results for most of the languages. For example, LibreOffice seems to do the word counting in Thai similarly as ICU and I would rather not reimplement this inside Weblate.

PS: https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/text-analysis/notebooks/Splitting%20words%20in%20East%20Asian%20languages.ipynb describes some language specific splitting libs.

yheuhtozr commented 11 months ago

I think, maybe, there is a small mismatch on what we have been imagining when we say word count. When we mention that we need to "count words" in English, there are at least three distinct dimensions.

count tokens in a sentence: for grammatical analysis or machine learning
count amount of content: for estimation of how much text work (to be) done, like in writing or translation
count units of price: for basis of payment

As I observed, your discussion tends to mix all of three at once, which is I speculate because typographical word of Latin alphabet serves for all of them. What I'd like to focus is #‌2, because this is the one meaningful in a translation platform. In East Asia, only #‌2 and #‌3 are measured by the same metric. You mentioned ICU segmenter several times, but its "word segmenter" chiefly accounts for #‌1 (same for the Google Colab article). It is certainly useful for web layout such as in which range becomes selected when you double click the text or how to word-wrap non-spacing languages, but means nothing in literary industry.

Industry standard for CJK (except for Transifex, which has own metric)

I think what you mean here by "metric" is #‌3, But if you look at the Transifex page, they do actually count words (as in #‌2) in East Asian standard method.

For example, the word count of the following phrase is 9 (five characters and four words):

今日 sunny です。 outdoors １２３ %(day)s

They just apply some multiplier to this number to balance with other languages, so their #‌3 is just a function of #‌2, and they are actually doing this work.

FYI, If you are not convinced why the three roles I listed above are not "naturally" inherent to words, there are other axes, e.g.:

count how much space a text takes up: for physical estimation in typography or printing

In East Asia, this one shares the (almost all of) logic with #‌2 and #‌3 by character-based counting, which is related to why we call "word count" as in this discussion character count in those languages as if a matter of course, even though it is not pure character count. In European languages, this is usually nothing to do with words but much more relevant with letter counting, if anything.

Anyway, focusing on #‌2, AFAIK the case branching is already exhausted for existing modern scripts, unless you try to support Hieroglyphs or some new creative writing systems should emerge in future:

Count by typographic space
- all spacing languages (but beware of some languages that do not use the exact ASCII space, such as Tibetan)
Count by character and word
- Chinese, Japanese, Korean, and minority languages around those regions that follow the model (such as Yi)
- each character in those languages is conceptually a self-standing syllable, so it is analogy of syllable counting
No rule-based counting, only by dictionary or learning model
- to my knowledge, Javanese, Khmer, Lao, Thai and languages around those regions with similar model (such as Tai Tham)
- they might really need language-specific solutions

So, while it ultimately depends on how you estimate the maintenance cost, with character-based counting method implemented, I think you will already cover the coverable.

nijel commented 11 months ago

The word counting is a number used in views and reports, and thus it should reflect the amount of work.
We're considering switching to pricing based on hosted words for our cloud hosting services, and that would make word the unit of price.

So I'm looking to cover both 2 and 3, and I don't want these two to diverge because that would be confusing. Moreover, we need a clear (and possibly simple) definition of the word for that.

Using character count as word count in CJK mirrors existing statistics, so the UI would show the same number twice.

The words metric for these languages is inflated by counting characters instead, and that is the reason for Transifex having Hosted Words = Source Language Characters × 50% × Number of Target Languages. Still, their formula doesn't cover languages not using CJK and not using spaces, but these are not widely used as a source language, so being wrong here doesn't matter that much.

Maybe using words for pricing is not that good idea, but the thing is that we want to base it on the amount of the content.

yheuhtozr commented 11 months ago

So I'm looking to cover both 2 and 3, and I don't want these two to diverge because that would be confusing.

Yes, that is what I implicitly anticipate too (or #‌3 depends on #‌2), so we are on the same page.

Using character count as word count in CJK mirrors existing statistics, so the UI would show the same number twice.

This only happens when the source text is fully written in CJK characters. In practice, a meaningful amount of text that does not contain a row of Arabic numerals is rarely seen. The pure Unicode character count (= len()) is just a technical detail, sometimes fairly close to, but not accurate "text amount count". For example, the example I gave on the very top should count 118, while its Unicode character count is 136.

nijel commented 11 months ago

In the end, I think we should:

Stick with current counting as default.
Implement counting based on one of the solutions outlined in https://github.com/WeblateOrg/weblate/pull/10284 and use it CJK languages.
Use ICU-based counting for languages like Javanese, Khmer, Lao, Thai where above would produce no reasonable results.

Possibly, 1 and 2 could be merged into a unified algorithm.

yheuhtozr commented 11 months ago

I think that is fine. As a technical sidenote, I managed to make the East Asian counter ~17x slower than split(), but no more faster.

nijel commented 11 months ago

That should be acceptable performance wise.

github-actions[bot] commented 10 months ago

Thank you for your report; the issue you have reported has just been fixed.

In case you see a problem with the fix, please comment on this issue.
In case you see a similar problem, please open a separate issue.
If you are happy with the outcome, don’t hesitate to support Weblate by making a donation.

WeblateOrg / weblate