Should be pretty easy to implement; I've just factored out word counting to a helper function in https://github.com/WeblateOrg/weblate/pull/10279. There is already Language.uses_ngram, which could trigger this way of counting...
This issue seems to be a good fit for newbie contributors. You are welcome to contribute to Weblate! Don't hesitate to ask any questions you have while implementing this.
You can learn how to get started in our contributor documentation.
@nijel
There is already Language.uses_ngram, which could trigger this way of counting...
Thanks, great to hear. Does it already have some implementation with any effect?
It is purely based on the language code:
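As a rough sketch of such a purely code-based check (the base_code attribute and the exact language tuple are assumptions here; the real property in weblate/lang/models.py may differ):

# Hypothetical sketch of a code-based check; not the actual Weblate model.
class Language:
    def __init__(self, code: str):
        # Reduce e.g. "zh_Hans" or "ja@formal" to the bare base code.
        self.code = code
        self.base_code = code.replace("@", "_").split("_")[0].lower()

    @property
    def uses_ngram(self) -> bool:
        # Languages conventionally written without spaces between words (assumed list).
        return self.base_code in ("ja", "ko", "zh")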
@nijel Thank you. What I actually wanted to ask is: if I change the list of languages in uses_ngram, would it have any side effects?
It affects glossary matching as well (for the same reasons as word counting), so it should have a positive effect on such languages (though it only applies when the language is used as a source language).
https://stackoverflow.com/a/16528427/225718 might also be an approach worth benchmarking against the regex-based solution in https://github.com/WeblateOrg/weblate/pull/10284.
@nijel Thanks for the pointer. I have some urgent work at the moment and may come back to this next week or so.
I originally wanted to reply at https://github.com/WeblateOrg/weblate/pull/10284, but I think the discussion belongs here.
The first thing we should do is to specify what we want to count as a word. It is a metric used for several purposes (cost estimates while translating, and it will be a pricing unit for some of our services).
Looking around, there are several approaches:
- Count only ideographs separately (https://github.com/holmesconan/vscode-wordcount-cjk)
- Count all CJK chars as words (LibreOffice, https://github.com/magiclen/words-count)
- Count on spaces only (Google Docs, https://support.crowdin.com/crowdin-word-counter/)
- Do some computation based on chars (https://help.transifex.com/en/articles/6212250-how-are-hosted-words-calculated)
Taking it language by language:
- Chinese uses ideographs, but in most cases one word consists of several ideographs. There are studies on that.
- There are definitely more languages which do not separate words and do not use any of the CJK scripts (for example https://stackoverflow.com/q/4861619/225718).
In the end, I don't think we should implement our own word counting, as that is a pretty complex topic. Luckily, there is an existing standard and implementation with Python bindings.
The good thing is that it produces splits which seem to make sense for Chinese, Japanese, Korean, and Khmer (note that I don't know these languages, so I might easily be wrong here), so the hard work is already addressed.
The bad thing is that it emits more tokens than we need (whitespace) and splits things we would rather not split (%(color)s is considered a single word right now, but would become 5).
We undoubtedly do not want to count whitespace, so these would have to be filtered out.
So the solution could be based on this:
from icu import BreakIterator, Locale

def word_count(text, language):
    boundary = BreakIterator.createWordInstance(Locale(language.code))
    boundary.setText(text)
    count = 0
    # Iterating over the BreakIterator yields successive break positions.
    for _position in boundary:
        # Skip WORD_NONE tokens, which are typically whitespace or standalone punctuation
        if boundary.getRuleStatus() == 0:
            continue
        count += 1
    return count
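For illustration, the helper above can be exercised with a minimal stand-in for the Language object; only the code attribute is assumed to matter here, and the resulting number depends on the installed ICU data:

from types import SimpleNamespace

# Any object exposing a .code attribute works for this sketch.
japanese = SimpleNamespace(code="ja")
print(word_count("今日 sunny です。 outdoors 123 %(day)s", japanese))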
@nijel
The first thing we should do is to specify what we want to count as a word. It is a metric used for several purposes (cost estimates while translating, and it will be a pricing unit for some of our services).
To be clear, this is the very starting point of why we need an Asian (here I refer to Chinese, Japanese, and Korean) "word" count. In both the translation and other writing industries, we count in, and pay by, what we call "characters", which we also conventionally (and, in the translation industry, confusingly) call "words" in English, and which is roughly what I described in the first comment.
- Count only ideographs separately (https://github.com/holmesconan/vscode-wordcount-cjk)
- Count all CJK chars as words (LibreOffice, https://github.com/magiclen/words-count)
- Count on spaces only (Google Docs, https://support.crowdin.com/crowdin-word-counter/)
- Do some computation based on chars (https://help.transifex.com/en/articles/6212250-how-are-hosted-words-calculated)
The links you have listed above (note: your third link is actually different from its description) are all aiming for a very similar result, with perhaps different technical details. You can try them on any Asian sentence examples, and I guess they would output barely different numbers, differing by at most a fraction of a percent if not identical. I can also provide additional sources like:
While memoQ also allows users to count Korean text with an English-like word count, character counting is much more prevalent in the real world, at least in my experience collaborating with Korean translators.
- Chinese uses ideographs, but in most cases one word consists of several ideographs. There are studies on that.
In the end, I don't think we should implement our own word counting, as that is a pretty complex topic. Luckily, there is an existing standard and implementation with Python bindings.
This is kind of true; linguistic words are only obtainable through close analysis (or, more recently, non-deterministically using trained models), and thus are not credited as an index of the amount of text. Imagine how you would even do this before the rise of computers. A couple of SE questions for your cultural input:
- There are definitely more languages which do not separate words and do not use any of the CJK scripts (for example https://stackoverflow.com/q/4861619/225718).
Yes, AFAIK Thai and Khmer are the only languages with no reliable/standard word counting system. I have only heard that some use syllable counts while others count characters or clusters of characters (one vertical chunk of a base character plus dependent marks for vowels, etc.).
As for the comments on the other thread, I don't think an algorithm like mine is overly complicated (though it needs more optimization); it just judges whether the next code point is CJK or not, then branches the counting method. If it looks monstrous, it is because of the nature of Unicode, which does not build such information into the bits. If we could assume that only ASCII or CJK characters appear in East Asian text, the logic might be streamlined a lot, but that would be risky.
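For illustration, a minimal sketch of such a branching counter could look like the following; this is not the code from https://github.com/WeblateOrg/weblate/pull/10284, the CJK ranges are deliberately simplified, and standalone CJK punctuation is not handled:

# Simplified illustration of branching on CJK code points; not the PR's implementation.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
    (0xAC00, 0xD7A3),  # Hangul syllables
)

def is_cjk(char: str) -> bool:
    code = ord(char)
    return any(low <= code <= high for low, high in CJK_RANGES)

def hybrid_word_count(text: str) -> int:
    count = 0
    in_word = False
    for char in text:
        if is_cjk(char):
            # Each CJK character counts as one "word" on its own.
            count += 1
            in_word = False
        elif char.isspace():
            in_word = False
        else:
            # A run of non-CJK, non-space characters counts as a single word,
            # so placeholders like %(color)s stay a single unit.
            if not in_word:
                count += 1
            in_word = True
    return count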
I'd also love it if some standardized or popular implementation happened to exist, but sadly it is not at all uncommon for such language-specific tools to be missing from the computing world if you live outside Europe and North America.
To be clear, this is the very starting point of why we need an Asian (here I refer to Chinese, Japanese, and Korean) "word" count. In both the translation and other writing industries, we count in, and pay by, what we call "characters", which we also conventionally (and, in the translation industry, confusingly) call "words" in English, and which is roughly what I described in the first comment.
There already is a separate character count metric, and it is shown in all listings (if there is screen space for it). IMHO, the word metric should really be what words are supposed to be, so that it is roughly comparable between languages. Therefore, I think an ICU-based solution is a suitable approach. Do you see any problem with that?
It is a widely used solution, though typically not for word counting but for text selection (double-clicking to select a word).
Performance-wise, it would be on par with the other solutions we've discussed (the actual word split is about 3x faster, but it will need additional processing to get the corner cases right).
As for the comments on the other thread, I don't think an algorithm like mine is overly complicated (though it needs more optimization); it just judges whether the next code point is CJK or not, then branches the counting method.
I don't think it is overly complicated; I'm just unsure whether it gives the numbers we want.
There already is a separate character count metric, and it is shown in all listings (if there is screen space for it). IMHO, the word metric should really be what words are supposed to be, so that it is roughly comparable between languages. Therefore, I think an ICU-based solution is a suitable approach. Do you see any problem with that?
Thank you for your input. I am not against it being an ambitious goal for your platform, but in reality it contradicts the industry practice and the expectations of nearly every user who uploads source strings from East Asia. (Admittedly, traditional translation agencies set different rates based on language pairs in the first place, so a unified rate was not a problem.) So the "standard character count" would be necessary somewhere either way, in case they need to account to external clients.
IMHO, the word metric should really be what words are supposed to be, so that it is roughly comparable between languages.
By the way... this is unfortunately not true. Or it may be practically true among European languages due to their high level of lexical homology, but it does not hold across regions.
First, on the technical level, if you use the ICU segmenter, space-dividing languages are just split by spaces IIRC. That means 中華民國 in Chinese is a single word, but Trung Hoa Dân quốc in Vietnamese is four, even though they are just different spellings of the same concept. This is because Vietnamese orthography splits on syllables, not on words as we understand them. Chinese and Vietnamese have quite similar syntax, but you would suddenly get a 2-3x smaller word count for Chinese for a sentence of the same length.
Second, dictionary-based word segmentation relies heavily on the dictionary, or the design principles behind it. For example, there are currently at least two major schools of thought about word segmentation of Japanese, each with its own rationale and usability. So a phrase like 食べられませんでしたが ("even though [I] was not able to eat") would be 1 word in one theory (similar to budoux), but 6 (食べ|られ|ません|でし|た|が) in another (similar to MeCab).
Also, while writing this comment, I found that the Chinese and Japanese dictionaries ICU uses are probably inconsistent in granularity: 中華人民共和国 spelled in Japanese is one word, but 中华人民共和国 in Chinese is three (both mean People's Republic of China).
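For what it is worth, this comparison can be reproduced locally with PyICU along these lines (the output depends on the installed ICU version and its bundled dictionaries, so it may differ from the above):

from icu import BreakIterator, Locale

def word_segments(text, code):
    # Collect the pieces of text between successive word-break positions.
    breaker = BreakIterator.createWordInstance(Locale(code))
    breaker.setText(text)
    parts, start = [], 0
    for end in breaker:
        parts.append(text[start:end])
        start = end
    return parts

print(word_segments("中華人民共和国", "ja"))
print(word_segments("中华人民共和国", "zh"))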
On top of that, beware that the current implementation of the ICU segmenter (you can test it in a browser) does not even recognize Japanese conjugation, so its output is totally pointless:
const jas = new Intl.Segmenter("ja", {granularity: "word"});
[...jas.segment("食べられませんでしたが")].map(s => s["segment"]).join("|")
// => '食|べら|れ|ま|せん|で|した|が' ???
Finally, don't forget to take care of Finnish, Hungarian, Turkish, or Native American languages, where you can put sentence-equivalent content into a single word... (and those really are single words in linguistic terms).
Thanks for sharing that; this is really an interesting topic (where my knowledge is quite limited). It seems nearly impossible to have a word count implementation which is always correct and handles all the corner cases. I think it is important to have a metric that works reasonably well in most situations.
Let's get back to the real world and look at the choices we have:
- Count on whitespace only
- Use CJK char count: industry standard for CJK (except for Transifex, which has its own metric)
- Use ICU
Anything else?
I'd really love to use existing code for this and get reasonable results for most languages. For example, LibreOffice seems to do word counting in Thai similarly to ICU, and I would rather not reimplement this inside Weblate.
PS: https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/text-analysis/notebooks/Splitting%20words%20in%20East%20Asian%20languages.ipynb describes some language-specific splitting libraries.
I think, maybe, there is a small mismatch in what we have been imagining when we say word count. When we say we need to "count words" in English, there are at least three distinct dimensions:
1. How text is technically segmented into word-like units (for example, for text selection or line wrapping).
2. How the amount of text, and thus the translation effort, is measured.
3. What unit is used for pricing.
As I observe, your discussion tends to mix all three at once, which I speculate is because the typographical word of the Latin alphabet serves all of them. What I'd like to focus on is #2, because this is the one that is meaningful in a translation platform. In East Asia, only #2 and #3 are measured by the same metric. You mentioned the ICU segmenter several times, but its "word segmenter" chiefly accounts for #1 (the same goes for the Google Colab article). It is certainly useful for web layout, such as which range gets selected when you double-click text or how to word-wrap non-spacing languages, but it means nothing in the literary industry.
Industry standard for CJK (except for Transifex, which has its own metric)
I think what you mean here by "metric" is #3, but if you look at the Transifex page, they actually do count words (as in #2) using the East Asian standard method.
For example, the word count of the following phrase is 9 (five characters and four words):
今日 sunny です。 outdoors 123 %(day)s
They just apply a multiplier to this number to balance it against other languages, so their #3 is just a function of #2, and they are actually doing this work.
FYI, if you are not convinced that the three roles I listed above are not "naturally" inherent to words, there are other axes as well, e.g.:
In East Asia, this one shares (almost all of) its logic with #2 and #3 through character-based counting, which is related to why we call the "word count" of this discussion a character count in those languages as a matter of course, even though it is not a pure character count. In European languages, this usually has nothing to do with words and is much more related to letter counting, if anything.
Anyway, focusing on #2, AFAIK the case branching is already exhaustive for existing modern scripts, unless you try to support hieroglyphs or some new creative writing system that may emerge in the future:
So, while it ultimately depends on how you estimate the maintenance cost, I think that with a character-based counting method implemented you will already cover what can be covered.
So I'm looking to cover both 2 and 3, and I don't want these two to diverge because that would be confusing. Moreover, we need a clear (and possibly simple) definition of a word for that.
Using the character count as the word count in CJK mirrors the existing statistic, so the UI would show the same number twice.
The words metric for these languages is inflated by counting characters instead, and that is the reason for Transifex having Hosted Words = Source Language Characters × 50% × Number of Target Languages. Still, their formula doesn't cover languages that use neither CJK nor spaces, but these are not widely used as source languages, so being wrong there doesn't matter that much.
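As a worked example of that formula (purely illustrative; the 50% factor is Transifex's published number, not anything Weblate uses):

def hosted_words(source_characters: int, target_languages: int) -> float:
    # Transifex-style formula quoted above: characters are halved to approximate words,
    # then multiplied by the number of target languages.
    return source_characters * 0.5 * target_languages

# A 10,000-character Chinese source translated into 4 languages:
print(hosted_words(10_000, 4))  # 20000.0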
Maybe using words for pricing is not that good an idea, but the thing is that we want to base it on the amount of content.
So I'm looking to cover both 2 and 3, and I don't want these two to diverge because that would be confusing.
Yes, that is what I implicitly anticipated too (that is, #3 depends on #2), so we are on the same page.
Using the character count as the word count in CJK mirrors the existing statistic, so the UI would show the same number twice.
This only happens when the source text is written entirely in CJK characters. In practice, you rarely see a meaningful amount of text that does not contain at least a run of Arabic numerals. The pure Unicode character count (= len()) is just a technical detail, sometimes fairly close to, but not an accurate, "text amount count". For example, the example I gave at the very top should count 118, while its Unicode character count is 136.
In the end, I think we should:
Possibly, 1 and 2 could be merged into a unified algorithm.
I think that is fine. As a technical side note, I managed to get the East Asian counter to only ~17x slower than split(), but no faster than that.
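A rough way to measure such a ratio (purely illustrative; the stand-in counter below is not the actual East Asian counter, which would be plugged in instead):

import timeit

def slowdown(counter, text, number=10_000):
    # How many times slower `counter` is than a plain whitespace split.
    baseline = timeit.timeit(lambda: len(text.split()), number=number)
    candidate = timeit.timeit(lambda: counter(text), number=number)
    return candidate / baseline

# Stand-in counter for demonstration; substitute the real counting function here.
sample = "今日 sunny です。 outdoors 123 %(day)s " * 50
print(f"~{slowdown(lambda s: sum(1 for c in s if not c.isspace()), sample):.1f}x slower")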
That should be acceptable performance-wise.
Thank you for your report; the issue you have reported has just been fixed.
Describe the problem
The current Weblate UI only shows/calculates the number of words as split by spaces. This is particularly inconvenient for Asian languages that do not employ spacing. Weblate reports most of my source strings (which are in Chinese) as having only one word regardless of their actual length. For example, a simple word like
and a whole paragraph such as
are both "single word" based on the current algorithm.
Describe the solution you'd like
East Asian languages count characters instead of words to measure the amount of text, so displaying character counts is more relevant for those languages. However, the "character count" in practice does not mean the simple number of characters, but a hybrid count of Asian characters and alphabetic words. I believe MS Word and major commercial CAT tools do it this way, with slight differences in detail. Applicable languages include Chinese, Japanese, Korean (technically able to count words, but character counting is more prevalent), and possibly Thai (though I am not sure).
Possible solutions are: