cgiffard / Downsize

Tag safe text truncation for HTML and XML!
BSD 3-Clause "New" or "Revised" License

downsize does not seem to handle Asian languages #15

Open liushuping opened 10 years ago

liushuping commented 10 years ago

For character-based Asian languages, "word" and "character" are effectively the same concept, and words are not separated by spaces. For example, the English sentence "The quick brown fox jumps over the lazy dog" is "敏捷的棕毛狐狸从懒狗身上跃过" in Chinese. Downsizing that sentence to 2 words, we expect the result to be "敏捷", but the actual result is not, because the whole sentence is treated as a single word.
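
A minimal sketch of what I mean (assuming the standard downsize(text, { words: n }) call):

```js
var downsize = require("downsize");

// English: words are space-delimited, so truncating to two words behaves as expected.
downsize("The quick brown fox jumps over the lazy dog", { words: 2 });
// expected (and actual): "The quick"

// Chinese: there are no spaces, so the whole sentence registers as a single "word"
// and the two-word limit is never reached.
downsize("敏捷的棕毛狐狸从懒狗身上跃过", { words: 2 });
// expected: "敏捷"
// actual (per this issue): the entire sentence comes back untouched
```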

cgiffard commented 10 years ago

Japanese is even harder, since it has a mixture of single- and multi-character words, and words with both ideographic and phonetic components. I think it would be possible to implement a solution to this problem for Hanzi and Hangul that increments the word counter on every character in those ranges, but as far as I'm able to ascertain, that would actually make it harder to provide accurate word counts in Japanese. This isn't a straightforward problem by any stretch, and I might need to implement a technical standard for doing the word breaking for CJK. Other languages, such as Arabic and Thai, are also problematic — and unlike East Asian languages, I've got absolutely no idea where to start with those.

The long and short of it is that I cannot, and do not want to, include language dictionaries in order to do the word count (there are copyright issues there as well). If I can come up with a solution that gets close enough, I think that's good enough — because the reality is that word-breaking many non-Latin languages is insanely hard.

I would welcome any help on this from people who know their i18n!

discuss :)

cgiffard commented 10 years ago

I figure we should split the word count out into a special function, separate from the counting block, and put the i18n logic in there. It's possible that the CJK counting might require lookahead, which is not available via the streaming parser... which means the entire architecture will need a review.
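
Something along these lines, perhaps (names here are placeholders, not the actual Downsize internals):

```js
// Rough sketch only: the streaming parser hands each text chunk to countWords()
// instead of counting inline, so any locale-specific logic lives in one place.
var wordCounters = {
    "default": function (chunk) {
        // Space-delimited counting, roughly what Downsize does today.
        return (chunk.match(/\S+/g) || []).length;
    }
    // a "cjk" counter (per-character, range-checked) could be registered here later
};

function countWords(chunk, locale) {
    var counter = wordCounters[locale] || wordCounters["default"];
    return counter(chunk);
}

countWords("The quick brown fox", "en"); // => 4 (falls back to the default counter)
```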

cgiffard commented 10 years ago

It's actually really annoying that Han Unification happened — otherwise we'd be able to very easily tell whether text was Traditional Chinese, Simplified Chinese, Korean, or Japanese, just by looking at character ranges. As it is I think we might have to add a significant lookahead to the parser to try and guess the language before truncating. :-/

We could add a flag that allows shortcutting that by letting the user specify a language manually, eliminating false guesses and improving performance by removing the requirement for lookahead.

In the event that the lookahead sampler guessed wrongly — it would only likely guess Chinese where Japanese was the language (rather than the other way around) — the outcome would be that the Japanese snippets would be very short. I think that's manageable!

I would set the initial lookahead buffer size by determining the longest kanji-only word in Japanese and adding a bit of padding for HTML, etc. Hangul is easy to detect, so that'd be an immediate shortcut. If a Japanese user uses an entirely kanji word that's longer than our buffer — well, that's a crazy edge case we probably shouldn't stress about.

I am concerned about mixed language posts. Mixing Chinese and English is relatively straightforward, but Japanese and English could be a bit of a headache... I need to research this more.
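
Very roughly, the sampler I have in mind might look like this (the ranges and the "Han-only means Chinese" assumption are mine, nothing settled):

```js
// Rough sketch of the lookahead sampler idea. Ranges are approximate and
// none of this is wired into the parser yet.
function guessLanguage(sample) {
    if (/[\uAC00-\uD7AF\u1100-\u11FF]/.test(sample)) return "korean";   // Hangul: unambiguous
    if (/[\u3040-\u30FF]/.test(sample)) return "japanese";              // Hiragana/Katakana present
    if (/[\u4E00-\u9FFF]/.test(sample)) return "chinese";               // Han only: assume Chinese
    return "other";
}

// The caller would buffer enough text to cover the longest plausible
// kanji-only word (plus some padding for tags) before deciding.
guessLanguage("北京に住んでいます"); // => "japanese" (kana present)
guessLanguage("我住在北京");         // => "chinese"
```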

cgiffard commented 10 years ago

@yangl1996 @liushuping What's your expectation for a multi-character word like '北京'? Do you consider that one word or two?

yangl1996 commented 10 years ago

It is two words. Actually every single Chinese character is a word. Thanks ;)

yangl1996 commented 10 years ago

If there is anything I can help with as a native Chinese speaker, I am more than glad to help. :P

adam-zethraeus commented 10 years ago

Since there'd still be clear issues when one is quoting text in a language different from one's base language, I think taking user input to define what language the text should be treated as makes the most sense. Perhaps even the ability to specify the language per segment?

Actually doing language detection in Downsize seems like a huge can of worms to me. Perhaps there's a good project somewhere that can identify the segments of an article that are in different languages and provide that metadata? In terms of function modularization, I'd strongly advocate for making or using an external project rather than baking the functionality into Downsize.

cgiffard commented 10 years ago

I think, reading the Unicode text-segmentation report/whitepaper, that building a proper solution is not just hard — it might actually be computationally impossible. I think I could probably build a very naive set of rules for Japanese, which would provide a somewhat ugly but workable solution for those users.

If Chinese users are happy for text to be truncated character-wise, we could create an option { chinese: true } which turns split-by-character on, but then any English would be broken in that way too. Alternatively, we could increment the word counter on every hanzi character using a simple range check.

Either way, this is probably going to be the hardest bug to fix... I've ever hit... in my life. If the Unicode body thinks it's impossible...

liushuping commented 10 years ago

@cgiffard Character-wise truncation for Chinese is expected. However, if the content is mixed-language (for instance English + Chinese), we expect English to be truncated word-wise while Chinese is truncated character-wise.

For your information, Chinese characters are normally in the range \u4e00 to \u9fa5. You may find more information at http://www.unicode.org/charts/PDF/U4E00.pdf
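
For example, a counter along these lines (just a rough sketch using that range) would match the expectation for mixed content:

```js
// Sketch only: count each Chinese character as one word, and each
// space-delimited run of non-Chinese text as one word.
function countMixed(text) {
    var hanzi = (text.match(/[\u4E00-\u9FA5]/g) || []).length;
    var latinWords = (text
        .replace(/[\u4E00-\u9FA5]/g, " ")   // strip hanzi so they don't glue Latin runs together
        .match(/\S+/g) || []).length;
    return hanzi + latinWords;
}

countMixed("敏捷的棕毛狐狸从懒狗身上跃过");   // => 14
countMixed("我在 Beijing 住了 3 年");        // => 7 (5 hanzi + "Beijing" + "3")
```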

cgiffard commented 10 years ago

I think the killer is still going to be Chinese-vs-Japanese text counting. It's easy to look at the following text and split it by character:

我住在北京 = 5 words

But when hit with Japanese text that uses many of the exact same kanji/hanzi characters alongside kana, the expectation is totally different.

北京に住んでいます → 北京に + 住んでいます = 2 words

So there've got to be at least three different counting rules. We can have them set by a flag, but it isn't going to be straightforward even then.

adam-zethraeus commented 10 years ago

It's pretty clear that making a solution that works for all languages is out of scope: language-guessing heuristics (ergh, look at the size of the Java projects that do this), many rule sets, and months of work.

However, is it the case that, as a simple heuristic, one Chinese character === one word holds for most writing on the internet?

If adding just a third counting type, say 'simple-chinese', that counts Anglo words and Chinese characters alike as words would make downsize usable for the majority of casual online Chinese writing (i.e. blogs), maybe it's a useful (if totally wrong) heuristic we could use.

@yangl1996 @liushuping: Would this heuristic be an improvement for the majority of Chinese blogs?

cgiffard commented 10 years ago

How about an option like { breakHanzi: true/false } which toggles between the Japanese and Chinese Hanzi/Kanji breaking modes? Then for the Chinese mode we could increment the word counter on every hanzi character, and in Japanese fall back to a naive set of segmentation rules instead.

That leaves Arabic, Thai, Hindi... and dozens of other non-Latin scripts unaccounted for. But with even a rough set of rules we can cover an additional 1.5 billion citizens of Earth, approximately. I figure it's a good enough start. :)
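
Concretely, I'm imagining something like this sketch (I'm reading breakHanzi: true as the Chinese, per-character mode; the Japanese rule is just the naive placeholder I mentioned, not anything authoritative):

```js
// Sketch only: nothing here is final. The Japanese rule is deliberately naive:
// a run of kanji plus any trailing kana counts as a single word.
function countCJKWords(text, options) {
    if (options.breakHanzi) {
        // Chinese mode: every Han character is its own word.
        return (text.match(/[\u4E00-\u9FFF]/g) || []).length;
    }
    // Japanese mode (naive): kanji run + trailing kana run = one word,
    // and bare kana runs count as one word each.
    return (text.match(/[\u4E00-\u9FFF]+[\u3040-\u30FF]*|[\u3040-\u30FF]+/g) || []).length;
}

countCJKWords("我住在北京", { breakHanzi: true });          // => 5
countCJKWords("北京に住んでいます", { breakHanzi: false }); // => 2
```

It will certainly get some Japanese wrong, but per the discussion above a rough rule seems like a liveable trade-off.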

adam-zethraeus commented 10 years ago

Doing it extensibly is going to be tricky, but I like the 1.5 billion thing :)

cgiffard commented 10 years ago

Once you're happy with #16, I might attack this. Sorry for my absence over here!

Arch1tect commented 8 years ago

Wow, I never thought this would be so complex... Let me know if you need any help. I have a blog written in both Chinese and English here: Lifeislikeaboat.com, and I'm using Ghost, which uses Downsize.