Question: Text Statistics for Other Languages (Korean)

jeremy-beasley commented 10 years ago

Slightly tangential so apologies in advance.

Do any of you know any text statistics that work with other languages? I'm looking for modifications of these metrics that would work with Korean.

Thanks for any pointers you could give me.

cgiffard commented 10 years ago

Unfortunately, I have absolutely no idea. I speak Japanese (crudely) and so I understand the problem space a little bit, but I'm in no way equipped to give you an answer on it. :(

Furthermore (I'm not sure how this is for Korean, so correct me if I'm wrong), even word breaking in Asian languages usually requires the use of a dictionary, so if I wanted to do something similar accurately I'd need to start there.

The algorithms used here typically consider words complex by their number of syllables (it's a shortcut, but it works, broadly). In order to do the same in Korean you might have to have a Hangul lookup map with the syllabic complexity of each word... Thoughts?

In any case I don't think I'm really equipped to help you with your problem, but thanks for asking nonetheless. It's an interesting one!
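
As a rough illustration of the syllable idea for Korean: precomposed Hangul syllable blocks sit in the Unicode range U+AC00 to U+D7A3, and each block is written as one syllable, so a per-word lookup map may not be needed just to count syllables. The sketch below assumes that, and it also assumes that splitting on whitespace is an acceptable stand-in for word breaking (Korean text is spaced, but each chunk can carry attached particles). The function names are hypothetical and are not part of TextStatistics.js.

```javascript
// Sketch only: these helpers are hypothetical, not TextStatistics.js API.

// Precomposed Hangul syllable blocks occupy U+AC00..U+D7A3 in Unicode.
const HANGUL_FIRST = 0xac00;
const HANGUL_LAST = 0xd7a3;

// Count characters that fall inside the Hangul Syllables block.
// Non-Hangul characters (Latin letters, digits, punctuation) are ignored.
function countHangulSyllables(text) {
  let count = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= HANGUL_FIRST && cp <= HANGUL_LAST) count += 1;
  }
  return count;
}

// Assumption: whitespace-delimited chunks approximate words.
// Particles stay attached to their chunk, so this is not morphological analysis.
function splitKoreanWords(text) {
  return text.trim().split(/\s+/).filter(Boolean);
}

// Average Hangul syllables per whitespace-delimited chunk; 0 for empty input.
function averageSyllablesPerWord(text) {
  const words = splitKoreanWords(text);
  if (words.length === 0) return 0;
  return countHangulSyllables(text) / words.length;
}

console.log(countHangulSyllables("안녕하세요")); // 5
```

Whether syllable count is a meaningful proxy for word difficulty in Korean is a separate question that this sketch does not answer.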

jeremy-beasley commented 10 years ago

Hey, Chris.

Thanks for the quick response. I suspected the points you made about (1) word breaking and (2) syllables as a proxy for word complexity. I also reached out to other computational linguists to get their POVs. Will report back what I hear. Maybe there’s a way for me to extend what you’ve already done.

Stand by.

Amandysha commented 10 years ago

Hi @cgiffard! Does text statistics support the Spanish language, please?

cgiffard commented 10 years ago

I don't speak Spanish, I'm afraid, and I'm not sure whether any of the algorithms in question actually work with non-English languages.

Do you have any thoughts as to where to start?

Amandysha commented 10 years ago

No problem @cgiffard. I just confirmed that the SMOG formula, originally developed and tested in English, is also valid for texts written in Spanish and French. Thank you!
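
For reference, the standard SMOG formula depends only on two counts, the number of polysyllabic words (three or more syllables) and the number of sentences, which is why it could in principle be reused once those counts are available for another language. A minimal sketch follows; `smogGrade` is a hypothetical helper, not part of TextStatistics.js, and it assumes the caller supplies counts from a syllable counter suited to the language in question.

```javascript
// Sketch only: smogGrade is a hypothetical helper, not TextStatistics.js API.
// Standard SMOG formula:
//   grade = 1.0430 * sqrt(polysyllables * (30 / sentences)) + 3.1291
// Assumption: polysyllableCount was produced by a language-appropriate
// syllable counter; English heuristics will miscount Spanish or French words.
function smogGrade(polysyllableCount, sentenceCount) {
  if (sentenceCount <= 0) {
    throw new RangeError("sentenceCount must be positive");
  }
  return 1.043 * Math.sqrt(polysyllableCount * (30 / sentenceCount)) + 3.1291;
}

// Example: 30 polysyllabic words found in a 30-sentence sample.
console.log(smogGrade(30, 30).toFixed(2));
```

The constants here are the ones published for English, so any grade computed for other languages with them should be treated as an estimate.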

cgiffard commented 10 years ago

Thanks @Amandysha.

I'm closing this issue as I think supporting every language is out of scope... for now.

DonaldTsang commented 4 years ago

@jeremybeasley @Amandysha please provide documentation for such tests, and ask for a Python implementation in https://github.com/shivam5992/textstat first; that way people can understand these statistics better.

cgiffard / TextStatistics.js

Question: Text Statistics for Other Languages (Korean) #3