Open abatyuk opened 9 years ago
I've noticed that, actually. I need to investigate why. I'm not very good with character encoding issues ;)
Dean Wampler, Ph.D. Typesafe "Functional Programming for Java Developers", "Programming Scala", and "Programming Hive" - all from O'Reilly twitter: @deanwampler, @chicagoscala http://typesafe.com http://polyglotprogramming.com
On Fri, Feb 6, 2015 at 8:42 AM, Andrey Batyuk notifications@github.com wrote:
// Exercise: Use other versions of the Bible:
// The data directory contains similar files for the Tanach (t3utf.dat - in Hebrew),
Doesn't count any Hebrew words actually, same with Cyrillic - it only counted numbers and sup/font etc.
— Reply to this email directly or view it on GitHub https://github.com/deanwampler/spark-workshop/issues/11.
I'll see what I can do over the weekend - I have a few ideas about how to investigate.
I did a little reading and the issue is probably the underlying Hadoop API. `SparkContext.textFile` uses the Hadoop `Text` type, a subtype of `Writable`. `Text` is only designed for UTF-8. UTF-8 itself can represent Hebrew and Cyrillic, so this would only explain the problem if those data files are stored in another encoding such as UTF-16, unless I'm mistaken.
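A quick way to test the encoding theory above, as a rough sketch rather than a diagnosis: round-trip some Hebrew and Cyrillic text through UTF-8, then compare an ASCII-only word split with a Unicode-aware one. Everything named here (`EncodingCheck`, `roundTrip`, `asciiWords`, `unicodeWords`, the sample line) is made up for illustration, and the `\W+` split is only an assumption about how the exercise tokenizes its input; I haven't confirmed that in this thread. It also doesn't exercise `textFile` itself.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object EncodingCheck {
  // Hypothetical sample data: ASCII markup mixed with Hebrew and Cyrillic words.
  val hebrew = "בראשית"
  val cyrillic = "начале"
  val line = s"<sup>1</sup> $hebrew $cyrillic 42"

  // Encode to UTF-8 bytes and decode again; UTF-8 is the encoding Hadoop's Text assumes.
  def roundTrip(s: String): String = new String(s.getBytes(UTF_8), UTF_8)

  // ASCII-only split (assumed tokenizer): by default, Java's \W matches
  // every character outside [a-zA-Z_0-9], including Hebrew and Cyrillic letters.
  def asciiWords(s: String): Seq[String] =
    s.split("""\W+""").filter(_.nonEmpty).toSeq

  // Unicode-aware split: break on runs of characters that are
  // neither letters (\p{L}) nor digits (\p{N}) in any script.
  def unicodeWords(s: String): Seq[String] =
    s.split("""[^\p{L}\p{N}]+""").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    println(roundTrip(line) == line)            // true: UTF-8 preserves both scripts
    println(asciiWords(line).mkString(", "))    // sup, 1, sup, 42
    println(unicodeWords(line).mkString(", "))  // also includes the Hebrew and Cyrillic words
  }
}
```

If the real data behaves like this sample, the symptom reported in this issue (only numbers and tag names such as sup/font being counted) would point at the word-splitting pattern rather than at `Text`. Note that pointed Hebrew text also contains combining marks, which `\p{L}` does not cover, so a real fix might need `\p{M}` in the character class as well.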