deanwampler / spark-scala-tutorial

A free tutorial for Apache Spark.

WordCount2 - doesn't work with non-Ascii characters #11

Open abatyuk opened 9 years ago

abatyuk commented 9 years ago
// Exercise: Use other versions of the Bible:
//   The data directory contains similar files for the Tanach (t3utf.dat - in Hebrew),

It doesn't actually count any Hebrew words; the same happens with Cyrillic. Only numbers and markup tokens like sup/font get counted.

deanwampler commented 9 years ago

I've noticed that, actually. I need to investigate why. I'm not very good with character encoding issues ;)

Dean Wampler, Ph.D. Typesafe "Functional Programming for Java Developers", "Programming Scala", and "Programming Hive" - all from O'Reilly twitter: @deanwampler, @chicagoscala http://typesafe.com http://polyglotprogramming.com


abatyuk commented 9 years ago

I'll see what I can do over the weekend; I have a few ideas for how to investigate.


deanwampler commented 9 years ago

I did a little reading, and the issue probably involves the underlying Hadoop API. SparkContext.textFile uses the Hadoop Text type, a subtype of Writable, which is designed for UTF-8 only. That said, Hebrew and Cyrillic are representable in UTF-8, so an ASCII-only tokenizer is another likely suspect, unless I'm mistaken.
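A minimal sketch of the tokenizer hypothesis (this is not the tutorial's actual WordCount2 code, and the object name and sample line are made up for illustration): if words are split on an ASCII-only character class like `[^a-zA-Z]+`, Hebrew and Cyrillic words vanish even when the file itself decodes fine as UTF-8. Java's regex engine supports Unicode property classes, so splitting on `[^\p{L}\p{M}]+` (non-letters, excluding combining marks such as Hebrew vowel points) keeps words in any script.

```scala
object TokenizeCheck {
  def main(args: Array[String]): Unit = {
    // Sample line mixing Latin, Hebrew (with vowel points), and Cyrillic.
    val line = "In the beginning בְּרֵאשִׁית сотворил"

    // ASCII-only split: only runs of Latin letters survive.
    val asciiWords = line.split("""[^a-zA-Z]+""").filter(_.nonEmpty)

    // Unicode-aware split: \p{L} matches letters in any script,
    // \p{M} keeps combining marks attached to their word.
    val unicodeWords = line.split("""[^\p{L}\p{M}]+""").filter(_.nonEmpty)

    println(asciiWords.length)   // 3 -- Hebrew and Cyrillic words dropped
    println(unicodeWords.length) // 5 -- all words kept
  }
}
```

The same pattern change would apply inside the Spark job's `flatMap` over lines; whether that fixes this issue depends on what WordCount2 actually splits on.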