amplab / training

Training materials for Strata, AMP Camp, etc

ampcamp6: parquet read & tachyon config bugs #207

Open tranlm opened 8 years ago

tranlm commented 8 years ago

While running through the exercise code, I found the following issues:

Data Exploration using Spark SQL page: 1) "parquetFile" has been deprecated; the code should be changed to wikiData = sqlCtx.read.parquet("data/wiki_parquet")

Explore In-Memory Data Store Tachyon page:
1) The "tachyon" folder is now a subfolder of spark.
2) TACHYON_WORKER_MEMORY_SIZE is already set to 1GB.
3) When I try to format the storage using the command "tachyon format", the class tachyon.Format cannot be found. To fix this:

  export TACHYON_JARS="$TACHYON_HOME/../lib/tachyon-assemblies-${VERSION}-jar-with-dependencies.jar"

4) The command "tachyon runTests" fails all the tests.
5) In the section "Run Spark on Tachyon", the command "./bin/spark-shell" is specific to Scala. It should be generalized for users of other languages, e.g. Python.

Querying compressed RDDs with Succinct Spark page:
1) Correct "articleIds.count" to "articleIdsRDD.count".
2) "val succinctWikiKV = wikiKV.map(t => (t._1, t._2.getBytes).succinctKV" is missing a closing parenthesis ")" after "getBytes)", i.e. it should read "val succinctWikiKV = wikiKV.map(t => (t._1, t._2.getBytes)).succinctKV".
3) Combine

val wikiKV2 = sc.textFile("data/succinct/wiki-large.txt")
    .map(_.split('|'))
    .map(t => (t(0).toLong, t(1)))

into one line

val wikiKV2 = sc.textFile("data/succinct/wiki-large.txt").map(_.split('|')).map(t => (t(0).toLong, t(1)))

4) Change

val wikiSuccinctKV2 = sc.succinctKV[Long]("data/succinct/succinct-wiki-large")
wikiSuccinctKV2.count

to

val succinctWikiKV2 = sc.succinctKV[Long]("data/succinct/succinct-wiki-large")
succinctWikiKV2.count

5) Change "val articleIdsRDD3= succinctWikiKV3.regexSearch("(stanford|berkeley).edu")" to "val articleIdsRDD3= succinctWikiKV2.regexSearch("(stanford|berkeley).edu")"

gostevehoward commented 8 years ago

The line

export TACHYON_JARS="$TACHYON_HOME/../lib/tachyon-assemblies-${VERSION}-jar-with-dependencies.jar"

is an edit to a line in spark/tachyon/libexec/tachyon-config.sh
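
A "class tachyon.Format cannot be found" error may indicate that the jar named in TACHYON_JARS is not at the expected location. The following is a minimal sketch for checking that path before running "tachyon format". Note that check_tachyon_jar is a hypothetical helper, not part of Tachyon; it assumes the jar name pattern from the export line above and takes the TACHYON_HOME and VERSION values as arguments.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tachyon): report whether the assembly jar
# that the export line above points at actually exists.
#   $1 = value of TACHYON_HOME
#   $2 = value of VERSION
check_tachyon_jar() {
    # Same path pattern as the TACHYON_JARS line quoted in this issue.
    jar="$1/../lib/tachyon-assemblies-$2-jar-with-dependencies.jar"
    if [ -f "$jar" ]; then
        echo "found: $jar"
    else
        echo "missing: $jar"
        return 1
    fi
}
```

It could be called as check_tachyon_jar "$TACHYON_HOME" "$VERSION" once both variables are set in the current shell; where they get set in a given install is an assumption to verify locally.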