databricks / LearningSparkV2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics (2nd Edition)
https://learning.oreilly.com/library/view/learning-spark-2nd/9781492050032/
Apache License 2.0

Example 3_7 - Scala Issue #65

Closed · KMontano18 closed this issue 3 years ago

KMontano18 commented 3 years ago

Unable to get Scala code to read blogs.json via the Schema definition provided in the book.

[Code]

```scala
package main.scala.chapter3

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, expr}

object Example3_7 {
  def main(args: Array[String]) {

    val spark = SparkSession
      .builder
      .appName("Example-3_7")
      .getOrCreate()

    if (args.length <= 0) {
      println("usage Example3_7 <file path to blogs.json>")
      System.exit(1)
    }

    // Get the path to the JSON file
    val jsonFile = args(0)

    // Define our schema programmatically
    val schema = StructType(Array(
      StructField("Campaigns", ArrayType(StringType), true),
      StructField("First", StringType, true),
      StructField("Hits", LongType, true),
      StructField("Id", LongType, true),
      StructField("Last", StringType, true),
      StructField("Published", StringType, true),
      StructField("Url", StringType, true)
    ))

    // Create a DataFrame by reading the JSON file with the predefined schema
    val blogsDF = spark.read.schema(schema).json(jsonFile)
    // Show the DataFrame contents
    blogsDF.show(false)
    // Print the schema
    blogsDF.printSchema()
    println(blogsDF.schema)
  }
}
```

[End Code]

I tried both LongType and IntegerType for Hits and Id, but both generated the following error on my machine:

```
kmontano18@DESKTOP-PRKRT1A:~$ spark-shell -i scalaSchema.scala blogs.json
blogs.json:1: error: identifier expected but integer literal found.
{"Id":1, "First": "Jules", "Last":"Damji", "Url":"https://tinyurl.1", "Published":"1/4/2016", "Hits": 4535, "Campaigns": ["twitter", "LinkedIn"]}
```
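For reference, here is a minimal spark-shell variant of the same code, without the `object`/`main` wrapper and the `args` handling (the path to blogs.json is hard-coded, which is an assumption about where the file lives):

```scala
// Interactive variant for spark-shell: the shell already provides `spark`,
// so there is no main() and the JSON path is hard-coded instead of args(0).
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("Campaigns", ArrayType(StringType), true),
  StructField("First", StringType, true),
  StructField("Hits", LongType, true),
  StructField("Id", LongType, true),
  StructField("Last", StringType, true),
  StructField("Published", StringType, true),
  StructField("Url", StringType, true)
))

val blogsDF = spark.read.schema(schema).json("blogs.json")
blogsDF.show(false)
blogsDF.printSchema()
```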

However, when letting Spark infer the schema on read, it produced the expected schema, albeit with the columns in a different order:

```
scala> val df = spark.read.json("blogs.json")
df: org.apache.spark.sql.DataFrame = [Campaigns: array<string>, First: string ... 5 more fields]
```

```
scala> df.printSchema()
root
 |-- Campaigns: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- First: string (nullable = true)
 |-- Hits: long (nullable = true)
 |-- Id: long (nullable = true)
 |-- Last: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Url: string (nullable = true)
```
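Since the inferred schema matches the expected one, another option is to take the DDL form of the inferred schema and feed it back to the reader as a string; a minimal sketch, assuming a spark-shell session where `spark` and the `df` above are in scope:

```scala
// Print the inferred schema as a DDL string (StructType.toDDL, Spark 2.4+)
println(df.schema.toDDL)

// DataFrameReader.schema also accepts a DDL-formatted string (Spark 2.3+),
// so the explicit schema can be written without spelling out the StructType
val blogsDF = spark.read
  .schema("Id INT, First STRING, Last STRING, Url STRING, Published STRING, Hits INT, Campaigns ARRAY<STRING>")
  .json("blogs.json")
blogsDF.printSchema()
```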


dmatrix commented 3 years ago

@KMontano18 The schema defined in the book is not what you have in the above code.

Check page 53:

[Screenshot: schema definition from page 53]

brookewenig commented 3 years ago

Please use this schema from page 53:


```scala
val schema = StructType(Array(
  StructField("Id", IntegerType, false),
  StructField("First", StringType, false),
  StructField("Last", StringType, false),
  StructField("Url", StringType, false),
  StructField("Published", StringType, false),
  StructField("Hits", IntegerType, false),
  StructField("Campaigns", ArrayType(StringType), false)))
```