databricks / spark-avro

Avro Data Source for Apache Spark
http://databricks.com/
Apache License 2.0

Invalid: RDD[String] to avro #162

Closed msrvp closed 7 years ago

msrvp commented 8 years ago

I have an RDD[String] (each element is actually a JSON string).

I wrote the following code:

case class ARecord(val rec: String)
val ardd = resultRDD.map(x => ARecord(x))
val sdf = sqlContext.createDataFrame(ardd, ARecord.getClass)
sdf.write.format("com.databricks.spark.avro").save(avroPath)

However, when I try to view the Avro file that was created, it contains no records:

Objavro.schemaj{"type":"record","name":"topLevelRecord","fields":[]}avro.codec snappy
(the rest of the file is binary sync-marker and codec data; note the empty "fields":[] in the embedded schema)

The DataFrame's content ends up empty.
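A quick way to see this (a hypothetical check, not part of the original report) is to print the schema; the bean-based overload finds no bean properties on the case class, so the resulting struct has no fields:

sdf.printSchema()
// prints only:
// root
// i.e. a struct with no fields, matching the empty "fields":[] in the Avro header above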

If I create the DataFrame without passing the class (ARecord), it retains all the data; however, on converting to Avro I get the following exception:

java.lang.IllegalAccessError: tried to access class org.apache.avro.SchemaBuilder$FieldDefault from class com.databricks.spark.avro.SchemaConverters$$anonfun$convertStructToAvro$1
JoshRosen commented 7 years ago

I looked into this, and the problem is that your sdf DataFrame doesn't have the right schema.

You're using the createDataFrame overload that accepts a JavaBean class and looks for bean properties, but Scala case classes don't follow the JavaBean spec by default (see http://alvinalexander.com/scala/scala-javabeans-beanproperty-annotation), so no fields are discovered and the schema comes out empty.
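As an illustration (a sketch, not from the original comment), annotating the field with scala.beans.BeanProperty generates the JavaBean-style getter that this overload reflects on. Note also that ARecord.getClass in the original snippet is the class of the companion object; the bean-based overload expects the class literal classOf[ARecord]:

import scala.beans.BeanProperty

// @BeanProperty generates a getRec() getter, satisfying the JavaBean convention
case class ARecord(@BeanProperty val rec: String)

val ardd = resultRDD.map(ARecord(_))
// Pass the class literal, not the companion object's class
val sdf = sqlContext.createDataFrame(ardd, classOf[ARecord])
sdf.printSchema()
// root
//  |-- rec: string (nullable = true)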

Instead, you should use the createDataFrame overload which takes an implicit TypeTag for a Product subtype, e.g. val sdf = sqlContext.createDataFrame[ARecord](ardd).
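Putting it together, a minimal end-to-end sketch (assuming Spark 1.x with spark-avro on the classpath; resultRDD and avroPath stand in for the reporter's values):

case class ARecord(rec: String)

val resultRDD = sc.parallelize(Seq("""{"id":1}""", """{"id":2}"""))  // stand-in JSON strings
val ardd = resultRDD.map(ARecord(_))

// Product-based overload: the schema (one string field, "rec") is derived
// from the case class fields via an implicit TypeTag, so no JavaBean
// getters are required.
val sdf = sqlContext.createDataFrame(ardd)

// Equivalent shortcut:
//   import sqlContext.implicits._
//   val sdf = ardd.toDF()

val avroPath = "/tmp/arecords.avro"  // stand-in output path
sdf.write.format("com.databricks.spark.avro").save(avroPath)

With the schema in place, the written file's header contains the rec field instead of an empty "fields":[] array.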