databricks / spark-avro

Avro Data Source for Apache Spark
http://databricks.com/
Apache License 2.0

Possible loss of precision when converting from SQL to Avro #29

Open felixcheung opened 9 years ago

felixcheung commented 9 years ago

According to the README, TimestampType is converted to long:

TimestampType -> long

https://github.com/databricks/spark-avro/blob/master/README.md

However, java.sql.Timestamp carries a nanoseconds field that does not fit into a long of epoch milliseconds. Would there be a loss of precision then? Should it be TimestampType -> string instead? http://docs.oracle.com/javase/7/docs/api/java/sql/Timestamp.html

felixcheung commented 9 years ago

AvroSaver.scala calls Timestamp.getTime, which returns only the milliseconds and drops the nanoseconds value (see getNanos).
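
For illustration, the split is visible directly on java.sql.Timestamp: getTime carries only millisecond precision, while the sub-millisecond digits live in the separate nanos field:

import java.sql.Timestamp

// A timestamp with sub-millisecond precision:
val ts = Timestamp.valueOf("2016-09-13 17:59:18.207123456")

val millis = ts.getTime   // epoch millis ending in ...207; the 123456 ns tail is gone
val nanos  = ts.getNanos  // 207123456 -- the full fractional second, in nanoseconds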

vlyubin commented 9 years ago

True. Avro doesn't have a timestamp type, so the two options are either using a string (but then when you want to use it, you would probably want to convert the string back to something meaningful), or using a long representation, which is meaningful but loses precision. We decided to go with the second option. It's unlikely that we can change it at this point, as someone may be relying on the current behavior. If for your purposes you need the extra accuracy, you can preprocess your RDD to replace the timestamp column with strings that contain the necessary information. That way you won't lose precision and you're free to choose the representation that suits your needs best.
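
A minimal sketch of that preprocessing, with hypothetical Event/EventForAvro record types for illustration. Timestamp.toString keeps the full nanosecond field, so converting before the DataFrame is built avoids any truncation:

import java.sql.Timestamp
import org.apache.spark.rdd.RDD

// Hypothetical record types for illustration.
case class Event(id: Long, ts: Timestamp)
case class EventForAvro(id: Long, ts: String)

// Replace the timestamp field with its full-precision string form
// (e.g. "2016-09-13 17:59:18.207123456") before writing to Avro.
def toAvroRows(events: RDD[Event]): RDD[EventForAvro] =
  events.map(e => EventForAvro(e.id, e.ts.toString))

// toAvroRows(events).toDF().write.avro("/path/to/output.avro")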

felixcheung commented 9 years ago

I understand. Is there a way to have a config switch to map timestamp -> string? I'd be happy to contribute.

felixcheung commented 9 years ago

Also, frameworks in general have very good support for string -> timestamp-like types. Impala, for example, can handle the conversion automatically: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/impala_timestamp.html#timestamp

" Conversions: Impala automatically converts STRING literals of the correct format into TIMESTAMP values. Timestamp values are accepted in the format YYYY-MM-DD HH:MM:SS.sssssssss, and can consist of just the date, or just the time, with or without the fractional second portion. For example, you can specify TIMESTAMP values such as '1966-07-30', '08:30:00', or '1985-09-25 17:45:30.005'. You can cast an integer or floating-point value N to TIMESTAMP, producing a value that is N seconds past the start of the epoch date (January 1, 1970). "

naegelejd commented 8 years ago

Hi folks, it looks like this issue explains the following behavior. I'm using Spark 1.6.2 with spark-avro 2.0.1:

$ spark-shell --packages "com.databricks:spark-avro_2.10:2.0.1"
scala> import com.databricks.spark.avro._
scala> import java.sql.Timestamp
scala> import java.time.Instant
scala> case class Foo(ts: Timestamp)
scala> val foos = Seq.fill(100)(Foo(Timestamp.from(Instant.now())))
scala> val df = sc.parallelize(foos).toDF()
scala> df.write.orc("hdfs:///user/vagrant/tmp.orc")
scala> df.write.avro("hdfs:///user/vagrant/tmp.avro")
scala> val orcFoos = sqlContext.read.orc("hdfs:///user/vagrant/tmp.orc").as[Foo]
scala> val avroFoos = sqlContext.read.avro("hdfs:///user/vagrant/tmp.avro").as[Foo]
scala> orcFoos.take(10)
res4: Array[Foo] = Array(Foo(2016-09-13 17:59:18.207), Foo(2016-09-13 17:59:18.207), Foo(2016-09-13 17:59:18.207), Foo(2016-09-13 17:59:18.207), Foo(2016-09-13 17:59:18.207), Foo(2016-09-13 17:59:18.207), Foo(2016-09-13 17:59:18.207), Foo(2016-09-13 17:59:18.207), Foo(2016-09-13 17:59:18.207), Foo(2016-09-13 17:59:18.207))

scala> avroFoos.take(10)
res5: Array[Foo] = Array(Foo(48672-07-13 12:23:27.0), Foo(48672-07-13 12:23:27.0), Foo(48672-07-13 12:23:27.0), Foo(48672-07-13 12:23:27.0), Foo(48672-07-13 12:23:27.0), Foo(48672-07-13 12:23:27.0), Foo(48672-07-13 12:23:27.0), Foo(48672-07-13 12:23:27.0), Foo(48672-07-13 12:23:27.0), Foo(48672-07-13 12:23:27.0))

Note that the Avro "Foos" contain incorrectly deserialized timestamps: the epoch milliseconds written by spark-avro appear to be reinterpreted as epoch seconds on the way back, which pushes the dates tens of thousands of years into the future (year 48672).

How can I work around this issue?
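
One possible workaround, assuming the file holds the epoch milliseconds that spark-avro writes: read the column back as a plain long and rebuild the timestamp yourself, since Spark's numeric-to-timestamp cast treats its input as seconds:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType
import com.databricks.spark.avro._

// The Avro file stores epoch milliseconds; Spark's numeric -> timestamp
// cast assumes seconds, so divide by 1000 before casting.
val raw   = sqlContext.read.avro("hdfs:///user/vagrant/tmp.avro")
val fixed = raw.withColumn("ts", (col("ts") / 1000).cast(TimestampType))

// fixed.as[Foo] now round-trips with millisecond precision.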