databricks / spark-avro

Avro Data Source for Apache Spark
http://databricks.com/
Apache License 2.0

Add Date and Timestamp data types support when reading avro fields #253

Closed: viirya closed this 7 years ago

viirya commented 7 years ago

Related issue: #229
Related Spark JIRA ticket: https://issues.apache.org/jira/browse/SPARK-22460

It seems somewhat inconvenient to read Avro files that were written from a DataFrame with a timestamp field.

It might be easier to read such fields if we could explicitly ask this data source to interpret a field as a timestamp type.

This PR also adds support for the date type.

codecov-io commented 7 years ago

Codecov Report

Merging #253 into master will increase coverage by 0.05%. The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #253      +/-   ##
==========================================
+ Coverage   90.71%   90.77%   +0.05%     
==========================================
  Files           5        5              
  Lines         334      336       +2     
  Branches       50       50              
==========================================
+ Hits          303      305       +2     
  Misses         31       31
viirya commented 7 years ago

To support the use case @saniyatech mentioned, timestamp should be straightforward because it is stored as a long value. We can add the logical type Timestamp (millisecond precision) to Avro's long schema.
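As a rough illustration of that idea, here is a minimal sketch of an Avro record schema whose long field carries the `timestamp-millis` logical type annotation defined in the Avro specification (1.8 and later). The record name `Event`, the field names, and the helper `logical_type_of` are hypothetical and not taken from this PR.

```python
import json

# Hypothetical record and field names, for illustration only.
record_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        # timestamp-millis annotates an Avro long: milliseconds since the epoch.
        {
            "name": "event_time",
            "type": {"type": "long", "logicalType": "timestamp-millis"},
        },
    ],
}


def logical_type_of(schema, field_name):
    """Return the logicalType annotation of a field, or None if it has none."""
    for field in schema["fields"]:
        if field["name"] == field_name:
            field_type = field["type"]
            if isinstance(field_type, dict):
                return field_type.get("logicalType")
            return None
    raise KeyError(field_name)


# The schema serializes to ordinary Avro schema JSON.
schema_json = json.dumps(record_schema, indent=2)
```

The underlying type stays `long`, so the stored values do not change; only the annotation is added.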

Date may be a real problem. I don't know why spark-avro also stores a date as a long value in milliseconds. Avro's logical type Date is an Avro int, and I think the logical type property isn't compatible with Avro's long schema...

Changing a date field from long to int would risk breaking backward compatibility.
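To make the mismatch concrete, the sketch below encodes the same date both ways: as days since the epoch (what Avro's `date` logical type expects in an int) and as milliseconds since the epoch (a long, as described above). It assumes UTC midnight for the millisecond form; the function names are illustrative, not part of spark-avro.

```python
from datetime import date, timedelta

MILLIS_PER_DAY = 24 * 60 * 60 * 1000
INT32_MAX = 2**31 - 1
EPOCH = date(1970, 1, 1)


def date_to_epoch_days(d):
    """Days since 1970-01-01: fits an Avro int (the `date` logical type)."""
    return (d - EPOCH).days


def date_to_epoch_millis(d):
    """Milliseconds since the epoch at UTC midnight (assumed): needs a long."""
    return (d - EPOCH).days * MILLIS_PER_DAY


def epoch_days_to_date(days):
    return EPOCH + timedelta(days=days)


def epoch_millis_to_date(millis):
    # Floor division keeps pre-1970 values on the correct calendar day.
    return EPOCH + timedelta(days=millis // MILLIS_PER_DAY)


sample = date(2017, 11, 8)
as_days = date_to_epoch_days(sample)
as_millis = date_to_epoch_millis(sample)
# The millisecond form does not fit in a 32-bit int, so it cannot simply be
# relabelled as an Avro int without rewriting the stored values.
fits_in_int = as_millis <= INT32_MAX
```

Both encodings round-trip to the same date, but a reader has to know which one a file uses, which is why switching the on-disk type affects files already written.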

viirya commented 7 years ago

Btw, it seems that 1.7.6, the Avro version currently used, doesn't support logical types yet. For the reasons above, I think we can only correctly deserialize date/timestamp fields when a Catalyst schema is provided.
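The schema-directed approach can be sketched in plain Python as below. This is only a model of the idea, not the code in this PR: `decode_record` and the type names `"timestamp"`, `"date"`, and `"long"` are hypothetical stand-ins for the Catalyst schema a caller would supply, and the values are assumed to be epoch milliseconds in UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_record(raw, requested_types):
    """Convert raw Avro long values according to a caller-supplied type map.

    `requested_types` maps field name -> "timestamp", "date", or "long".
    Fields missing from the map are returned unchanged, which models reading
    without a user-provided schema.
    """
    decoded = {}
    for name, value in raw.items():
        wanted = requested_types.get(name, "long")
        if wanted == "timestamp":
            decoded[name] = EPOCH_UTC + timedelta(milliseconds=value)
        elif wanted == "date":
            decoded[name] = (EPOCH_UTC + timedelta(milliseconds=value)).date()
        else:
            decoded[name] = value
    return decoded


raw = {"id": 7, "created": 86_400_000, "day": 86_400_000}
with_schema = decode_record(raw, {"created": "timestamp", "day": "date"})
without_schema = decode_record(raw, {})
```

Without the type map every field stays a long, since the file itself carries nothing that marks a field as a date or timestamp.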

gengliangwang commented 7 years ago

Thanks, merged to master.

viirya commented 7 years ago

Thanks @gengliangwang @saniyatech for the review.

rondefreitas commented 6 years ago

@gengliangwang @saniyatech any word on when this will be cut into a release? It's been nearly a year since this was fixed, but it's still broken in 4.0.

gengliangwang commented 6 years ago

@rdefreitas There probably won't be a new release, as this repo has been migrated into Spark 2.4 as a built-in data source module.

rondefreitas commented 6 years ago

@gengliangwang any idea when that release is scheduled or where I can find that?

gengliangwang commented 6 years ago

It should be released this October, I think.
You can try it by building the latest Apache Spark (master branch or branch-2.4).