databricks / spark-avro

Avro Data Source for Apache Spark

http://databricks.com/

Apache License 2.0

539 stars 310 forks source link

Pick last file sorted by path for schema #269

Open koertkuipers opened 6 years ago

koertkuipers commented 6 years ago

Picking the same file consistently for schema avoids weird bugs where the schema of an avro data source changes randomly or unexpectedly.

codecov-io commented 6 years ago

Codecov Report

Merging #269 into master will increase coverage by 0.4%. The diff coverage is 87.5%.

@@            Coverage Diff            @@
##           master     #269     +/-   ##
=========================================
+ Coverage   92.21%   92.61%   +0.4%     
=========================================
  Files           5        5             
  Lines         321      325      +4     
  Branches       43       41      -2     
=========================================
+ Hits          296      301      +5     
+ Misses         25       24      -1

cwlaird3 commented 6 years ago

Have you considered using the schema from the newest data file to get the most up to date version of the schema? Or perhaps a configuration option to do that? Seems like most would update their schemas in a backwards compatible way and using the most recent schema would expose newer fields in the schema.

koertkuipers commented 6 years ago

t hat is not a bad idea. a switch seems reasonable. i would suggest to do this in a separate branch

On Mon, Jun 4, 2018 at 4:09 PM, Carl Laird notifications@github.com wrote:

Have you considered using the schema from the newest data file to get the most up to date version of the schema? Or perhaps a configuration option to do that? Seems like most would update their schemas in a backwards compatible way and using the most recent schema would expose newer fields in the schema.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/databricks/spark-avro/pull/269#issuecomment-394482304, or mute the thread https://github.com/notifications/unsubscribe-auth/AAyIJDw2Qzkuu_e48jdnUwwy_QxJJfr5ks5t5ZPngaJpZM4SCQnp .

gengliangwang commented 6 years ago

@cwlaird3 good idea @koertkuipers how about by default use the latest AVRO file's schema?

gengliangwang commented 6 years ago

@koertkuipers @cwlaird3 I checked with @liancheng , which is PMC member and one of the original author of Data source project. He doesn't think we should make such assumption. If the schema is different among files, users are supposed to specify the schema: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema

This PR changes the behavior and is possible to cause regression to other users.

koertkuipers commented 6 years ago

currently it uses a random file to pick schema. what would be an example of a user for which you break things by going from a random file to the last file?

cwlaird3 commented 6 years ago

I agree with @koertkuipers .. but if there's still a concern adding a configuration option to change the behavior could address that.

koertkuipers commented 6 years ago

spark-avro already provides a mechanism for the user to provide a schema with the avroSchema key in options

the thing that is currently missing is merging of schemas across all files

cwlaird3 commented 6 years ago

By configuration I meant a flag to enable the behavior you've implemented here - not to provide a schema.

koertkuipers commented 6 years ago

oh a flag to go from random schema to non-random schema? if someone can come up with a user for which this pullreq breaks their usage i am up for that, otherwise no :)

On Fri, Jun 8, 2018 at 2:11 PM, Carl Laird notifications@github.com wrote:

By configuration I meant a flag to enable the behavior you've implemented here - not to provide a schema.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/databricks/spark-avro/pull/269#issuecomment-395843792, or mute the thread https://github.com/notifications/unsubscribe-auth/AAyIJPNX1swwfXR9eCxVyPh_-8fUjgRmks5t6r5AgaJpZM4SCQnp .