koertkuipers opened this pull request 6 years ago (status: Open)
Merging #269 into master will increase coverage by 0.4%. The diff coverage is 87.5%.
@@            Coverage Diff            @@
##           master     #269     +/-   ##
=========================================
+ Coverage   92.21%   92.61%    +0.4%
=========================================
  Files           5        5
  Lines         321      325       +4
  Branches       43       41       -2
=========================================
+ Hits          296      301       +5
+ Misses         25       24       -1
Have you considered using the schema from the newest data file to get the most up-to-date version of the schema? Or perhaps a configuration option to do that? It seems like most users would update their schemas in a backwards-compatible way, and using the most recent schema would expose the newer fields.
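Roughly what I have in mind, as a sketch only (the listing code and helper name here are my own illustration, not anything from this repo):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustration only: pick the most recently modified avro file in a
// directory, so a backwards-compatible schema update (new fields)
// is what gets used for inference. Assumes the directory is non-empty.
def newestAvroFile(dir: Path, conf: Configuration): Path = {
  val fs = dir.getFileSystem(conf)
  val parts = fs.listStatus(dir).filter(_.getPath.getName.endsWith(".avro"))
  parts.maxBy(_.getModificationTime).getPath
}
```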
that is not a bad idea. a switch seems reasonable. i would suggest doing this in a separate branch
@cwlaird3 good idea. @koertkuipers how about using the latest Avro file's schema by default?
@koertkuipers @cwlaird3 I checked with @liancheng, who is a PMC member and one of the original authors of the Data Source project. He doesn't think we should make such an assumption. If the schema differs among files, users are supposed to specify the schema: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
This PR changes the behavior and could cause regressions for other users.
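For example, following that guide, something like this (a sketch; the field names and path are made up, and it assumes a SparkSession named `spark` is in scope):

```scala
import org.apache.spark.sql.types._

// Specify the schema explicitly instead of relying on inference
// from an arbitrary file.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))

val df = spark.read
  .format("com.databricks.spark.avro")
  .schema(schema)
  .load("/path/to/avro") // hypothetical path
```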
currently it uses a random file to pick the schema. what would be an example of a user for whom things would break by going from a random file to the last file?
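for illustration, the gist of the change, as a sketch (not the exact code in this pullreq; the helper name is made up):

```scala
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.mapred.FsInput
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus

// deterministic: sort candidate files by path and read the schema from
// the last one, instead of whatever file the listing happens to yield
def schemaOfLastFile(files: Seq[FileStatus], conf: Configuration): Schema = {
  val lastPath = files.map(_.getPath).maxBy(_.toString)
  val reader = DataFileReader.openReader(
    new FsInput(lastPath, conf), new GenericDatumReader[Any]())
  try reader.getSchema finally reader.close()
}
```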
I agree with @koertkuipers, but if there's still a concern, adding a configuration option to change the behavior could address that.
spark-avro already provides a mechanism for the user to provide a schema, with the avroSchema key in options. the thing that is currently missing is merging of schemas across all files.
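e.g. something like this (a sketch; the record and path are made up, and it assumes a SparkSession named `spark` is in scope):

```scala
// pass a full avro schema as json via the avroSchema option
val avroSchemaJson = """
  {
    "type": "record",
    "name": "User",
    "fields": [
      {"name": "id", "type": "long"},
      {"name": "email", "type": ["null", "string"], "default": null}
    ]
  }
"""

val df = spark.read
  .format("com.databricks.spark.avro")
  .option("avroSchema", avroSchemaJson)
  .load("/data/users") // hypothetical path
```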
By configuration I meant a flag to enable the behavior you've implemented here - not to provide a schema.
oh, a flag to go from a random schema to a non-random schema? if someone can come up with a user whose usage this pullreq breaks i am up for that, otherwise no :)
Picking the same file consistently for the schema avoids weird bugs where the schema of an avro data source changes randomly or unexpectedly.