AbsaOSS / ABRiS

Avro SerDe for Apache Spark structured APIs.
Apache License 2.0
230 stars 75 forks source link

Remove automatic schema generation and registration for Confluent avro (Spark -> Avro) #132

Closed cerveada closed 4 years ago

cerveada commented 4 years ago

One feature of Abris is to generate schema automatically from Spark data frame and register it in the registry, all of this during serialization to avro, but is this something that is actually useful? Schema is generated each time Abris is called (even more because each executor registers its own schema). So as a result there may be a lot of schemas registered.

It seems to me it's always better to register schema beforehand and then just tell Abris coordinates in the registry. So maybe Abris shouldn't ever register schema by itself.

So I think it would make sense to remove this functionality, but I'm not sure if anyone is actually using it this way or find it useful?

felipemmelo commented 4 years ago

I honestly don't remember the main reason because it's been almost 3 years, but one advantage of inferring the schema from the DataFrame and registering it is that you may consume from an unrelated sort and write into a streaming source without caring about schemas and compatibility, since if the new schema does not follow the compatibility rules the operation would fail right away.

There was also a helper function to help in converting from Spark to Avro schemas and then the latter could be used.

Finally, in Abris 2, the schema was stored only once per job, still on the driver side.

So, since we are not sure about the usefulness of this feature, we could probably remove it from the default behaviour but still provide some helper to assist with it, i.e. a separate class that receives a Spark schema and Schema Registry access options and gets the registration done, returning the resulting Avro schema.

kevinwallimann commented 4 years ago

Randomly stumbled upon this issue. For Hyperdrive, auto-registering the schema is a very useful feature. It simply takes away that burden. As a user, the simplicity of just calling to_confluent_avro as opposed to separately first registering the schema is very attractive. However, I can also see that you can end up with many unnecessary calls to the schema registry and the users may want to control it themselves.

cerveada commented 4 years ago

I discussed the issue with Kevin and there seems to be use cases for automatic schema generation as well. It shouldn't be removed, but the functionality should be checked to make sure redundant schemas are not registered.

We should also create tools for user that allows them to check whether schema is already in registry and register it before calling to_avro...

felipemmelo commented 4 years ago

Yep, totally agree. Or we could also just move the registration to outside the Catalyst expression, like here

vphutane commented 4 years ago

Hi @cerveada you mentioned about feature "One feature of Abris is to generate schema automatically from Spark data frame and register it in the registry". I know the first time it will generate the schema with version as v1 in the schema registry. However if there is a change in dataframe with an addition of column, will Abris automatically generate the v2 version of schema with the changed dataframe columns or will it require a schema file to be provided in the to_confluent_avro method

cerveada commented 4 years ago

Hello @vphutane. I already tried to answer that in #140, but maybe I was not clear.

Currently, Abris is not able to automatically generate compatible schema with the previous version of that schema. So in that case you have to provide the schema manually.

cerveada commented 4 years ago

@felipemmelo

Yep, totally agree. Or we could also just move the registration to outside the Catalyst expression, like here

I think that is already happening when possible, but when you generate the schema you don't know the data outside of the expresion, it arrives later inside the expresion and therefore that is the only place where the schema can be generated.

But now I realized the schema could be generated from the data frame. So I think the best option is to provide some tool for user to generate schema from dataframe and colum outside of the toAvro function (dataframe is not accessible form inside anyway). The user can than provide the schema as if he had it from the start.

It should get rid of unwanted schema generations in the expression and also it will simplify Abris. When the schema is provided we can register it only if it isn't registered.

cerveada commented 4 years ago

Solved by #149