nbali opened 1 year ago
I could even contribute this, if it gets the green light that it's conceptually acceptable.
I understand the current behavior is more like promoting/casting INT -> LONG (and it likely happens for other types too) to align the Avro schema with the final BigQuery destination table schema.
If the consideration is cost (INT uses less space than LONG), the BigQuery Storage API is more promising (though that work is ongoing). CC: @reuvenlax
Using `long` instead of `int` in the original class would be an ugly workaround IMO. The developers shouldn't have to consider this loss of information when writing the code. Not to mention it might be impossible for the developer to change the type of the problematic field. (Remapping to a new class is an even uglier workaround.)
The mentioned example was just that: an example. There could be other lost info there as well. Anything that any schema representation contains, Avro could contain as well, but it may not exist in `TableSchema`.
Does the fact that the labels have changed mean my idea is viable and I can contribute it, or just that the issue is valid, and it doesn't say anything regarding my proposed solution? I still only want to contribute if it has a chance of being accepted.
(I would prefer to implement this while I still have hours available for OSS contributions, so any feedback on my previous questions would be appreciated.)
What happened?
When working with `BigQueryIO.Write.withAvroFormatFunction()`, the function's input is an `AvroWriteRequest<T>`, which is essentially `T` together with an Avro `Schema`. The problem is how this provided schema gets created, and how it's being used.

For that, a configurable `withAvroSchemaFactory` method exists, but what it accepts is essentially a `SerializableFunction<@Nullable TableSchema, org.apache.avro.Schema>`, and this is the problem: the `TableSchema` might have already lost context. For example, Avro knows INT and LONG; BQ recognizes no difference. So at the format function we might receive a schema that is different from our desired one, and there is no reversible way to transform it back.

Using our desired schema at the format function doesn't work either, because the `DataFileWriter` in `AvroRowWriter` already uses a `DatumWriter` built with the provided `Schema`, which might fail.
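To make the shape of the API concrete, here is a minimal sketch (not from the original report) of how these two hooks are typically wired together; `pojos`, `MyPojo`, and the `toRow` conversion are hypothetical stand-ins:

```java
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
import org.apache.beam.sdk.schemas.utils.AvroUtils;

// pojos is a PCollection<MyPojo>; MyPojo has an int field.
pojos.apply(
    BigQueryIO.<MyPojo>write()
        .to("project:dataset.table")
        .withAvroFormatFunction(
            request ->
                // request.getSchema() is whatever the schema factory produced
                // from the TableSchema -- not the POJO's own Avro schema.
                AvroUtils.toGenericRecord(toRow(request.getElement()), request.getSchema()))
        .withAvroSchemaFactory(
            (TableSchema tableSchema) ->
                // Only the TableSchema is visible here, so an INT64 field maps
                // to Avro "long" even if the source field was an Avro "int".
                AvroUtils.toAvroSchema(BigQueryUtils.fromTableSchema(tableSchema))));
```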
To give you a more exact example: I had a POJO with an `int` field, and I already had methods to provide me with the Beam `Row` and `Schema`, the Avro `Schema`, and the `GenericRecord` for that POJO, as I need those for other purposes. The `TableSchema` returned by the `DynamicDestinations` obviously contains `long` due to the supported BQ types, so the schema-factory-generated Avro schema also contains `long`. Meanwhile, the `Row` and `GenericRecord` provided by my custom code contain `int`, as the POJO does. ... and you guessed it, an exception happens when it tries to use an `int` as a `long` during writing.

My "methods" actually use `toAvroSchema` and `toGenericRecord` from `org.apache.beam.sdk.schemas.utils.AvroUtils`, so it's not some custom code, but internal Beam code.
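To illustrate the loss concretely, here is a small sketch of the one-way round trip (the `count` field and the builder-built schema are stand-ins, not the reporter's actual POJO):

```java
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;

// Hypothetical Beam schema for the POJO; in practice this would come from
// the SchemaRegistry or a @DefaultSchema-annotated class.
Schema beamSchema = Schema.builder().addInt32Field("count").build();

org.apache.avro.Schema pojoAvroSchema = AvroUtils.toAvroSchema(beamSchema);
// pojoAvroSchema's "count" field is Avro "int".

TableSchema tableSchema = BigQueryUtils.toTableSchema(beamSchema);
// tableSchema's "count" field is INTEGER (INT64): the INT32/INT64
// distinction is already gone at this point.

org.apache.avro.Schema roundTripped =
    AvroUtils.toAvroSchema(BigQueryUtils.fromTableSchema(tableSchema));
// roundTripped's "count" field is now Avro "long", so a GenericRecord built
// against pojoAvroSchema cannot be written with this schema.
```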
So to sum things up: IMO the `SchemaFactory` should have the ability to use more context/info than just a `TableSchema`. Given how it's being called at https://github.com/apache/beam/blob/40838f76447a5250b52645210f26dce3655d7009/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/RowWriterFactory.java#L143-L145, using the `destination` might already be helpful.
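As a sketch of that proposal (a hypothetical API, not anything that exists in Beam today), the factory could additionally receive the destination:

```java
import java.io.Serializable;
import javax.annotation.Nullable;
import com.google.api.services.bigquery.model.TableSchema;

// Hypothetical shape for an extended schema factory; names are illustrative.
public interface AvroSchemaFactory<DestinationT> extends Serializable {
  org.apache.avro.Schema getSchema(DestinationT destination, @Nullable TableSchema tableSchema);
}

// RowWriterFactory could then call something like
//   schemaFactory.getSchema(destination, dynamicDestinations.getSchema(destination))
// so a user-supplied factory can recover the original Avro schema from the
// destination instead of lossily re-deriving it from the TableSchema.
```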
Code to reproduce:
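(The reproduction snippet itself isn't preserved above; a minimal sketch along the lines of the report, with a placeholder table name and field, might look like this:)

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.Row;

public class AvroSchemaMismatchRepro {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Beam schema with a 32-bit int field, standing in for the POJO.
    Schema beamSchema = Schema.builder().addInt32Field("count").build();

    p.apply(Create.of(Row.withSchema(beamSchema).addValue(42).build())
            .withRowSchema(beamSchema))
        .apply(
            BigQueryIO.<Row>write()
                .to("project:dataset.table")
                .withSchema(BigQueryUtils.toTableSchema(beamSchema))
                .withAvroFormatFunction(
                    request ->
                        // The record is built against the POJO's "int" schema,
                        // but the file is written with the factory-derived
                        // "long" schema -> fails during writing.
                        AvroUtils.toGenericRecord(
                            request.getElement(), AvroUtils.toAvroSchema(beamSchema))));

    p.run().waitUntilFinish();
  }
}
```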
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components