GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

AvroUtils fails converting Avro message with ENUM to TableRow #485

Closed gadaldo closed 7 years ago

gadaldo commented 7 years ago

The method AvroUtils.convertGenericRecordToTableRow(GenericRecord record, TableSchema schema) does not support Avro messages whose schema contains an enumeration.

schema:

{
  "name": "Root",
  "type": "record",
  "fields": [
    {
      "name": "suit",
      "type": {
        "name": "Suit",
        "type": "enum",
        "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]
      }
    }
  ]
}

message: {"suit" : "SPADES"}

table schema:

final TableSchema tableSchema = new TableSchema();
final List<TableFieldSchema> fields = new ArrayList<TableFieldSchema>();
fields.add(new TableFieldSchema().setName("suit").setType("STRING").setMode("REQUIRED"));
tableSchema.setFields(fields);

error:

Expected Avro schema type STRING, not ENUM, for BigQuery STRING field suit

error if the table schema field does not have "REQUIRED" mode:

Expected Avro schema type UNION, not ENUM, for BigQuery NULLABLE field suit

when executing:

final JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, avroData);
final GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
final GenericRecord record = reader.read(null, decoder);

final TableRow row = AvroUtils.convertGenericRecordToTableRow(record, tableSchema);

The problem is that AvroUtils does not handle enums at all.
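Until the SDK handles enums, one workaround is to stringify enum values before building the row, since Avro decodes an enum field as a GenericEnumSymbol whose toString() returns the symbol name (e.g. "SPADES"). A minimal, dependency-free sketch of that idea (the `toRowValue` and `isEnumLike` helpers, and the plain Map standing in for TableRow, are assumptions for illustration, not the SDK's API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EnumFieldWorkaround {

    // Stringify values that carry their payload in toString():
    // Avro hands back Utf8 (a CharSequence) for strings and a
    // GenericEnumSymbol for enums. Converting both to java.lang.String
    // before the row is built sidesteps the
    // "Expected Avro schema type STRING, not ENUM" check.
    static Object toRowValue(Object avroValue) {
        if (avroValue == null) {
            return null;
        }
        if (avroValue instanceof CharSequence || isEnumLike(avroValue)) {
            return avroValue.toString();
        }
        return avroValue; // pass other types through unchanged
    }

    // Placeholder for `value instanceof GenericEnumSymbol`, so this
    // sketch compiles without the Avro jar on the classpath.
    static boolean isEnumLike(Object value) {
        return value.getClass().getSimpleName().contains("EnumSymbol");
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        // In real code this would be record.get("suit"), a GenericEnumSymbol;
        // here a plain String stands in for the decoded value.
        Object suit = "SPADES";
        row.put("suit", toRowValue(suit));
        System.out.println(row); // prints {suit=SPADES}
    }
}
```

With this preprocessing, the resulting values satisfy the STRING field check that AvroUtils performs for the BigQuery schema above.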

dhalperi commented 7 years ago

Cc: @peihe @davorbonaci

Hmm. ENUM is not a type defined in either the legacy (https://cloud.google.com/bigquery/data-types) or standard SQL (https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types) type systems. Can you point to documentation?

Note that that function is used only for converting Avro files produced by BigQuery.

gadaldo commented 7 years ago

Hi,

I see your point about the documentation and the fact that the function is used only for converting Avro files produced by BigQuery (though I don't see how useful that is, then).

However, if ENUM is allowed in the Avro schema, then, in theory, I'd expect to be able to transform the value from a GenericRecord enum to a String in the TableRow object.

This documentation maps Avro ENUM to BigQuery STRING, so I thought the conversion was possible and we introduced ENUM in our schemas; the fix in the algorithm is quite simple.

Is there any chance to have this feature in the future? Or do I have to rely on my own implementation to cope with this problem?
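The "quite simple" fix amounts to treating ENUM like STRING when mapping Avro types to BigQuery types, as the linked documentation does. A dependency-free sketch of such a mapping (the method name and the string-based type names are illustrative assumptions, not the SDK's actual code):

```java
import java.util.Objects;

public class AvroToBigQueryType {

    // Illustrative mapping of Avro schema type names to BigQuery
    // legacy type names. The one-line fix for this issue is the
    // ENUM case falling through to STRING, since an Avro enum
    // symbol's string form is simply its name.
    static String bigQueryType(String avroType) {
        switch (Objects.requireNonNull(avroType)) {
            case "STRING":
            case "ENUM":    // enum symbols serialize as their name
                return "STRING";
            case "INT":
            case "LONG":
                return "INTEGER";
            case "FLOAT":
            case "DOUBLE":
                return "FLOAT";
            case "BOOLEAN":
                return "BOOLEAN";
            case "BYTES":
                return "BYTES";
            default:
                throw new IllegalArgumentException(
                    "Unsupported Avro type: " + avroType);
        }
    }
}
```

In the SDK itself the equivalent change would live where AvroUtils validates the Avro field type against the BigQuery field type, accepting ENUM wherever STRING is accepted.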

dhalperi commented 7 years ago

Hi @gadaldo -- Sorry for the delay. Yes, please implement your own support for this issue, as we are unable to guarantee a perfect translation from any Avro schema into a BigQuery schema. (Or file a feature request with BigQuery to make such a library available.)

gadaldo commented 7 years ago

Hi @dhalperi -- I already wrote a small library to transform in each direction: BigQuery schema to Avro and vice versa. The repo is here; hopefully it is clear and does not contain too many bugs. It's not perfect, meaning it is a non-customisable 1-to-1 transformation. Feel free to spread the word and give me feedback. Giuseppe