GoogleCloudPlatform / market-data-transcoder

ffmpeg for market data
Apache License 2.0

Schemas with repeating values at different levels in the message graph are incompatible with Avro #73

Open salsferrazza opened 1 year ago

salsferrazza commented 1 year ago

When running with the following set of options:

wget -q -O - ftp://ftp.cmegroup.com/SBEFix/Production/secdef.dat.gz | gunzip - | txcode \
  --schema_file ~/src/datacast/transcoder/test/FIX50SP2.CME.xml \
  --factory fix \
  --source_file_format_type line_delimited \
  --message_type_inclusions=SecurityDefinition \
  --fix_header_tags 8,9,35,1128,49,56,34,52,10 \
  --destination_project_id $(gcloud config get-value project) \
  --output_type pubsub \
  --output_encoding json \
  --lazy_create_resources

The following error is thrown:

google.api_core.exceptions.InvalidArgument: 400 AVRO schema definition is not valid: sbeMessage.NoLotTypeRules exists twice in schema. [detail: "[ORIGINAL ERROR] generic::invalid_argument: AVRO schema definition is not valid: sbeMessage.NoLotTypeRules exists twice in schema. [google.rpc.error_details_ext] { message: \"AVRO schema definition is not valid: sbeMessage.NoLotTypeRules exists twice in schema.\" }"

This is likely because of the complex schema defined in FIX50SP2.CME.xml, where the same field name may be present at different levels of the entity hierarchy. A graph such as Object.Property1 and Object.Property2.Property1 appears to be incompatible with Avro, but is commonly encountered within legitimate FIX schema definitions.
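To make the collision concrete, here is a minimal sketch in plain Python (standard library only) that walks an Avro-style schema dict and reports record full names defined more than once. The schema mirrors the Object.Property1 / Object.Property2.Property1 shape described above; the `sbeMessage` namespace is taken from the error message, while the `value` field and the `record_fullnames` helper are hypothetical. It is not the transcoder's own schema generation code, and it ignores enums, fixed types, maps, and dotted names for brevity.

```python
from collections import Counter


def record_fullnames(node, namespace=""):
    """Yield the full name of every record defined in an Avro-style schema dict.

    Simplified: a nested record without its own "namespace" inherits the
    namespace of the enclosing record, as in the Avro specification.
    """
    if isinstance(node, list):  # union: visit each branch
        for branch in node:
            yield from record_fullnames(branch, namespace)
    elif isinstance(node, dict):
        if node.get("type") == "record":
            namespace = node.get("namespace", namespace)
            name = node["name"]
            yield f"{namespace}.{name}" if namespace else name
            for field in node.get("fields", []):
                yield from record_fullnames(field["type"], namespace)
        elif node.get("type") == "array":
            yield from record_fullnames(node["items"], namespace)


def property1():
    # Illustrative repeating-group record; "value" is a made-up field.
    return {
        "type": "array",
        "items": {
            "type": "record",
            "name": "Property1",
            "fields": [{"name": "value", "type": "string"}],
        },
    }


schema = {
    "type": "record",
    "name": "Object",
    "namespace": "sbeMessage",
    "fields": [
        {"name": "Property1", "type": property1()},
        {
            "name": "Property2",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "Property2",
                    "fields": [{"name": "Property1", "type": property1()}],
                },
            },
        },
    ],
}

counts = Counter(record_fullnames(schema))
duplicates = [name for name, count in counts.items() if count > 1]
print(duplicates)
```

Both Property1 records inherit the `sbeMessage` namespace, so they resolve to the same full name, which is the condition the Pub/Sub error reports for NoLotTypeRules.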

Notably, the BigQuery output types do not exhibit this behavior, but the fastavro output type, which writes to local POSIX files, fails in the same way:

fastavro._schema_common.SchemaParseException: redefined named type: sbeMessage.NoLotTypeRules

salsferrazza commented 1 year ago

https://stackoverflow.com/a/48131460

It's not well documented, but Avro allows you to reference previously defined names by using the full namespace for the name that is being referenced. In your case, the following code would result in only one class being generated, referenced by each array. It also DRYs up the schema nicely.

A namespace can be applied at each level of nested record, so that a name defined in one tier no longer collides with the same name defined in another tier.
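The per-tier namespace idea might be sketched as follows, again in plain Python with only the standard library. Each record is given a namespace built from its parent's full name, so the two Property1 definitions end up with different full names. The helper names (`add_tier_namespaces`, `record_fullnames`) and the `value` field are hypothetical, and this is one possible approach rather than the fix adopted by the transcoder; the linked answer's alternative of defining the record once and referencing it by full name is not shown here.

```python
import copy


def add_tier_namespaces(schema, namespace):
    """Return a copy of `schema` in which every record carries a namespace
    derived from the full name of the record that encloses it."""
    schema = copy.deepcopy(schema)
    _apply(schema, namespace)
    return schema


def _apply(node, namespace):
    if isinstance(node, list):  # union: visit each branch
        for branch in node:
            _apply(branch, namespace)
    elif isinstance(node, dict):
        if node.get("type") == "record":
            node["namespace"] = namespace
            # Children live one tier deeper: <namespace>.<this record's name>
            child_namespace = f"{namespace}.{node['name']}"
            for field in node.get("fields", []):
                _apply(field["type"], child_namespace)
        elif node.get("type") == "array":
            _apply(node["items"], namespace)


def record_fullnames(node):
    """Yield full names of all records, assuming explicit namespaces."""
    if isinstance(node, list):
        for branch in node:
            yield from record_fullnames(branch)
    elif isinstance(node, dict):
        if node.get("type") == "record":
            yield f"{node['namespace']}.{node['name']}"
            for field in node.get("fields", []):
                yield from record_fullnames(field["type"])
        elif node.get("type") == "array":
            yield from record_fullnames(node["items"])


def group(name, fields):
    # Illustrative repeating group: an array of records named `name`.
    return {"type": "array", "items": {"type": "record", "name": name, "fields": fields}}


leaf = [{"name": "value", "type": "string"}]  # made-up field
schema = {
    "type": "record",
    "name": "Object",
    "fields": [
        {"name": "Property1", "type": group("Property1", leaf)},
        {"name": "Property2", "type": group("Property2", [
            {"name": "Property1", "type": group("Property1", leaf)},
        ])},
    ],
}

namespaced = add_tier_namespaces(schema, "sbeMessage")
names = list(record_fullnames(namespaced))
print(names)
```

With this layout the top-level group resolves to `sbeMessage.Object.Property1` and the nested one to `sbeMessage.Object.Property2.Property1`, so no full name is defined twice. Whether downstream consumers accept the longer names would still need to be checked against the actual Pub/Sub and fastavro outputs.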