Closed smorgan19 closed 1 year ago
Perhaps references to the schema definition would be helpful, as this is a meta-model mapping exercise. It seems the schema is expressed in JSON (not JSON Schema). Are you willing to help define this? It will take a lot of time and effort.
We were using Hadoop for a while, then moved away from it. So we have very limited interest. We have much more interest in expanding / completing OpenAPI capabilities as a priority, as we have not completed the features that had been defined by the API Work Group.
On Tue, May 23, 2023 at 12:03 PM smorgan19 wrote:
Avro is used in various processes for data serialization. It has rich data structures, is compact and fast, and is commonly used with Kafka, Hadoop, AWS, and more. Avro data serialization requires a schema in the AVSC format, which is written in JSON but has its own data structure.
For storing data in data lakes or lakehouses, or for tools like Kafka or Hadoop, big data formats such as Avro, ORC, and Parquet are typically recommended.
ORC stands for Optimized Row Columnar. It is a columnar file format, and each file is divided into a header, a body, and a footer.
Avro is an open source object container file format. Unlike the other two formats, it uses row-based storage. Avro stores the data definition in JSON, so it can be easily read and interpreted. It uses JSON to define data types and protocols, and it serializes the data itself in a compact binary format, making for efficient, resource-sparing storage.
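To make the "compact binary format" point concrete, here is a minimal, stdlib-only sketch of how Avro's binary encoding lays out one small record. It hand-rolls the zig-zag variable-length integers, length-prefixed strings, and union branch index described in the Avro specification rather than calling the avro library. The helper names (`encode_long`, `encode_string`, `encode_party_id`) and the two-field record are assumptions made for illustration only.

```python
import json
from typing import Optional


def encode_long(n: int) -> bytes:
    """Avro int/long: zig-zag, then variable-length (7 bits per byte, low bits first)."""
    z = (n << 1) ^ (n >> 63)  # zig-zag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # more bytes follow: set the continuation bit
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_party_id(content: str, type_code: Optional[str]) -> bytes:
    """Record {content: string, typeCode: ["null", "string"]}, fields in schema order."""
    out = encode_string(content)
    if type_code is None:
        out += encode_long(0)  # union branch 0 is "null", which has no body
    else:
        out += encode_long(1) + encode_string(type_code)  # union branch 1 is "string"
    return out


record = {"content": "John Doe", "typeCode": "Driver"}
binary = encode_party_id(record["content"], record["typeCode"])
as_json = json.dumps(record).encode("utf-8")
print(len(binary), "bytes as Avro binary,", len(as_json), "bytes as JSON text")
```

Field names never appear in the binary output, which is why a reader needs the schema to decode the data and why the schema definition matters so much for this format.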
Parquet is a columnar data storage format that supports complex nested data structures in a flat columnar format. Parquet is well suited to services like Amazon Athena and Amazon Redshift Spectrum, which are serverless, interactive technologies.
Of the three big data formats, Avro stands out for the following reasons:
Schema Based Format:
Ability to transform avro data to ORC and PARQUET formats
Industry Usage:
From a developer perspective, Avro has Maven plugins and other resources that make it easier to develop with, and it allows transformations into the other big data formats such as ORC and Parquet.
There are limited tool options available to convert from XSD or JSON Schema to AVSC schema format. Those that are available are either outdated or not maintained.
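As a rough illustration of what such a converter has to do, the sketch below maps a very small subset of JSON Schema (objects, arrays, and primitive types) to Avro schema dictionaries. It is a hypothetical helper written for this discussion, not the Score implementation or any existing tool. The primitive mapping (for example `integer` to `long`), the function name `to_avro`, and the parent-prefixed record names are all assumptions, and features such as `$ref`, `oneOf`, and formats are ignored.

```python
import json

# Assumed primitive mapping; JSON Schema "integer" and "number" have no fixed width.
PRIMITIVES = {"string": "string", "integer": "long", "number": "double",
              "boolean": "boolean", "null": "null"}


def to_avro(js: dict, name: str):
    """Convert a tiny subset of JSON Schema to an Avro schema (dict or str)."""
    kind = js.get("type")
    if kind in PRIMITIVES:
        return PRIMITIVES[kind]
    if kind == "array":
        return {"type": "array", "items": to_avro(js["items"], name)}
    if kind == "object":
        required = set(js.get("required", []))
        fields = []
        for prop, sub in js.get("properties", {}).items():
            # Prefix child record names with the parent name to keep them unique.
            avro_type = to_avro(sub, name + prop)
            if prop in required:
                fields.append({"name": prop, "type": avro_type})
            else:
                # Optional property: a union with null.
                fields.append({"name": prop, "type": ["null", avro_type]})
        return {"type": "record", "name": name, "fields": fields}
    raise ValueError(f"unsupported JSON Schema type: {kind!r}")


json_schema = {
    "type": "object",
    "required": ["content"],
    "properties": {"content": {"type": "string"}, "typeCode": {"type": "string"}},
}
avsc = to_avro(json_schema, "PartyID")
print(json.dumps(avsc, indent=2))
```

Even this toy version has to make decisions that JSON Schema leaves open, such as numeric widths and record naming, which is part of why a maintained converter is hard to find.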
Resources:
https://bryteflow.com/how-to-choose-between-parquet-orc-and-avro/
https://avro.apache.org/docs/1.11.1/specification/
https://avro.apache.org/docs/1.11.1/getting-started-java/
https://data-flair.training/blogs/avro-uses/
https://www.upsolver.com/blog/the-file-format-fundamentals-of-big-data
https://docs.oracle.com/cd/E26161_02/html/GettingStartedGuide/avroschemas.html#:~:text=Avro%20is%20used%20to%20define,Database%20record%20using%20Avro%20bindings.
https://blog.knoldus.com/all-you-need-to-know-about-avro-schema/
https://www.confluent.io/blog/avro-kafka-data/
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#supported-formats
From a format comparison by data type:
@smorgan19 I wonder how you're dealing with a namespace property, a naming convention, an optional field declaration, nesting schemas, etc. Is there an example set of schema/instance?
@hakjuoh
{
  "namespace": "openapplications.org",
  "type": "record",
  "name": "GetPartyMaster",
  "fields": [
    {
      "name": "ApplicationArea",
      "type": {
        "type": "record",
        "name": "ApplicationArea",
        "fields": [
          { "name": "CreationDateTime", "type": [ "null", "string" ] }
        ]
      }
    },
    {
      "name": "DataArea",
      "type": {
        "type": "record",
        "name": "DataArea",
        "fields": [
          {
            "name": "Get",
            "type": [ "null", {
              "type": "record",
              "name": "Get",
              "fields": [
                { "name": "GetUniqueIndicator", "type": [ "null", "int" ] }
              ]
            } ]
          },
          {
            "name": "PartyMaster",
            "type": {
              "type": "array",
              "name": "PartyMaster",
              "items": {
                "type": "record",
                "name": "PartyMaster",
                "fields": [
                  {
                    "name": "LastModificationDateTime",
                    "type": [ {
                      "type": "record",
                      "name": "PartyMasterLastModificationDateTime",
                      "fields": [
                        { "name": "content", "type": [ "null", "DateTime" ] },
                        { "name": "typeCode", "type": [ "null", "string" ] }
                      ]
                    }, "null" ]
                  },
                  {
                    "name": "Party",
                    "type": [ {
                      "type": "array",
                      "name": "Party",
                      "items": [ {
                        "name": "PartyMasterParty",
                        "type": "record",
                        "fields": [
                          { "name": "typeCode", "type": [ "string", "null" ] },
                          {
                            "name": "ID",
                            "type": [ "null", {
                              "type": "array",
                              "name": "ID",
                              "items": {
                                "name": "PartyMasterPartyID",
                                "type": "record",
                                "fields": [
                                  { "name": "typeCode", "type": [ "string", "null" ] },
                                  { "name": "content", "type": [ "string", "null" ] }
                                ]
                              }
                            } ]
                          },
                          {
                            "name": "Contact",
                            "type": [ "null", {
                              "type": "array",
                              "name": "Contact",
                              "items": {
                                "name": "PartyMasterPartyContact",
                                "type": "record",
                                "fields": [
                                  { "name": "typeCode", "type": [ "string", "null" ] },
                                  {
                                    "name": "PersonName",
                                    "type": [ "null", {
                                      "type": "array",
                                      "name": "PersonName",
                                      "items": {
                                        "name": "PartyMasterPartyContactPersonName",
                                        "type": "record",
                                        "fields": [
                                          { "name": "typeCode", "type": [ "string", "null" ] },
                                          {
                                            "name": "FormattedName",
                                            "type": [ {
                                              "name": "PartyMasterPartyContactPersonNameFormattedName",
                                              "type": "record",
                                              "fields": [
                                                { "name": "typeCode", "type": [ "string", "null" ] },
                                                { "name": "content", "type": [ "string", "null" ] }
                                              ]
                                            }, "null" ]
                                          }
                                        ]
                                      }
                                    } ]
                                  }
                                ]
                              }
                            } ]
                          }
                        ]
                      } ]
                    }, "null" ]
                  }
                ]
              }
            }
          }
        ]
      }
    }
  ]
}
From my understanding, only records can't share the same name, so you could just use the full XPath or a shortened XPath as the record name. Optional fields would be a union of the type (string, int, etc.) and null.
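A minimal sketch of the optional-field convention just described, written as plain Python dictionaries. The helper names `optional_field` and `required_field` are hypothetical. The sketch lists "null" first and adds a null default: per the Avro 1.11.1 specification, the default value of a union field must match the first branch of the union, so a null default only works when "null" is in the first position.

```python
import json


def required_field(name: str, avro_type) -> dict:
    """Field that must always carry a value of the given type."""
    return {"name": name, "type": avro_type}


def optional_field(name: str, avro_type) -> dict:
    """Field that may be absent: union of null and the real type, defaulting to null."""
    return {"name": name, "type": ["null", avro_type], "default": None}


party_id = {
    "type": "record",
    "name": "PartyID",
    "fields": [
        required_field("content", "string"),
        optional_field("typeCode", "string"),
    ],
}
print(json.dumps(party_id, indent=2))
```

Note that the example schema above mixes both orders (`[ "null", "string" ]` and `[ "string", "null" ]`). Both are valid unions, but picking one order consistently keeps defaults predictable.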
@smorgan19 This is an example BIE used for testing
and the generated AVRO expression file, 'GetPartyMaster.avsc'
{
"namespace" : "org.openapplications",
"type" : "record",
"name" : "GetPartyMaster",
"fields" : [ {
"type" : "string",
"name" : "releaseID"
}, {
"name" : "ApplicationArea",
"type" : {
"type" : "record",
"name" : "ApplicationArea",
"fields" : [ {
"type" : "string",
"name" : "CreationDateTime"
} ]
}
}, {
"name" : "DataArea",
"type" : {
"type" : "record",
"name" : "DataArea",
"fields" : [ {
"name" : "Get",
"type" : {
"type" : "record",
"name" : "Get",
"fields" : [ {
"name" : "Expression",
"type" : {
"type" : "array",
"name" : "Expression",
"items" : {
"type" : "string",
"name" : "Expression"
}
}
}, {
"type" : [ "null", "boolean" ],
"name" : "uniqueIndicator"
} ]
}
}, {
"name" : "PartyMaster",
"type" : {
"type" : "array",
"name" : "PartyMaster",
"items" : {
"type" : "record",
"name" : "PartyMaster",
"fields" : [ {
"name" : "FinancialParty",
"type" : [ "null", {
"type" : "record",
"name" : "FinancialParty",
"fields" : [ {
"name" : "ID",
"type" : [ "null", {
"type" : "array",
"name" : "ID",
"items" : {
"type" : "record",
"name" : "FinancialPartyID",
"fields" : [ {
"type" : "string",
"name" : "content"
}, {
"type" : [ "null", "string" ],
"name" : "typeCode"
} ]
}
} ]
}, {
"name" : "Contact",
"type" : [ "null", {
"type" : "array",
"name" : "Contact",
"items" : {
"type" : "record",
"name" : "FinancialPartyContact",
"fields" : [ {
"type" : [ "null", "string" ],
"name" : "typeCode"
}, {
"name" : "ID",
"type" : [ "null", {
"type" : "array",
"name" : "ID",
"items" : {
"type" : "record",
"name" : "ContactID",
"fields" : [ {
"type" : "string",
"name" : "content"
}, {
"type" : [ "null", "string" ],
"name" : "typeCode"
} ]
}
} ]
}, {
"name" : "PersonName",
"type" : [ "null", {
"type" : "array",
"name" : "PersonName",
"items" : {
"type" : "record",
"name" : "FinancialPartyContactPersonName",
"fields" : [ {
"type" : [ "null", "string" ],
"name" : "typeCode"
}, {
"name" : "FormattedName",
"type" : [ "null", {
"type" : "record",
"name" : "FinancialPartyContactPersonNameFormattedName",
"fields" : [ {
"type" : "string",
"name" : "content"
}, {
"type" : [ "null", "string" ],
"name" : "typeCode"
} ]
} ]
} ]
}
} ]
} ]
}
} ]
} ]
} ]
}, {
"name" : "LastModificationDateTime",
"type" : [ "null", {
"type" : "record",
"name" : "LastModificationDateTime",
"fields" : [ {
"type" : "string",
"name" : "content"
}, {
"type" : [ "null", "string" ],
"name" : "typeCode"
} ]
} ]
}, {
"name" : "Party",
"type" : [ "null", {
"type" : "array",
"name" : "Party",
"items" : {
"type" : "record",
"name" : "Party",
"fields" : [ {
"type" : [ "null", "string" ],
"name" : "typeCode"
}, {
"name" : "ID",
"type" : [ "null", {
"type" : "array",
"name" : "ID",
"items" : {
"type" : "record",
"name" : "PartyID",
"fields" : [ {
"type" : "string",
"name" : "content"
}, {
"type" : [ "null", "string" ],
"name" : "typeCode"
} ]
}
} ]
}, {
"name" : "Contact",
"type" : [ "null", {
"type" : "array",
"name" : "Contact",
"items" : {
"type" : "record",
"name" : "PartyContact",
"fields" : [ {
"type" : [ "null", "string" ],
"name" : "typeCode"
}, {
"name" : "PersonName",
"type" : [ "null", {
"type" : "array",
"name" : "PersonName",
"items" : {
"type" : "record",
"name" : "PartyContactPersonName",
"fields" : [ {
"type" : [ "null", "string" ],
"name" : "typeCode"
}, {
"name" : "FormattedName",
"type" : [ "null", {
"type" : "record",
"name" : "PartyContactPersonNameFormattedName",
"fields" : [ {
"type" : "string",
"name" : "content"
}, {
"type" : [ "null", "string" ],
"name" : "typeCode"
} ]
} ]
} ]
}
} ]
} ]
}
} ]
} ]
}
} ]
} ]
}
}
} ]
}
} ]
}
and the Java source files generated by avro-maven-plugin:
generate-sources.zip
Please review this and let me know if you find any issues.
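One review step that can be automated is checking that no two named types in the schema share a name, since Avro parsers generally refuse to define the same full name twice. The following is a small stdlib-only sketch; the function names are hypothetical, and it assumes every name lives in a single namespace, as in the schema above. The inline test schema is made up to trigger a clash.

```python
import json
from collections import Counter

NAMED_TYPES = {"record", "enum", "fixed"}  # the Avro types that carry a name


def named_types(schema) -> list:
    """Collect the names of all record/enum/fixed definitions in a parsed schema."""
    found = []
    if isinstance(schema, list):  # union: visit each branch
        for branch in schema:
            found += named_types(branch)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind in NAMED_TYPES:
            found.append(schema["name"])
        for field in schema.get("fields", []):
            found += named_types(field["type"])
        if kind == "array":
            found += named_types(schema["items"])
        elif kind == "map":
            found += named_types(schema["values"])
        elif isinstance(kind, (dict, list)):  # {"type": {...}} wrapper
            found += named_types(kind)
    return found  # plain strings such as "string" contribute nothing


def duplicate_names(schema) -> list:
    """Names that are defined more than once, sorted alphabetically."""
    return sorted(n for n, count in Counter(named_types(schema)).items() if count > 1)


avsc = json.loads("""
{"type": "record", "name": "Party", "fields": [
  {"name": "ID", "type": ["null", {"type": "array", "items":
    {"type": "record", "name": "ID", "fields": [{"name": "content", "type": "string"}]}}]},
  {"name": "Contact", "type": {"type": "record", "name": "Contact", "fields": [
    {"name": "ID", "type": {"type": "record", "name": "ID", "fields": [
      {"name": "content", "type": "string"}]}}]}}
]}
""")
print(duplicate_names(avsc))
```

Here the two `ID` records clash, which is the situation the parent-prefixed names such as `PartyID` and `ContactID` in the generated schema avoid.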
@hakjuoh, I should be able to review everything on Monday
@hakjuoh it looks good. I generated a sample as well. {"releaseID": "0.1", "ApplicationArea": {"CreationDateTime": "6-19-2023"}, "DataArea": {"Get": {"Expression": ["TestExpression"], "uniqueIndicator": null}, "PartyMaster": [{"FinancialParty": null, "LastModificationDateTime": null, "Party": [{"typeCode": "YellowCar", "ID": [{"content": "John Doe", "typeCode": "Driver"}, {"content": "Jane Doe", "typeCode": "PassengerOne"}], "Contact": null}, {"typeCode": null, "ID": [{"content": "Jimmy John", "typeCode": "Driver"}, {"content": "James John", "typeCode": "PassengerOne"}, {"content": "Carter John", "typeCode": "PassengerOne"}], "Contact": [{"typeCode": "DriverContact", "PersonName": [{"typeCode": null, "FormattedName": {"content": "James John", "typeCode": "CarContactPerson"}}]}]}]}]}}
@smorgan19 Thanks! I validated the sample using the avro Python package and found no errors.
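import json
import avro.schema
from avro.io import validate

# Assumes avsc holds the schema above and sample holds the instance above, both as JSON text.
schema = avro.schema.parse(avsc)
print(validate(schema, json.loads(sample)))  # True when the datum conforms to the schema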
@smorgan19 The Avro schema names certain records nested under a parent component inconsistently, which causes incompatibility or mapping problems when this Avro schema based format is used alongside the XSD or JSON Schema based formats. An example is 'PartyContactPersonNameFormattedName' instead of just 'FormattedName'.
@smorgan19 I changed the logic to use the full path, and it works well. The record names are pretty lengthy, though.
{
"namespace" : "org.openapplications",
"type" : "record",
"name" : "GetPartyMaster",
"fields" : [ {
"type" : "string",
"name" : "releaseID"
}, {
"name" : "ApplicationArea",
"type" : {
"type" : "record",
"name" : "GetPartyMasterApplicationArea",
"fields" : [ {
"type" : "string",
"name" : "CreationDateTime"
} ]
}
}, {
"name" : "DataArea",
"type" : {
"type" : "record",
"name" : "GetPartyMasterDataArea",
"fields" : [ {
"name" : "Get",
"type" : {
"type" : "record",
"name" : "GetPartyMasterDataAreaGet",
"fields" : [ {
"name" : "Expression",
"type" : {
"type" : "array",
"name" : "Expression",
"items" : {
"type" : "string",
"name" : "Expression"
}
}
}, {
"type" : [ "null", "boolean" ],
"name" : "uniqueIndicator"
} ]
}
}, {
"name" : "PartyMaster",
"type" : {
"type" : "array",
"name" : "PartyMaster",
"items" : {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMaster",
"fields" : [ {
"name" : "FinancialParty",
"type" : [ "null", {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMasterFinancialParty",
"fields" : [ {
"name" : "ID",
"type" : [ "null", {
"type" : "array",
"name" : "ID",
"items" : {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMasterFinancialPartyID",
"fields" : [ {
"type" : "string",
"name" : "content"
}, {
"type" : [ "null", "string" ],
"name" : "typeCode"
} ]
}
} ]
}, {
"name" : "Contact",
"type" : [ "null", {
"type" : "array",
"name" : "Contact",
"items" : {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMasterFinancialPartyContact",
"fields" : [ {
"type" : [ "null", "string" ],
"name" : "typeCode"
}, {
"name" : "PersonName",
"type" : [ "null", {
"type" : "array",
"name" : "PersonName",
"items" : {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMasterFinancialPartyContactPersonName",
"fields" : [ {
"type" : [ "null", "string" ],
"name" : "typeCode"
}, {
"name" : "FormattedName",
"type" : [ "null", {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMasterFinancialPartyContactPersonNameFormattedName",
"fields" : [ {
"type" : "string",
"name" : "content"
}, {
"type" : [ "null", "string" ],
"name" : "typeCode"
} ]
} ]
} ]
}
} ]
} ]
}
} ]
} ]
} ]
}, {
"type" : [ "null", "string" ],
"name" : "LastModificationDateTime"
}, {
"name" : "Party",
"type" : [ "null", {
"type" : "array",
"name" : "Party",
"items" : {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMasterParty",
"fields" : [ {
"type" : [ "null", "string" ],
"name" : "typeCode"
}, {
"name" : "ID",
"type" : [ "null", {
"type" : "array",
"name" : "ID",
"items" : {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMasterPartyID",
"fields" : [ {
"type" : "string",
"name" : "content"
}, {
"type" : [ "null", "string" ],
"name" : "typeCode"
} ]
}
} ]
}, {
"name" : "Contact",
"type" : [ "null", {
"type" : "array",
"name" : "Contact",
"items" : {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMasterPartyContact",
"fields" : [ {
"type" : [ "null", "string" ],
"name" : "typeCode"
}, {
"name" : "PersonName",
"type" : [ "null", {
"type" : "array",
"name" : "PersonName",
"items" : {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMasterPartyContactPersonName",
"fields" : [ {
"type" : [ "null", "string" ],
"name" : "typeCode"
}, {
"name" : "FormattedName",
"type" : [ "null", {
"type" : "record",
"name" : "GetPartyMasterDataAreaPartyMasterPartyContactPersonNameFormattedName",
"fields" : [ {
"type" : "string",
"name" : "content"
}, {
"type" : [ "null", "string" ],
"name" : "typeCode"
} ]
} ]
} ]
}
} ]
} ]
}
} ]
} ]
}
} ]
} ]
}
}
} ]
}
} ]
}
You can test this function on test.oagiscore.net.
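The full-path naming used in the schema above can be sketched in a few lines. This only illustrates the idea and is not the Score implementation; `record_name` and `short_name` are hypothetical helpers. The second function shows one way to recover the short element name when mapping back to XSD or JSON Schema, assuming the parent's record name is known.

```python
def record_name(path: list) -> str:
    """Concatenate every element name from the root down to this component."""
    return "".join(path)


def short_name(full_name: str, parent_full_name: str) -> str:
    """Strip the parent's full-path prefix to get the local element name."""
    if not full_name.startswith(parent_full_name):
        raise ValueError(f"{full_name!r} is not nested under {parent_full_name!r}")
    return full_name[len(parent_full_name):]


path = ["GetPartyMaster", "DataArea", "PartyMaster", "Party",
        "Contact", "PersonName", "FormattedName"]
full = record_name(path)
parent = record_name(path[:-1])
print(full)
print(short_name(full, parent))
```

In the generated schema the field name still carries the short name (for example `FormattedName`), so the long names only affect record type names and, with avro-maven-plugin, the generated Java class names.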