OAGi / Score

Addition of AVSC/AVRO as an Expressable Schema Format #1500

Closed smorgan19 closed 1 year ago

smorgan19 commented 1 year ago

Avro is used in various processes for data serialization. It has rich data structures, is compact and fast, and is commonly used with Kafka, Hadoop, AWS, and more. Avro data serialization requires a schema in the AVSC format, which is written in JSON but uses different data types.

dubnemo commented 1 year ago

Perhaps references to the schema definition would be helpful, as this is a meta-model mapping exercise. It seems the schema is expressed in JSON (not JSON Schema). Are you willing to help define this? It will take a lot of time and effort.

We were using Hadoop for a while, then moved away from it. So we have very limited interest. We have much more interest in expanding / completing OpenAPI capabilities as a priority, as we have not completed the features that had been defined by the API Work Group.

smorgan19 commented 1 year ago

For storing data in data lakes or lakehouses, or for certain tools like Kafka or Hadoop, big data formats such as Avro, ORC, and Parquet are typically recommended.

ORC stands for Optimized Row Columnar. It is a columnar file format divided into a header, a body, and a footer.

Avro is an open source object container file format. Unlike the other two formats, it features row-based storage. Avro stores its data definition in JSON, so it can be easily read and interpreted. It uses JSON to define data types and protocols, and serializes the data in a compact binary format, making for efficient, resource-sparing storage.

Parquet is a columnar data storage format that supports complex nested data structures in a flat columnar format. Parquet works well with services like AWS Athena and Amazon Redshift Spectrum, which are serverless, interactive technologies.
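To make the "compact binary format" point about Avro concrete, here is a minimal, hand-rolled sketch of how the Avro specification encodes longs, strings, and a ["null", "string"] union. It is for illustration only and is not the avro library; the function names are made up for this example, and the record shape (a required `content` and an optional `typeCode`) is an assumption borrowed from the ID records discussed later in this thread.

```python
import json


def encode_long(n: int) -> bytes:
    """Encode an int/long the way Avro does: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)           # zig-zag maps small negatives to small positives
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_optional_string(s) -> bytes:
    """Encode a ["null", "string"] union: branch index, then the value."""
    if s is None:
        return encode_long(0)          # branch 0 is "null", which has no body
    return encode_long(1) + encode_string(s)


# Record fields are written in schema order, with no field names in the output.
record = {"content": "John Doe", "typeCode": "Driver"}
binary = encode_string(record["content"]) + encode_optional_string(record["typeCode"])
as_json = json.dumps(record).encode("utf-8")

print(len(binary), "bytes in Avro binary vs", len(as_json), "bytes as JSON text")
```

Because field names live only in the schema, the binary form carries just lengths, union branch indexes, and values, which is where most of the size saving comes from.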

Of the three big data formats, Avro stands out for the following reasons:

Schema Based Format:

Ability to transform Avro data to ORC and Parquet formats

Industry Usage:

From a developer perspective, Avro has Maven plugins and other resources that make it easier to develop with, and it allows for transformations into the other big data formats like ORC and Parquet.

There are limited tool options available to convert from XSD or JSON Schema to AVSC schema format. Those that are available are either outdated or not maintained.

Resources:
https://bryteflow.com/how-to-choose-between-parquet-orc-and-avro/
https://avro.apache.org/docs/1.11.1/specification/
https://avro.apache.org/docs/1.11.1/getting-started-java/
https://data-flair.training/blogs/avro-uses/
https://www.upsolver.com/blog/the-file-format-fundamentals-of-big-data
https://docs.oracle.com/cd/E26161_02/html/GettingStartedGuide/avroschemas.html#:~:text=Avro%20is%20used%20to%20define,Database%20record%20using%20Avro%20bindings.
https://blog.knoldus.com/all-you-need-to-know-about-avro-schema/
https://www.confluent.io/blog/avro-kafka-data/
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#supported-formats

smorgan19 commented 1 year ago

Format comparison by data type: [image]

hakjuoh commented 1 year ago

@smorgan19 I wonder how you're dealing with a namespace property, a naming convention, an optional field declaration, nesting schemas, etc. Is there an example set of schema/instance?

smorgan19 commented 1 year ago

@hakjuoh { "namespace": "openapplications.org", "type": "record", "name": "GetPartyMaster", "fields": [ { "name": "ApplicationArea", "type": { "type": "record", "name": "ApplicationArea", "fields": [ { "name": "CreationDateTime", "type": [ "null", "string" ] } ] } }, { "name": "DataArea", "type": { "type": "record", "name": "DataArea", "fields": [ { "name": "Get", "type": [ "null", { "type": "record", "name": "Get", "fields": [ { "name": "GetUniqueIndicator", "type": [ "null", "int" ] } ] } ] }, { "name": "PartyMaster", "type": { "type": "array", "name": "PartyMaster", "items": { "type": "record", "name": "PartyMaster", "fields": [ { "name": "LastModificationDateTime", "type": [ { "type": "record", "name": "PartyMasterLastModificationDateTime", "fields": [ { "name": "content", "type": [ "null", "DateTime" ] }, { "name": "typeCode", "type": [ "null", "string" ] } ] }, "null" ] }, { "name": "Party", "type": [ { "type": "array", "name": "Party", "items": [ { "name": "PartyMasterParty", "type": "record", "fields": [ { "name": "typeCode", "type": [ "string", "null" ] }, { "name": "ID", "type": [ "null", { "type": "array", "name": "ID", "items": { "name": "PartyMasterPartyID", "type": "record", "fields": [ { "name": "typeCode", "type": [ "string", "null" ] }, { "name": "content", "type": [ "string", "null" ] } ] } } ] }, { "name": "Contact", "type": [ "null", { "type": "array", "name": "Contact", "items": { "name": "PartyMasterPartyContact", "type": "record", "fields": [ { "name": "typeCode", "type": [ "string", "null" ] }, { "name": "PersonName", "type": [ "null", { "type": "array", "name": "PersonName", "items": { "name": "PartyMasterPartyContactPersonName", "type": "record", "fields": [ { "name": "typeCode", "type": [ "string", "null" ] }, { "name": "FormattedName", "type": [ { "name": "PartyMasterPartyContactPersonNameFormattedName", "type": "record", "fields": [ { "name": "typeCode", "type": [ "string", "null" ] }, { "name": "content", "type": [ "string", "null" ] } 
] }, "null" ] } ] } } ] } ] } } ] } ] } ] }, "null" ] } ] } } } ] } } ] }

From my understanding, only records cannot share the same name, so you could use the full XPath or a shortened XPath as the record name. An optional field would be declared as a union of its type (string, int, etc.) and null.
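The optional-field convention described above can be sketched as follows. This is only an illustration, not Score's generator: the helper names `optional` and `field` are hypothetical, and the `PartyID` record is an assumed example. The sketch puts "null" first in the union and gives the field a null default, since the Avro specification ties a union field's default to the first branch.

```python
import json


def optional(avro_type):
    """Return a union that lets a field hold either null or avro_type."""
    return ["null", avro_type]


def field(name, avro_type, required=True):
    """Build an Avro field declaration; optional fields default to null."""
    if required:
        return {"name": name, "type": avro_type}
    return {"name": name, "type": optional(avro_type), "default": None}


# Assumed example record: a required value plus an optional attribute.
schema = {
    "namespace": "org.openapplications",
    "type": "record",
    "name": "PartyID",
    "fields": [
        field("content", "string"),
        field("typeCode", "string", required=False),
    ],
}

print(json.dumps(schema, indent=2))
```

With a default present, a writer may omit `typeCode` entirely; without it, the field must still be supplied, even if its value is null.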

hakjuoh commented 1 year ago

@smorgan19 This is an example BIE used for testing:

[Screenshot 2023-06-14 at 2:34:30 PM]

and the generated Avro expression file, 'GetPartyMaster.avsc':

{
  "namespace" : "org.openapplications",
  "type" : "record",
  "name" : "GetPartyMaster",
  "fields" : [ {
    "type" : "string",
    "name" : "releaseID"
  }, {
    "name" : "ApplicationArea",
    "type" : {
      "type" : "record",
      "name" : "ApplicationArea",
      "fields" : [ {
        "type" : "string",
        "name" : "CreationDateTime"
      } ]
    }
  }, {
    "name" : "DataArea",
    "type" : {
      "type" : "record",
      "name" : "DataArea",
      "fields" : [ {
        "name" : "Get",
        "type" : {
          "type" : "record",
          "name" : "Get",
          "fields" : [ {
            "name" : "Expression",
            "type" : {
              "type" : "array",
              "name" : "Expression",
              "items" : {
                "type" : "string",
                "name" : "Expression"
              }
            }
          }, {
            "type" : [ "null", "boolean" ],
            "name" : "uniqueIndicator"
          } ]
        }
      }, {
        "name" : "PartyMaster",
        "type" : {
          "type" : "array",
          "name" : "PartyMaster",
          "items" : {
            "type" : "record",
            "name" : "PartyMaster",
            "fields" : [ {
              "name" : "FinancialParty",
              "type" : [ "null", {
                "type" : "record",
                "name" : "FinancialParty",
                "fields" : [ {
                  "name" : "ID",
                  "type" : [ "null", {
                    "type" : "array",
                    "name" : "ID",
                    "items" : {
                      "type" : "record",
                      "name" : "FinancialPartyID",
                      "fields" : [ {
                        "type" : "string",
                        "name" : "content"
                      }, {
                        "type" : [ "null", "string" ],
                        "name" : "typeCode"
                      } ]
                    }
                  } ]
                }, {
                  "name" : "Contact",
                  "type" : [ "null", {
                    "type" : "array",
                    "name" : "Contact",
                    "items" : {
                      "type" : "record",
                      "name" : "FinancialPartyContact",
                      "fields" : [ {
                        "type" : [ "null", "string" ],
                        "name" : "typeCode"
                      }, {
                        "name" : "ID",
                        "type" : [ "null", {
                          "type" : "array",
                          "name" : "ID",
                          "items" : {
                            "type" : "record",
                            "name" : "ContactID",
                            "fields" : [ {
                              "type" : "string",
                              "name" : "content"
                            }, {
                              "type" : [ "null", "string" ],
                              "name" : "typeCode"
                            } ]
                          }
                        } ]
                      }, {
                        "name" : "PersonName",
                        "type" : [ "null", {
                          "type" : "array",
                          "name" : "PersonName",
                          "items" : {
                            "type" : "record",
                            "name" : "FinancialPartyContactPersonName",
                            "fields" : [ {
                              "type" : [ "null", "string" ],
                              "name" : "typeCode"
                            }, {
                              "name" : "FormattedName",
                              "type" : [ "null", {
                                "type" : "record",
                                "name" : "FinancialPartyContactPersonNameFormattedName",
                                "fields" : [ {
                                  "type" : "string",
                                  "name" : "content"
                                }, {
                                  "type" : [ "null", "string" ],
                                  "name" : "typeCode"
                                } ]
                              } ]
                            } ]
                          }
                        } ]
                      } ]
                    }
                  } ]
                } ]
              } ]
            }, {
              "name" : "LastModificationDateTime",
              "type" : [ "null", {
                "type" : "record",
                "name" : "LastModificationDateTime",
                "fields" : [ {
                  "type" : "string",
                  "name" : "content"
                }, {
                  "type" : [ "null", "string" ],
                  "name" : "typeCode"
                } ]
              } ]
            }, {
              "name" : "Party",
              "type" : [ "null", {
                "type" : "array",
                "name" : "Party",
                "items" : {
                  "type" : "record",
                  "name" : "Party",
                  "fields" : [ {
                    "type" : [ "null", "string" ],
                    "name" : "typeCode"
                  }, {
                    "name" : "ID",
                    "type" : [ "null", {
                      "type" : "array",
                      "name" : "ID",
                      "items" : {
                        "type" : "record",
                        "name" : "PartyID",
                        "fields" : [ {
                          "type" : "string",
                          "name" : "content"
                        }, {
                          "type" : [ "null", "string" ],
                          "name" : "typeCode"
                        } ]
                      }
                    } ]
                  }, {
                    "name" : "Contact",
                    "type" : [ "null", {
                      "type" : "array",
                      "name" : "Contact",
                      "items" : {
                        "type" : "record",
                        "name" : "PartyContact",
                        "fields" : [ {
                          "type" : [ "null", "string" ],
                          "name" : "typeCode"
                        }, {
                          "name" : "PersonName",
                          "type" : [ "null", {
                            "type" : "array",
                            "name" : "PersonName",
                            "items" : {
                              "type" : "record",
                              "name" : "PartyContactPersonName",
                              "fields" : [ {
                                "type" : [ "null", "string" ],
                                "name" : "typeCode"
                              }, {
                                "name" : "FormattedName",
                                "type" : [ "null", {
                                  "type" : "record",
                                  "name" : "PartyContactPersonNameFormattedName",
                                  "fields" : [ {
                                    "type" : "string",
                                    "name" : "content"
                                  }, {
                                    "type" : [ "null", "string" ],
                                    "name" : "typeCode"
                                  } ]
                                } ]
                              } ]
                            }
                          } ]
                        } ]
                      }
                    } ]
                  } ]
                }
              } ]
            } ]
          }
        }
      } ]
    }
  } ]
}

and the Java source files generated by avro-maven-plugin: generate-sources.zip

Please review this and let me know if you find any issues.

smorgan19 commented 1 year ago

@hakjuoh, I should be able to review everything on Monday

smorgan19 commented 1 year ago

@hakjuoh it looks good. I generated a sample as well.

{"releaseID": "0.1", "ApplicationArea": {"CreationDateTime": "6-19-2023"}, "DataArea": {"Get": {"Expression": ["TestExpression"], "uniqueIndicator": null}, "PartyMaster": [{"FinancialParty": null, "LastModificationDateTime": null, "Party": [{"typeCode": "YellowCar", "ID": [{"content": "John Doe", "typeCode": "Driver"}, {"content": "Jane Doe", "typeCode": "PassengerOne"}], "Contact": null}, {"typeCode": null, "ID": [{"content": "Jimmy John", "typeCode": "Driver"}, {"content": "James John", "typeCode": "PassengerOne"}, {"content": "Carter John", "typeCode": "PassengerOne"}], "Contact": [{"typeCode": "DriverContact", "PersonName": [{"typeCode": null, "FormattedName": {"content": "James John", "typeCode": "CarContactPerson"}}]}]}]}]}}

hakjuoh commented 1 year ago

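@smorgan19 Thanks! I validated the sample with the avro Python package and found no errors.

import avro.schema
from avro.io import validate

# avsc: the schema as a JSON string (the contents of GetPartyMaster.avsc)
# sample: the sample instance above, parsed into a Python dict
schema = avro.schema.parse(avsc)
validate(schema, sample)  # returns True when the sample conforms to the schema

joshklm commented 1 year ago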

@smorgan19 The Avro schema names certain records inconsistently when they are nested under a parent component. This creates inconsistency, incompatibility, or mapping problems when the Avro schema format is used alongside the XSD or JSON Schema formats. For example, the record is named 'PartyContactPersonNameFormattedName' instead of just 'FormattedName'.
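For context on why the prefixes appear at all: if every record simply took its field name, components reused under different parents (ID, Contact, PersonName, FormattedName) would share a record name, which Avro does not allow within one namespace. A rough sketch of a checker for that, assuming the schema has been loaded into Python dicts and lists; the function names and the tiny schema are hypothetical and not part of Score.

```python
from collections import Counter


def record_names(schema):
    """Yield the name of every record definition nested in an Avro schema."""
    if isinstance(schema, list):               # union: visit each branch
        for branch in schema:
            yield from record_names(branch)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            yield schema["name"]
            for f in schema.get("fields", []):
                yield from record_names(f["type"])
        elif kind == "array":
            yield from record_names(schema["items"])
        elif kind == "map":
            yield from record_names(schema["values"])
    # Plain strings are primitive types or references and define nothing.


def duplicate_record_names(schema):
    """Return record names that are defined more than once, sorted."""
    counts = Counter(record_names(schema))
    return sorted(name for name, count in counts.items() if count > 1)


# Assumed miniature schema: two parents that each define a record called "ID".
id_record = {"type": "record", "name": "ID",
             "fields": [{"name": "content", "type": "string"}]}
naive = {
    "type": "record",
    "name": "PartyMaster",
    "fields": [
        {"name": "FinancialParty", "type": ["null", {
            "type": "record", "name": "FinancialParty",
            "fields": [{"name": "ID", "type": {"type": "array", "items": id_record}}]}]},
        {"name": "Party", "type": ["null", {
            "type": "record", "name": "Party",
            "fields": [{"name": "ID", "type": {"type": "array", "items": id_record}}]}]},
    ],
}

print(duplicate_record_names(naive))
```

Field names are unaffected by this constraint, so a mapping to XSD or JSON Schema could key on the field name ('FormattedName') and treat the record name as an Avro-only detail.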

hakjuoh commented 1 year ago

@smorgan19 I changed the logic to use the full path, and it works well. The record names are pretty lengthy, though.

{
  "namespace" : "org.openapplications",
  "type" : "record",
  "name" : "GetPartyMaster",
  "fields" : [ {
    "type" : "string",
    "name" : "releaseID"
  }, {
    "name" : "ApplicationArea",
    "type" : {
      "type" : "record",
      "name" : "GetPartyMasterApplicationArea",
      "fields" : [ {
        "type" : "string",
        "name" : "CreationDateTime"
      } ]
    }
  }, {
    "name" : "DataArea",
    "type" : {
      "type" : "record",
      "name" : "GetPartyMasterDataArea",
      "fields" : [ {
        "name" : "Get",
        "type" : {
          "type" : "record",
          "name" : "GetPartyMasterDataAreaGet",
          "fields" : [ {
            "name" : "Expression",
            "type" : {
              "type" : "array",
              "name" : "Expression",
              "items" : {
                "type" : "string",
                "name" : "Expression"
              }
            }
          }, {
            "type" : [ "null", "boolean" ],
            "name" : "uniqueIndicator"
          } ]
        }
      }, {
        "name" : "PartyMaster",
        "type" : {
          "type" : "array",
          "name" : "PartyMaster",
          "items" : {
            "type" : "record",
            "name" : "GetPartyMasterDataAreaPartyMaster",
            "fields" : [ {
              "name" : "FinancialParty",
              "type" : [ "null", {
                "type" : "record",
                "name" : "GetPartyMasterDataAreaPartyMasterFinancialParty",
                "fields" : [ {
                  "name" : "ID",
                  "type" : [ "null", {
                    "type" : "array",
                    "name" : "ID",
                    "items" : {
                      "type" : "record",
                      "name" : "GetPartyMasterDataAreaPartyMasterFinancialPartyID",
                      "fields" : [ {
                        "type" : "string",
                        "name" : "content"
                      }, {
                        "type" : [ "null", "string" ],
                        "name" : "typeCode"
                      } ]
                    }
                  } ]
                }, {
                  "name" : "Contact",
                  "type" : [ "null", {
                    "type" : "array",
                    "name" : "Contact",
                    "items" : {
                      "type" : "record",
                      "name" : "GetPartyMasterDataAreaPartyMasterFinancialPartyContact",
                      "fields" : [ {
                        "type" : [ "null", "string" ],
                        "name" : "typeCode"
                      }, {
                        "name" : "PersonName",
                        "type" : [ "null", {
                          "type" : "array",
                          "name" : "PersonName",
                          "items" : {
                            "type" : "record",
                            "name" : "GetPartyMasterDataAreaPartyMasterFinancialPartyContactPersonName",
                            "fields" : [ {
                              "type" : [ "null", "string" ],
                              "name" : "typeCode"
                            }, {
                              "name" : "FormattedName",
                              "type" : [ "null", {
                                "type" : "record",
                                "name" : "GetPartyMasterDataAreaPartyMasterFinancialPartyContactPersonNameFormattedName",
                                "fields" : [ {
                                  "type" : "string",
                                  "name" : "content"
                                }, {
                                  "type" : [ "null", "string" ],
                                  "name" : "typeCode"
                                } ]
                              } ]
                            } ]
                          }
                        } ]
                      } ]
                    }
                  } ]
                } ]
              } ]
            }, {
              "type" : [ "null", "string" ],
              "name" : "LastModificationDateTime"
            }, {
              "name" : "Party",
              "type" : [ "null", {
                "type" : "array",
                "name" : "Party",
                "items" : {
                  "type" : "record",
                  "name" : "GetPartyMasterDataAreaPartyMasterParty",
                  "fields" : [ {
                    "type" : [ "null", "string" ],
                    "name" : "typeCode"
                  }, {
                    "name" : "ID",
                    "type" : [ "null", {
                      "type" : "array",
                      "name" : "ID",
                      "items" : {
                        "type" : "record",
                        "name" : "GetPartyMasterDataAreaPartyMasterPartyID",
                        "fields" : [ {
                          "type" : "string",
                          "name" : "content"
                        }, {
                          "type" : [ "null", "string" ],
                          "name" : "typeCode"
                        } ]
                      }
                    } ]
                  }, {
                    "name" : "Contact",
                    "type" : [ "null", {
                      "type" : "array",
                      "name" : "Contact",
                      "items" : {
                        "type" : "record",
                        "name" : "GetPartyMasterDataAreaPartyMasterPartyContact",
                        "fields" : [ {
                          "type" : [ "null", "string" ],
                          "name" : "typeCode"
                        }, {
                          "name" : "PersonName",
                          "type" : [ "null", {
                            "type" : "array",
                            "name" : "PersonName",
                            "items" : {
                              "type" : "record",
                              "name" : "GetPartyMasterDataAreaPartyMasterPartyContactPersonName",
                              "fields" : [ {
                                "type" : [ "null", "string" ],
                                "name" : "typeCode"
                              }, {
                                "name" : "FormattedName",
                                "type" : [ "null", {
                                  "type" : "record",
                                  "name" : "GetPartyMasterDataAreaPartyMasterPartyContactPersonNameFormattedName",
                                  "fields" : [ {
                                    "type" : "string",
                                    "name" : "content"
                                  }, {
                                    "type" : [ "null", "string" ],
                                    "name" : "typeCode"
                                  } ]
                                } ]
                              } ]
                            }
                          } ]
                        } ]
                      }
                    } ]
                  } ]
                }
              } ]
            } ]
          }
        }
      } ]
    }
  } ]
}
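The full-path naming in the schema above can be approximated with a short sketch. This is not Score's actual implementation, just one way the rule "record name = parent record name + field name" could be applied to a schema held as Python dicts and lists; the function name `qualify` and the cut-down input schema are assumptions for illustration.

```python
def qualify(schema, full_name):
    """Return a copy of schema whose record names are the concatenated field path."""
    if isinstance(schema, list):               # union: same path for every branch
        return [qualify(branch, full_name) for branch in schema]
    if not isinstance(schema, dict):
        return schema                          # primitive type name such as "string"
    out = dict(schema)
    if schema.get("type") == "record":
        out["name"] = full_name
        out["fields"] = [
            {**f, "type": qualify(f["type"], full_name + f["name"])}
            for f in schema["fields"]
        ]
    elif schema.get("type") == "array":
        out["items"] = qualify(schema["items"], full_name)
    return out


# Assumed, cut-down input that uses short record names.
short = {
    "type": "record",
    "name": "GetPartyMaster",
    "fields": [{
        "name": "DataArea",
        "type": {
            "type": "record",
            "name": "DataArea",
            "fields": [{
                "name": "Get",
                "type": ["null", {"type": "record", "name": "Get", "fields": []}],
            }],
        },
    }],
}

qualified = qualify(short, short["name"])
data_area = qualified["fields"][0]["type"]
print(data_area["name"])
print(data_area["fields"][0]["type"][1]["name"])
```

Only record names change; field names stay short, so instance documents look the same under either naming scheme.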

You can test this function on test.oagiscore.net.