data-solution-automation-engine / data-warehouse-automation-metadata-schema

Generic interface exchange format for Data Warehouse Automation and ETL generation.
GNU Lesser General Public License v3.0
38 stars 11 forks source link

Handling more complex statements than direct 1-to-1 field mappings #8

Closed ReinierKop closed 4 years ago

ReinierKop commented 4 years ago

Great initiative to have this! I'm looking into metadata specifications to use to power our ETL generation and source to target mappings and I'm happy to follow conventions such as the one you've created.

Most examples in this repo are understandably relatively simple and only speak of field-to-field mappings (e.g. sourceTable.sourceColumn mapping to targetTable.targetcolumn). This will often be enough, but lots of our ETL code (and I'm sure we're not unique in this regard) is much more complex.

Some specific examples of functionality I am talking about: group by, distinct, aggregations/subqueries/subselects/joins/ctes.

Are we meant to just put the complex query in the sourceDataObject as the schema implies here or do other methods of specifying this exist (or are perhaps planned)?

Just one example adapted from our ETL code, which I tried to map to the schema below:

Our ETL with a join:

insert into target_table
select
    source_table_1.id,
    source_table_2.description
from source_table_1
inner join source_table_2
    on source_table_1.some_id = source_table_2.some_id

Metadata specification how I understand it right now:

{
  "dataObjectMappingList": [
    {
      "mappingName": "mapping_with_join",
      "sourceDataObject": {
        "name": "select source_table_1.id, source_table_2.description from source_table_1 inner join source_table_2 on source_table_1.some_id = source_table_2.some_id"
      },
      "targetDataObject": {
        "name": "target_table"
      },
      "dataItemMapping": [
        {
          "sourceDataItem": {
            "name": "id"
          },
          "targetDataItem": {
            "name": "id"
          }
        },
        {
          "sourceDataItem": {
            "name": "description"
          },
          "targetDataItem": {
            "name": "description"
          }
        }
      ]
    }
  ]
}

I realize it probably wouldn't be easy nor desirable to extend the schema to support all SQL, but I'm just wondering whether I'm on the right track or not.

RoelantVos commented 4 years ago

Hi Reinier,

Thanks for this, you are definitely on the right track. In my view at the moment there are two options for this;

There specifically is an option to put a query in the 'source' (as you point out) to avoid having to write a full domain specific language to suit all transformation use-cases. To avoid rebuilding SQL (again as you mention) the OneOf 'query' or 'item' is possible both at DataObject as well as DataItem level. This makes it possible to put a query in the source entirely or parts as mappings to items and build a transformation this way.

How this is used in code / templating engines provides another layer of options of course, as this can define how things are used in practice also - and your example of using the query in the name would work too this way.

The Related Data Object is an array of objects you can use in the templating engines or upstream code. I use this for lookups and joins a lot - for example here: https://github.com/RoelantVos/Virtual-Data-Warehouse/blob/master/LoadPatterns/loadPatternStagingView.Handlebars.

This would look like: ], "sourceDataObject": { "name": "source.A" }, "targetDataObject": { "name": "landing.A", "dataObjectConnection": { "dataConnectionString": "PersistentStagingDatabase" } }, "relatedDataObject": [ { "name": "psa.A", "dataObjectClassification": [ { "classification": "Lookup", "notes": "Lookup table related to the source-to-target mapping" } ] } ],

I am collecting feedback from various projects into an updated version, which I'll work on next week when I have a week off. I'm creating issues for this now. I will add some more complex examples as well.

Hope this makes sense, and thank you for your feedback!

Kind regards - Roelant

ReinierKop commented 4 years ago

Thanks a lot for the quick response! This definitely helped me create a better understanding of the possibilities within the configuration. I'll keep an eye on the repo and meanwhile start fiddling around with this

RoelantVos commented 4 years ago

Hi Reinier, FYI just in case I made some modifications to the schema and also added an example for the above.

ReinierKop commented 4 years ago

Hi Roelant, thanks a lot! I'll check it out :) If I've any more remarks or questions I'll let you know 👍