Insight-Services-APAC / APAC-Capability-DAI-DbtFabricSparkNb

MIT License

Generate schema.yml when generating metaextracts #31

Closed jp-vanheerden closed 1 month ago

jp-vanheerden commented 2 months ago

It would be useful to create the 'source' schema.yml file when creating metaextracts.

KRodriguez-Insight commented 1 month ago

Update: As part of the metadata extract, logic has been added to generate a schema.yml file for all the tables in each lakehouse.

Example: There are four lakehouses in the workspace, and one yml file is generated per lakehouse, as shown below.


Inside each yml file, basic information (column names) is stored; later this can be extended to include test cases and other details for the dbt model.

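For reference, a schema.yml template of the kind described (a dbt v2 schema file with placeholder descriptions) would look roughly like the following; the table and column names here are illustrative, not taken from the actual lakehouses:

```yaml
version: 2

models:
  - name: dim_customer
    description: add table description
    columns:
      - name: customer_id
        description: add column description
      - name: customer_name
        description: add column description
```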

KRodriguez-Insight commented 1 month ago

Modified the scripts to include a description (placeholder), as shown below.

The templates are written to the SchemaTemplate folder and partitioned by lakehouse/database.
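A minimal local sketch of that partitioned layout (the function name and sample lakehouse names are assumptions; in the notebook itself the write goes through `mssparkutils.fs.put` rather than the local filesystem):

```python
import tempfile
from pathlib import Path

import yaml


def write_schema_templates(schemas_by_lakehouse, root):
    """Write one schema.yml per lakehouse under root/<lakehouse>/schema.yml.

    Partitioning by lakehouse keeps same-named tables in different
    lakehouses from overwriting each other's template file.
    """
    written = []
    for lakehouse, doc in schemas_by_lakehouse.items():
        path = Path(root) / lakehouse / "schema.yml"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(yaml.dump(doc, sort_keys=False))
        written.append(path)
    return written


# Two lakehouses, each containing a table with the same name.
schemas = {
    "Sales": {"version": 2, "models": [{"name": "orders", "columns": []}]},
    "Finance": {"version": 2, "models": [{"name": "orders", "columns": []}]},
}
paths = write_schema_templates(schemas, tempfile.mkdtemp())
```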

annstephyjose commented 1 month ago

Tested and working fine.

jrampono commented 1 month ago

Hi @KRodriguez-Insight and @insightgrantkrieger, good work on this, but I've had to reopen the issue as I found a problem in one of my environments where non-JSON-safe characters in column names or table names cause a failure.

I took a look at the code and I'm a little worried about custom-building the json/yml object via string manipulation; I think it's a bit brittle. Since we have already extracted all of the schema information in the "Describe Table Extended" cell, I think we can achieve the same result more efficiently by injecting the code below into a new cell after we have created "union_dtable". This would also remove the need for the additional tmp views that we have created. Could you guys please take a look?

import yaml

# Group the "Describe Table Extended" rows by table, keeping only rows
# that carry a column name.
tables = {}
for item in union_dtable.collect():
    table_name = item['tableName']
    col_name = item['col_name']
    if col_name:  # only add rows that describe a named column
        tables.setdefault(table_name, []).append({
            "name": col_name,
            "description": "add column description"
        })

# Build the dbt schema document: version 2 with one model per table.
models = [
    {
        "name": table_name,
        "description": "add table description",
        "columns": columns
    }
    for table_name, columns in tables.items()
]
output_data = {"version": 2, "models": models}

# yaml.dump handles quoting and escaping, so non-JSON-safe characters in
# table or column names are emitted safely.
yaml_string = yaml.dump(output_data, default_flow_style=False)

RelativePathForSchemaTemplate = "Files/SchemaTemplate"
file_path = f"{RelativePathForSchemaTemplate}/schema.yml"
mssparkutils.fs.put(file_path, yaml_string, True)  # type: ignore

THE CODE ABOVE WOULD GO IN A NEW CELL AFTER THIS ONE:

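To illustrate why `yaml.dump` is safer than string concatenation: it quotes and escapes names that would break a hand-assembled yml file. The sample column names below are made up for the demonstration:

```python
import yaml

# Column names with characters that are unsafe in hand-built YAML strings:
# '#' starts a comment, ':' introduces a mapping, and quotes nest.
columns = [
    {"name": "Amount %", "description": "add column description"},
    {"name": 'Note # "draft"', "description": "add column description"},
    {"name": "ratio: a/b", "description": "add column description"},
]
doc = {"version": 2,
       "models": [{"name": "tricky_table", "columns": columns}]}

yaml_string = yaml.dump(doc, default_flow_style=False, sort_keys=False)
print(yaml_string)

# Round trip: parsing the emitted YAML recovers the original structure,
# which naive string concatenation would not guarantee.
assert yaml.safe_load(yaml_string) == doc
```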

KRodriguez-Insight commented 1 month ago

@jrampono @insightgrantkrieger Thank you, John. I have updated the notebook logic and tested locally.

One observation on Describe Table Extended: we noticed that describe extended adds some additional details to the yaml file, so I have preserved the plain describe table. Please let us know if you have any concerns / feedback.

Additional properties in describe extended:

KRodriguez-Insight commented 1 month ago

cc: @insightgrantkrieger @annstephyjose

  1. The schema.yml template was generated successfully.
  2. However, the model file did not accept '%' but accepted all other characters (to create a separate ticket to track / document best practice).
  3. When tables with the same name exist in different lakehouses within a workspace, the yaml generated for that table is overwritten (understood that dbt does not support duplicate table names within a project; to be documented as best practice).
  4. When a name contains a '#' symbol, the SQL generated as a Python string in the notebook failed; such identifiers are to be enclosed in backticks (to create a separate ticket).
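For item 4, one defensive option is to backtick-quote identifiers before interpolating them into SQL text. Spark SQL delimits identifiers with backticks and escapes a literal backtick by doubling it; `quote_identifier` here is a hypothetical helper, not part of the notebook:

```python
def quote_identifier(name: str) -> str:
    # Spark SQL delimits identifiers with backticks; a backtick inside
    # the name is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"

table = "odd#name"
sql = f"DESCRIBE TABLE EXTENDED {quote_identifier(table)}"
print(sql)  # DESCRIBE TABLE EXTENDED `odd#name`
```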