Azure-Samples / modern-data-warehouse-dataops

DataOps for Microsoft Data Platform technologies. https://aka.ms/dataops-repo
MIT License

Agree on the Lakehouse setup for Parking sensors data #840

Open sreedhar-guda opened 1 week ago

sreedhar-guda commented 1 week ago

Work with the team to agree on the asset organization for the Parking Sensors lakehouse.

Env Questions:

Configurations and data files:

[Sreedhar] What is the git location for the config files (storing DDLs, process names, etc.)? I provided a template which needs to be substituted with actual values before copying to Fabric. The files and their target locations will be:

  • reference data (dim_date.csv and dim_time.csv) --> target should be Lakehouse Files/data
  • config files in git under config/parking_sensors (parking_sensors_ddls.yaml and parking_sensors.cfg) --> target should be Lakehouse Files/config
  • parking_sensors.cfg is derived from parking_sensors.cfg.template for each stage/env after substituting with appropriate values - such as workspace id belonging to that particular stage. Need to cross-check the substitution strings and format with Anuj and Naga.
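The per-stage substitution step could be sketched with Python's `string.Template`. The placeholder names (`$workspace_id`, `$lakehouse_id`) and config keys below are assumptions for illustration, pending the cross-check on the substitution strings and format:

```python
from string import Template

# Hypothetical sketch of rendering parking_sensors.cfg from
# parking_sensors.cfg.template for one stage/env. The placeholder
# names and keys are assumptions, not the agreed format.
TEMPLATE = Template(
    "[fabric]\n"
    "workspace_id = $workspace_id\n"
    "lakehouse_id = $lakehouse_id\n"
)

def render_config(stage_values: dict) -> str:
    """Render a stage-specific config from the template.

    substitute() raises KeyError for any missing placeholder,
    surfacing incomplete stage configuration early.
    """
    return TEMPLATE.substitute(stage_values)

print(render_config({"workspace_id": "dev-workspace-guid",
                     "lakehouse_id": "dev-lakehouse-guid"}))
```

Using `substitute()` rather than `safe_substitute()` means a missing value fails loudly instead of shipping a config with an unresolved placeholder.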

Workspace:

Can we not have special characters in the name?

[Sreedhar] See the comment section for details.

Notebook Linting:

[Sreedhar]: nit - noticed that all imports are moved to the very first cell by the linting process. If users rely on a %%configure or parameters cell being the very first cell (perhaps to drive conditional imports), this could have unintended consequences.
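To illustrate the concern, here is a minimal sketch of a conditional import whose behavior depends on a value set in an earlier (parameters) cell; the parameter name and serializer choice are invented for illustration. A lint rule that hoists all imports above the parameters cell would change which module gets loaded:

```python
# Parameters cell (would be the notebook's first cell, e.g. set by %%configure
# or a pipeline parameter). The name 'fast_mode' is a hypothetical example.
fast_mode = True

# Later cell: the import depends on the parameter above, so it cannot be
# safely hoisted to the first cell without breaking the logic.
if fast_mode:
    import json as serializer    # lightweight text serialization
else:
    import pickle as serializer  # binary fallback; both expose dumps()

payload = serializer.dumps({"sensor": "parking"})
print(payload)
```

If the linter moves both imports into the first cell, the conditional selection still works only if the `if`/`else` travels with them; hoisting the `import` lines alone would pin the choice before `fast_mode` is set.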

DoD: Agree on:

Related to #846

sreedhar-guda commented 6 days ago

Suggestion: have a naming convention in place and document it in the README.md.

Here are the assets I am looking for from a setup perspective:

devlace commented 2 days ago

@sreedhar-guda I believe most of these already are set in the current TF deployment scripts: https://github.com/Azure-Samples/modern-data-warehouse-dataops/blob/feat/e2e-fabric-dataops-sample/e2e_samples/fabric_dataops_sample/infra/terraform/locals.tf

Only thing missing is the reference data. We can just call those dim_time and dim_date.

Let's park the utilities lakehouse for now and keep it simple with just the existing lakehouse. (single lakehouse)

sreedhar-guda commented 2 days ago

@promisinganuj @devlace @naga-nandyala

Currently, Spark queries will fail if the workspace name contains a special character when using fully qualified names (FQNs) for table references. See https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-schemas#public-preview-limitations for more details.

We are developing notebooks with FQNs, which include the workspace name, as a best practice. Our Terraform modules use "-" in the workspace name, which will result in errors when running queries that use FQNs.

Options to address this:

  1. Remove special characters from workspace name
  2. Update the notebooks to remove workspace name from FQN

Let me know if option 1 is something we are open to doing. If not, I will update the notebooks accordingly.
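A minimal sketch of the check behind option 1. The allowed-character pattern below is an assumption for illustration, not the official Fabric naming rule; the helper is hypothetical:

```python
import re

# Hypothetical guard: only allow characters that are safe in a fully
# qualified table name (workspace.lakehouse.schema.table). The pattern
# is an assumption, not Fabric's documented rule.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")

def build_fqn(workspace: str, lakehouse: str, schema: str, table: str) -> str:
    """Build an FQN, rejecting workspace names that break FQN queries."""
    if not SAFE_NAME.match(workspace):
        raise ValueError(
            f"workspace name {workspace!r} contains characters "
            "not supported in fully qualified table names"
        )
    return f"{workspace}.{lakehouse}.{schema}.{table}"

print(build_fqn("ws_fabric_dataops_dev", "parking_sensors", "dbo", "dim_date"))
# A hyphenated name such as 'ws-fabric-dataops-dev' would raise ValueError.
```

Running such a check in CI against the Terraform workspace names would catch the "-" issue before notebooks fail at query time.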