duxbuse / ninthage-data-analytics

A better way for the project to ingest tournament results
GNU General Public License v3.0
1 stars 0 forks source link

Automatic data quality testing #185

Open SanderND opened 1 year ago

SanderND commented 1 year ago

Create inbuilt tests on data on the back-end: fields cannot be null, must be unique, must be positive etc. Goal: ensure data quality already on the back-end.

Inspiration came from Snowflake and dbt which have this inbuilt. I think GCS will offer the same.

duxbuse commented 1 year ago

So issue here is that we decided that this was valid data. It is very easy to make certain fields no longer optional. But we wanted some partial data. For instance with the manual game report, we as a team decided that all fields should be optional

SanderND commented 1 year ago

I think what you mean is restrict the fields in the table itself. What Snowflake has is automatic tests performed on the tables. Thus not restricting the fields itself (as we do) but then checking if the data still makes sense. I really like that feature as it warns us for bad stuff before it reaches Tableau.

SanderND commented 1 year ago

Meanwhile I found out dbt (https://cloud.getdbt.com/), which i use professionally is available for free. Its open source and connects natively to GitHub for documentation. Amongst other dbt offers us a testing module where you can write SQL queries to test our data and report back on that.

I already requested a private dbt user. To connect to BiqQuery we must follow this: https://docs.getdbt.com/docs/get-started/getting-started/getting-set-up/setting-up-bigquery which says: "In order to let dbt connect to your warehouse, you'll need to generate a keyfile. This is analogous to using a database user name and password with most other data warehouses.

Go to the [BigQuery credential wizard](https://console.cloud.google.com/apis/credentials/wizard). Make sure your new project is selected in the header. If you do not see your account or project, click your profile picture to the right and verify your are using the correct email account.
Select + Create Credentials then select Service account.
Type dbt-user in the Service account name field, then click Create and Continue.
Type and select BigQuery Admin in the Role field.
Click Continue.
Leave fields blank in the "Grant users access to this service account" section and click Done.
Click the service account that you just created.
Select Keys.
Click Add Key then select Create new key.
Select JSON as the key type then click Create.
You should be prompted to download the [JSON](https://docs.getdbt.com/terms/json) file. Save it locally to an easy-to-remember spot, with a clear filename. For example, dbt-user-creds.json."

Can you help me with this @duxbuse