Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0

[FEATURE] Bigquery pseudo-column support #186

Open adammarples opened 1 year ago

adammarples commented 1 year ago

It would be good to be able to use the _FILE_NAME pseudo-column, which BigQuery uses to expose the source file name for each row of an external table.

Describe the solution you'd like

To do this, _FILE_NAME would need to be selected in the initial stage select CTE, because it is unavailable after that, i.e. once derived columns are added. Presumably the best way to do that is to create a macros/tables/bigquery/stage.sql file with an "external_table_filename" kwarg flag. Alternatively, could this be covered by a src_extra_columns variable? There are other pseudo-columns too, such as those in partitioned tables.

Describe alternatives you've considered

Additional context

Happy to work on this if it seems appropriate.

AB#5366

DVAlexHiggs commented 1 year ago

Hi! Thanks for this interesting feature request.

I have to admit I don't know much about this bit of BigQuery.

My question would be: is this something that should be dealt with in dbtvault, or is it in fact already built into dbt? The BigQuery dbt configs documentation does allude to pseudo-columns.

Otherwise, this should already be possible doing something like the following:


WITH stage AS (
   {{ dbtvault.stage() }} 
), 

pseudocolumns AS (
   SELECT *, _FILE_NAME AS FILE_NAME
   FROM stage
)

SELECT * FROM pseudocolumns

However, this is a little clunky and some kind of config variable as suggested would be a lot cleaner.

adammarples commented 1 year ago

Hi. The problem as I see it there (and I am not an expert on dbt-vault or dbt-bigquery or how they interact) is that _FILE_NAME can only be selected directly from the source table. If it is not included in the initial CTE, it will be unavailable in every later CTE. It also requires an alias, e.g. 'FILE_NAME'.
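One way to work with this constraint today, without changing the package, might be to expose the pseudo-column in a small upstream dbt model and stage from that model instead of from the external table. The following is only a sketch: the model name, source name, and table name are all hypothetical.

```sql
-- models/raw/raw_orders_with_file_name.sql (hypothetical model name)
-- Select the pseudo-column directly from the external table,
-- which is the only place it is visible.
SELECT
    *,
    _FILE_NAME AS FILE_NAME  -- the pseudo-column has to be aliased to be selected
FROM {{ source('raw', 'orders_external') }}  -- hypothetical source and table names
```

A staging model could then point dbtvault.stage() at this model through its source_model argument, so FILE_NAME arrives as an ordinary column. The cost is one extra model per external table, which is why a flag inside the stage macro would still be cleaner.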

dbt-vault provides a file called dbtvault/macros/staging/stage.sql which I believe is what dbt-vault runs when dbtvault.stage() is called, and the _FILE_NAME needs to be inserted right there at the source.

WITH source_data AS (

    SELECT

    {{- "\n\n    " ~ dbtvault.print_list(list_to_print=all_source_columns, columns_to_escape=columns_to_escape) if all_source_columns else " *" }}
  , _FILE_NAME AS FILE_NAME -- <- we need this line optionally
    FROM {{ source_relation }}
    {%- set last_cte = "source_data" %}
)
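A rough sketch of how the optional line could be gated, assuming a new boolean argument is added to the stage macro. The name external_table_filename is taken from the suggestion at the top of this issue; it is hypothetical and is not an existing dbtvault.stage() argument.

```sql
{#- Sketch only: external_table_filename is a hypothetical macro argument. -#}
WITH source_data AS (

    SELECT

    {{- "\n\n    " ~ dbtvault.print_list(list_to_print=all_source_columns, columns_to_escape=columns_to_escape) if all_source_columns else " *" }}
    {%- if external_table_filename %}
    , _FILE_NAME AS FILE_NAME  -- BigQuery pseudo-column, selected at the source and aliased
    {%- endif %}

    FROM {{ source_relation }}
    {%- set last_cte = "source_data" %}
)
```

With the flag unset, the rendered SQL would be the same as it is today, so non-BigQuery adapters and existing models should be unaffected.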

I can have a go at a PR

DVAlexHiggs commented 1 year ago

Ok that makes sense, thanks for explaining!

I will add this to our backlog :) Happy for you to do a PR, but at this time we cannot accept it directly: our test harness is not publicly available, and any change needs to be run through it before release. We hope to improve contribution guidelines and processes in future.