Include `data_type` in the output of `generate_model_yaml`

dbeatty10 commented 1 year ago

Describe the feature

When generating output, codegen should have an option to include data_type like the following:

    columns:
      - name: account_id
        data_type: string

Furthermore, this option should probably default to true. It possibly shouldn't even be optional, but just always be included in the generated output.

See here for more context: https://github.com/dbt-labs/internal-analytics/pull/1418#discussion_r1144991800

Describe alternatives you've considered

An alternative is to copy-paste data_type by hand into the YAML. These would be copied from an information_schema.columns query or show columns, etc.

Additional context

Side note to spin out into a separate issue: reconciling greenfield YAML from codegen with pre-existing YAML is annoying / time-consuming. Can we write some kind of guidance on easier ways to reconcile two YAML files?

Who will this benefit?

With the launch of data contracts in the dbt-core 1.5.0 release, it's handy to automatically generate the relevant data_types for pre-existing models.

Are you interested in contributing this feature?

Happy to review the PR if someone else wants to take this on.

linbug commented 1 year ago

@dbeatty10 would something like this be suitable? My concern now is that all tests that invoke this macro will need to include logic for all target types, since they use different data type names (e.g. text in postgres becomes string in bigquery). That might be cumbersome. Do you have any feedback/ suggestions?

dbeatty10 commented 1 year ago

That PR is looking good @linbug !

Absolutely agreed that the testing part looks cumbersome. #76 ran into something similar. We discussed and tried a couple options before settling on an approach.

There are a few options for your PR:

Adopt a similar testing approach as #76
Try some other testing approach
Skip tests altogether

Want to take a look at these options and let me know what you think?

You won't hear me say this too often, but I might be comfortable with skipping tests in this case if it has the least downsides of the options.

dbeatty10 commented 1 year ago

Always-on vs. optional

In the original issue description, it said:

this option should probably default to true. It possibly shouldn't even be optional, but just always be included in the generated output.

It is easy enough to just add a new include_data_types parameter though (defaulting to true) -- I think we should do it!

Thought process

For users, this would preserve optionality while still putting forward default behavior we're assuming would be best for the greatest number of users.

generate_source already has an include_data_types (optional, default=False) parameter, so it would align those names (even though the default would differ). Maybe we should change the default for generate_source at the same time?

linbug commented 1 year ago

Thanks @dbeatty10 for the quick and detailed responses!

For the testing, I had a look through that previous PR that you linked. I think that adopting the same approach (using text_type_value and integer_type_value macros) is the simplest (and easiest to read) solution for now and would align with what has already been shipped for generate_source. This could always be refactored in the future if we find a better approach. I don't think that skipping tests is an option without removing the existing tests, although perhaps we could adopt something similar to the tests for generate_source, where we make a include_data_types parameter and set it to false in the all tests but one (perhaps generate_model_yaml). There isn't additional value to testing this functionality in more than one test anyway.

How does that sound?

I'm aligned on adding a include_data_types parameter and defaulting to true. I do think that it makes sense to have both generate_source and generate_model_yaml take the same default, and personally I'm in favour of this being true (as long as this doesn't break anything for existing users)! I'd be happy to update that in this or a separate PR.

linbug commented 1 year ago

Something else I just noticed is that in generate_source, data_type is uppercase whereas here I'm lowering. We should be consistent about this too. Any preference? According to the dbt style guide, we should default to lowercase.

dbeatty10 commented 1 year ago

@linbug Your proposals for the testing sound good -- let's do it as you proposed.

For the common include_data_types parameter within generate_source and generate_model_yaml, let's do the following:

default both to true
lowercase both to align with the dbt style guide

It's fine if the changes related to generate_source are made within this PR or within a separate PR -- up to you.

Since there will be changes to how generate_source behaves by default, let's make sure to update the README and also call this out in the changelog.

The current version of dbt_codegen is 0.9.0, and we'll make sure to bump the next version to 0.10.0 (or maybe even 1.0.0!) so these changes don't break anything for folks that have the 0.9.x series as an upper bound.

linbug commented 1 year ago

@dbeatty10 I've made those changes now, and signed the CLA. The PR is ready for review.

dbt-labs / dbt-codegen