fivetran / dbt_fivetran_log

Data models for Fivetran's internal log connector built using dbt.
https://fivetran.github.io/dbt_fivetran_log/
Apache License 2.0
30 stars 23 forks source link

[Bug] Version 1.1.0 Breaking in Postgres #91

Open bcutler2919 opened 9 months ago

bcutler2919 commented 9 months ago

Is there an existing issue for this?

Describe the issue

When trying to run the new package version in Postgres I'm receiving errors when running dbt run --select fivetran_log --full-refresh

Relevant error log or model output

Error in model stg_fivetran_platform__log (models/staging/stg_fivetran_platform__log.sql)
18:59:38    operator does not exist: json ~~ unknown
18:59:38    LINE 118: ...en transformation_id is not null and message_data like '%has...
18:59:38                                                                   ^
18:59:38    HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
18:59:38    compiled Code at target/run/fivetran_log/models/staging/stg_fivetran_platform__log.sql

Expected behavior

I would expect all staging and production models to be created with no issue.

dbt Project configurations

Name: 'project'
version: '1.0.0'
config-version: 2
profile: 'default'

vars:
  # this will disable all transformation + trigger_table logic
  fivetran_platform_using_transformations: false
  # this will disable only trigger_table logic
  fivetran_platform_using_triggers: false
  # this will disable only the destination membership logic
  fivetran_platform_using_destination_membership: false

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"

Package versions

packages:
  - package: fivetran/fivetran_log
    version: 1.1.0

What database are you using dbt with?

postgres

dbt Version

dbt version: 1.6.5 dbt-postgres version: 1.6.5

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-joemarkiewicz commented 9 months ago

Hi @bcutler2919 thanks for opening this issue. Did the package work for you previously? For example, does an earlier version of the package work for you?

I mainly ask because it seems we have not made any changes to that code for the last 3 years πŸ€”

image

bcutler2919 commented 9 months ago

Yes, it did. I think the issue is coming from Fivetran's update to the log connector. Previously that column was showing up as text in our data warehouse and it's now showing up as json

fivetran-joemarkiewicz commented 9 months ago

Thanks for confirming @bcutler2919. With that, we will want to update the package then to account for this datatype change in the connector. I was experimenting and believe a solution would be to just convert the datatype back to a string so we can perform a where like on the field.

Would you be able to attempt to try the following version of the package and let me know if this error is resolved.

packages:
  - git: https://github.com/fivetran/dbt_fivetran_log.git
    revision: fivetran-joemarkiewicz-patch-2
    warn-unpinned: false 
bcutler2919 commented 9 months ago

That seemed to work but the fivetran_platform__audit table has nothing in it. Looking at the stg_fivetran_platform__log table, event_subtype looks like a json string Screenshot 2023-10-06 at 10 08 47 AM

but previously it was more of a categorical column. Was this the intended behavior? image

fivetran-joemarkiewicz commented 9 months ago

@bcutler2919 my appologies I had done a copy error when making the updates in that branch and replaced message_event with message_data resulting in the strange behavior you are seeing.

I just pushed changes to the branch to fix this copy error. When you have a change could you rerun the models and let me know if the results are what you are expecting.

kevinmalloy commented 9 months ago

This may be more of an issue on the fivetran application side of things but I noticed the fivetran_log.log table that Fivetran is populating has the message_data in a stringified json format.

You can see it a bit in @bcutler2919's screenshot.

To get the database to properly recognize the field as json I had to first format it:

select 
replace(trim('"' from (message_data)::text), '\"', '"')::json #>> '{table}' as fixed_extract,
message_data::jsonb #>> '{"table"}' as legacy_extract
from "fivetran"."analytics_data_stg_fivetran_platform"."stg_fivetran_platform__log"
where event_subtype in ('sync_start', 'sync_end', 'write_to_table_start', 'write_to_table_end', 'records_modified');

Not sure if that's something that needs to be taken care of through dbt or not.

bcutler2919 commented 9 months ago

@fivetran-joemarkiewicz I tried rerunning the fivetran_log project Friday and was unable to get the project to complete the run. I let the models go for 4+ hours and it got hung up on the fivetran_platform__audit_table. To @kevinmalloy-untitled's point, even if the models had completed, I'm not so sure that table would have populated correctly. I'm not sure what's causing the hang up though? Could it be the conversions to text in stg_fivetran_platfrom__log and then the json_parsing in fivetran_platform__audit_table?

fivetran-joemarkiewicz commented 9 months ago

@bcutler2919 thanks for getting back and letting us know you were seeing the models run for quite some time. I wouldn't have assumed the conversion of the json to text would have resulted in much time πŸ€”. Are you able to see how long the stg_fivetran_log__log table took? Also, how long did the audit table used to take when it compiled properly?

fivetran-joemarkiewicz commented 9 months ago

Let me also confirm with our product team on @kevinmalloy-untitled point above to see if this stringified json is expected in the source table.

bcutler2919 commented 9 months ago

The stg_fivetran_platform__log table took ~93 seconds to complete and about ~105 seconds (for the stg_fivetran_log__log table) prior to the new release. The audit table varies a lot since it's an incremental model but I'm seeing anywhere from less than a second to 83 seconds. I assumed the model would take longer from doing a full refresh (per the release notes), but 4 hours still seems excessive.

fivetran-joemarkiewicz commented 9 months ago

Thanks for sharing @bcutler2919! Yeah it would take longer with the full refresh, but I still would not expect it to get hung up for 4+ hours. I am still following up with our product team to understand the stringify version of the json field, but in the meantime I may edit the branch I sent over to remove the json parsing to confirm if that is in fact the issue why the model is getting hung up.

fivetran-joemarkiewicz commented 9 months ago

@bcutler2919 if you have a moment I would be curious if you find the below branch takes as long as the previous to run. The main change is that I removed the json parsing logic to just return a string or int. If this takes significantly less time then that really narrows in the issue to the format of the json records.

packages:
  - git: https://github.com/fivetran/dbt_fivetran_log.git
    revision: fivetran-joemarkiewicz-patch-3
    warn-unpinned: false 
bcutler2919 commented 9 months ago

@fivetran-joemarkiewicz I'm getting an error on the audit table saying:

image

fivetran-joemarkiewicz commented 9 months ago

Hi @bcutler2919 apologies as I didn't see this causing an error on my side originally. I just substituted the string for a null and it seems to have compiled properly on my side. Would you be able to give it another try when you have a moment?

Additionally, I have been able to follow up with our product team and they confirmed that the JSON format you are seeing is not expected and is something they will want to investigate and likely update via the connector. Unfortunately, these GitHub issues are only intended for dbt package updates. In order for our connector team to properly work through this underlying data discrepancy we would appreciate if you are able to create a Fivetran Support Ticket detailing the JSON format issue you are seeing and how it is causing issues in your downstream transformations.

I would also include in your ticket that you have already attempted to troubleshoot the issue with the dbt package team. This will help them move the ticket properly to the connector team. Let me know if you have any questions on next steps. Thank you!

bcutler2919 commented 9 months ago

@fivetran-joemarkiewicz The model was still running longer than 4 hours 😞 I did submit a support ticket to Fivetran though. Thanks for being so helpful!

fivetran-joemarkiewicz commented 9 months ago

Sorry to hear that it was still running for more than 4 hours. That does still seem excessive πŸ€” even with the removal of the JSON parsing. Thank you for opening the support ticket. I will mark this as won't fix for the time being unless there are some findings from the eng ticket that inform us on changes that should be applied to the package.