fivetran / dbt_fivetran_log

Data models for Fivetran's internal log connector built using dbt.
https://fivetran.github.io/dbt_fivetran_log/
Apache License 2.0

[Bug] Error parsing JSON in model for auditing user activity #105

Closed: andysteinitz closed this 9 months ago

andysteinitz commented 10 months ago

Is there an existing issue for this?

Describe the issue

This is similar to https://github.com/fivetran/dbt_fivetran_log/issues/46. The use of `fivetran_utils.json_parse` introduced in https://github.com/fivetran/dbt_fivetran_log/pull/98 and released in v1.3.0 of dbt_fivetran_log does not account for malformed JSON:

https://github.com/fivetran/dbt_fivetran_log/blob/78c4398d9b8f55e79b88c151d600b7c64c56ceb3/models/fivetran_platform__audit_user_activity.sql#L3-L6

`parse_json` results in the error `Error parsing JSON: unknown keyword "skipper_breech", pos 15`. "skipper_breech" is the id of a long-obsolete Transformation in our logs, and it looks like `message_data` values from before 2021-05-12 were not structured as JSON:

```
skipper_breech([redacted destination]) run=37036084 Transformation run has succeeded. Result: The transformation has completed successfully.
```

Using `try_parse_json` allows the query to complete successfully.

Relevant error log or model output

The following will result in `Error parsing JSON: unknown keyword "skipper_breech", pos 15`:

```sql
with logs as (

    select
        *,
        parse_json(message_data)['actor'] as actor_email
    from transformed.stg_fivetran_platform.stg_fivetran_platform__log
)

, user_logs as (

    select *
    from logs
    where actor_email is not null
        and lower(actor_email) != 'fivetran'
)

select *
from user_logs
```

Expected behavior

Values not stored as JSON are bypassed instead of triggering an error. In Snowflake we can use `try_parse_json`.
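For reference, a minimal Snowflake sketch of the difference (the literal values here are made up for illustration):

```sql
-- parse_json raises an error on non-JSON input:
-- select parse_json('skipper_breech(...) run=37036084 Transformation run has succeeded.');  -- fails

-- try_parse_json returns NULL instead, so downstream filters can simply skip those rows:
select
    try_parse_json('{"actor": "user@example.com"}')['actor'] as parsed_ok,
    try_parse_json('skipper_breech(...) Transformation run has succeeded.') as parsed_null;
```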

dbt Project configurations

```yaml
models:
  fivetran_log:
    schema: fivetran_platform
    staging:
      schema: stg_fivetran_platform
```

```yaml
vars:
  fivetran_log:
    fivetran_platform_database: RAW
    fivetran_platform_schema: FIVETRAN_LOG
    fivetran_platform_using_transformations: false # this will disable all transformation + trigger_table logic
    fivetran_platform_using_triggers: false # this will disable only trigger_table logic
    fivetran_platform_using_account_membership: false # this will disable only the account membership logic
    fivetran_platform_using_destination_membership: false # this will disable only the destination membership logic
    fivetran_platform_using_user: false # this will disable only the user logic
```

Package versions

packages:

What database are you using dbt with?

snowflake

dbt Version

1.7

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-catfritz commented 10 months ago

Hi @andysteinitz thank you for opening this issue, and thanks for calling out the `try_parse_json` solution! I was able to reproduce the issue and found a fix that works on my end: I swapped `fivetran_utils.json_parse()` for `fivetran_utils.json_extract()`, since `json_extract` uses `try_parse_json` for Snowflake. We typically use the `json_parse` version because it lets you specify a path, but in this particular case we only need to go one level deep.
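For anyone following along, a sketch of what that swap might look like in the model; the exact fivetran_utils macro argument names here are assumptions, not the actual diff:

```sql
-- before (errors on non-JSON message_data in Snowflake):
{{ fivetran_utils.json_parse(string="message_data", string_path=["actor"]) }} as actor_email

-- after (json_extract uses try_parse_json on Snowflake, yielding NULL for malformed rows):
{{ fivetran_utils.json_extract(string="message_data", string_path="actor") }} as actor_email
```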

I have created a test branch--would you be able to test that it works correctly for you? You can install the test branch using the below snippet.

```yaml
packages:
  - git: https://github.com/fivetran/dbt_fivetran_log.git
    revision: test/try-json
    warn-unpinned: false
```

Let me know how it works for you!

andysteinitz commented 10 months ago

Hi @fivetran-catfritz yes, this does allow the model to succeed. Unfortunately I still can't install the new package because old log entries cause the dependent test `unique_fivetran_platform__audit_user_activity_log_id` to fail. Our raw `fivetran_platform.log` data appears to contain `connector_id` values in the `id` field prior to April 2022 (about the same time https://github.com/fivetran/dbt_fivetran_log/issues/46 was resolved). In cases where `connector_id` was null, it looks like the log populated `id` with "system". Let me know how best to proceed with sharing those details here or elsewhere, such as in a Fivetran support ticket.

fivetran-catfritz commented 10 months ago

Hi @andysteinitz thanks for testing that out! It looks like my test branch crossed paths with the latest version release, and my branch missed a test update for that model. I updated the uniqueness test to a combination of `log_id` and `occurred_at`. If you run `dbt deps` again and rerun the tests, does it pass for you now?
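For reference, the combined uniqueness test would look something like this in the model's schema yml (a sketch using the `dbt_utils` test, not the package's exact file):

```yaml
models:
  - name: fivetran_platform__audit_user_activity
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - log_id
            - occurred_at
```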

andysteinitz commented 10 months ago

Thank you, confirming the test `dbt_utils_unique_combination_of_columns_fivetran_platform__audit_user_activity_log_id__occurred_at` passes for me.

Side note FYI that the new `dbt deps` process with `package-lock.yml` doesn't seem fully smoothed out. I had to hard-delete my local `package-lock.yml` to get the update to reflect in my local repo; `dbt clean` / `dbt deps` alone wasn't achieving the update, nor was `dbt deps --lock`.
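The workaround described above, as a command sequence (run from the dbt project root):

```
# dbt clean / dbt deps alone did not pick up the new revision;
# removing the lockfile forces dbt deps to re-resolve all packages.
rm package-lock.yml
dbt deps
```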

fivetran-catfritz commented 10 months ago

@andysteinitz Thanks for confirming and the additional information! We've also noticed hard deleting the package-lock seems to be required at times, so it's interesting to know you're experiencing this as well.

As for updating the package, we do plan to make a fix for the issue you're seeing; however, we realized we'll have to do a bit more work to find fixes for all the warehouses we support. I will leave the test branch up if you wish to use it in the meantime and will let you know when we know more about the timing. Also, I will be taking some time off starting next week, but another member of my team should be able to help if you have any questions in the meantime!

fivetran-joemarkiewicz commented 10 months ago

Hi @andysteinitz I wanted to chime in as I think I may have found a more scalable solution when accounting for the behavior across warehouses. If you have availability, would you be able to attempt running the following version of the package and let us know if this results in a successful run.

```yaml
packages:
  - git: https://github.com/fivetran/dbt_fivetran_log.git
    revision: bugfix/json-actor-filter
    warn-unpinned: false
```

Essentially, this branch includes a change similar to the initial fix from the other ticket you linked: we are filtering out the records we do not care about. This should ensure the JSON parsing only happens on records that have the proper data to be extracted.

Let us know if this works. If so, we will likely move forward with this approach.
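As a sketch of the filtering approach (the exact predicate in the branch may differ; the `like '{%'` check is an assumption for illustration):

```sql
with logs as (

    select
        *,
        parse_json(message_data)['actor'] as actor_email
    from transformed.stg_fivetran_platform.stg_fivetran_platform__log
    -- only attempt JSON parsing on rows that actually look like JSON objects
    where message_data like '{%'
)

select *
from logs
where actor_email is not null
    and lower(actor_email) != 'fivetran'
```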

andysteinitz commented 10 months ago

Hello again @fivetran-joemarkiewicz. Yes, your version `revision: bugfix/json-actor-filter` also completed successfully. You can proceed with either solution.

fivetran-joemarkiewicz commented 9 months ago

That's great to hear! Thanks for confirming. I will start the formal PR process with the change I just recently suggested. You can likely expect an official release after the holidays.

fivetran-joemarkiewicz commented 9 months ago

This fix has since been included in the main branch and is live within the v1.4.1 release of the dbt_fivetran_log package. Thanks again @andysteinitz for bringing this to our attention and for helping validate the change before release.

As the issue has been addressed, I will close out this ticket. Please feel free to reopen the issue or create a new one if you are still experiencing the same or a similar issue following the upgrade.