datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

pattern_add_dataset_dataproduct works for Oracle ingestion but not S3 #11656

Open mikeburke24 opened 1 month ago

mikeburke24 commented 1 month ago

Describe the bug

We are trying to automatically assign data products to datasets and their container during ingestion from S3. I have included the format of our transformer below:

To Reproduce

transformers:
    -
        type: pattern_add_dataset_dataproduct
        config:
            is_container: true
            dataset_to_data_product_urns_pattern:                
                rules:
                    '.*': 'urn:li:dataProduct:<DATA_PRODUCT_URN>'

However, the ingestion fails with the following message:

Failed to configure transformers: 1 validation error for PatternDatasetDataProductConfig
is_container
  extra fields not permitted (type=value_error.extra)

If we remove the is_container portion, the ingestion still fails with the message below:

ERROR :: /assets/0/destinationUrn :: field is required but not found and has no default value

Expected behavior

The documentation that you linked states that is_container is supported.

Additional context

This transformer format works fine for Oracle (if is_container is removed) but doesn't work for S3.
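
For readers unfamiliar with the rules block in the transformer config above, here is a minimal Python sketch of how a pattern-to-data-product mapping could be resolved. It is not DataHub's implementation: the helper name resolve_data_product, the example URNs, and the assumption that each rule key is a regex matched from the start of the dataset URN are all illustrative.

```python
import re
from typing import Dict, Optional


def resolve_data_product(dataset_urn: str, rules: Dict[str, str]) -> Optional[str]:
    """Return the data product URN of the first rule whose regex matches.

    Hypothetical helper: assumes rule keys are regexes applied with re.match.
    """
    for pattern, data_product_urn in rules.items():
        if re.match(pattern, dataset_urn):
            return data_product_urn
    return None


# Hypothetical values standing in for the redacted ones in the recipe.
rules = {".*": "urn:li:dataProduct:example"}
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket/my-file.csv,PROD)"

print(resolve_data_product(dataset_urn, rules))
```

Under these assumptions the catch-all '.*' rule assigns every dataset to the same data product, which matches the intent described in this report.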

hsheth2 commented 1 month ago

@mikeburke24 Looks like due to https://github.com/datahub-project/datahub/pull/10928, you probably want to be on server 0.14.1 and a CLI version that is 0.14.1.x.

That should solve both the is_container config issue and the error during emission.

mikeburke24 commented 3 weeks ago

@hsheth2 Hi, I've upgraded to GMS tag [1f02c84] and CLI 0.14.1.3, and we are still getting this error:

ERROR :: /assets/0/destinationUrn :: field is required but not found and has no default value

It does work for Oracle though. Do you have any ideas why it doesn't work for S3? Would you have any example syntax that might work?
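
The path in the error above reads like a pointer into the emitted payload: the first entry (index 0) of an assets array has no destinationUrn. The sketch below shows one way to check a payload for that condition before emitting it. The payload shape is an assumption inferred only from the error path, and find_missing_destination_urns is a hypothetical helper, not part of DataHub.

```python
from typing import Any, Dict, List


def find_missing_destination_urns(properties: Dict[str, Any]) -> List[str]:
    """Return paths such as '/assets/0/destinationUrn' for entries lacking that key."""
    missing = []
    for index, asset in enumerate(properties.get("assets", [])):
        # An absent or empty destinationUrn is treated as missing.
        if not asset.get("destinationUrn"):
            missing.append(f"/assets/{index}/destinationUrn")
    return missing


# Hypothetical payload: the first asset is empty, the second is well formed.
payload = {
    "name": "example",
    "assets": [
        {},
        {"destinationUrn": "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket/my-file.csv,PROD)"},
    ],
}

print(find_missing_destination_urns(payload))  # ['/assets/0/destinationUrn']
```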

jjoyce0510 commented 2 weeks ago

Interesting, it looks like we need to investigate the pattern_add_dataset_dataproduct transformer a bit more closely to determine why it would not be providing this field.

mikeburke24 commented 2 weeks ago

@jjoyce0510 thanks John! If you've ever got this to work or have any other example syntax please send it my way. I'm not sure what field it is looking for that it can't find. Here's an example I've tried on a local build

transformers:
    -
        type: pattern_add_dataset_dataproduct
        config:
            dataset_to_data_product_urns_pattern:
                rules:
                    '.*': 'urn:li:dataProduct:xxxxxxxx'

asikowitz commented 1 week ago

Can you post your full S3 recipe (redacted)? It seems like we have some bug where we emit an invalid MCP but I'm having trouble narrowing it down.

mikeburke24 commented 1 week ago

@asikowitz sure

source:
    type: s3
    config:
        path_specs:
            -
                include: 's3://<mybucket>/<myfile.csv>'
transformers:
    -
        type: pattern_add_dataset_dataproduct
        config:
            is_container: true
            dataset_to_data_product_urns_pattern:
                rules:
                    'urn:li:dataset:(urn:li:dataPlatform:s3,<mybucket>/<myfile.csv>,prod)': 'urn:li:dataProduct:<urn>'
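
One detail worth checking in the recipe above: the rule key is a full dataset URN. If rule keys are interpreted as regular expressions (an assumption here, suggested by the '.*' key used earlier in the thread), the unescaped parentheses act as a regex group instead of matching the literal parentheses in the URN. The Python sketch below demonstrates the difference with a hypothetical URN. Nothing in this thread establishes that this is related to the destinationUrn error.

```python
import re

# Hypothetical URN standing in for the redacted bucket and file.
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket/my-file.csv,PROD)"

# Unescaped: '(' and ')' form a regex group and '.' matches any character.
raw_pattern = dataset_urn
# Escaped: every regex metacharacter is matched literally.
escaped_pattern = re.escape(dataset_urn)

raw_matches = re.match(raw_pattern, dataset_urn) is not None
escaped_matches = re.match(escaped_pattern, dataset_urn) is not None

print(raw_matches, escaped_matches)  # False True
```

If the keys are indeed regexes, escaping the parentheses in the key (for example '\(' and '\)') or falling back to a broader pattern would be the usual way to get a match.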