awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Odd Behavior for AnalysisRunner.useRepository(repository).saveOrAppendResult(resultKey) #147

Open psyking841 opened 10 months ago

psyking841 commented 10 months ago

Describe the bug

Scenario 1: with repository = FileSystemMetricsRepository(spark, "s3://bucket/run=0/"), PyDeequ generates a file named "run=0" rather than a folder. See the red box in Snapshot 1 below.

Scenario 2: with repository = FileSystemMetricsRepository(spark, "s3://bucket/run=0/metrics.json"), it correctly generates a folder named run=0 (the green box in Snapshot 1). However, as Snapshot 2 shows, it does not create the metrics.json file. Instead, it generates 3 files with UUIDs as file names, each corresponding to a tag.

Is this the expected behavior?

To Reproduce

Steps to reproduce the behavior:

  1. Create the repository using one of the code snippets above.
  2. Create an AnalysisRunner with that repository and write the results to S3.
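
The two steps above can be sketched roughly as follows. This is a hedged reconstruction, not the reporter's actual code: the bucket name, the DataFrame, the Size analyzer, and the tag values are placeholders, and the helper names repository_path and run_analysis are hypothetical. It assumes PyDeequ is installed and that the Spark session was started with the Deequ jar and S3 access.

```python
def repository_path(bucket: str, run: int, filename: str = "") -> str:
    """Build the S3 path passed to FileSystemMetricsRepository.

    An empty filename gives the Scenario 1 path (ends with a slash);
    a non-empty filename gives the Scenario 2 path.
    """
    return f"s3://{bucket}/run={run}/{filename}"


def run_analysis(spark, df, path: str, tags: dict):
    """Run one analysis on df and save its metrics to the repository at path."""
    # Imported lazily so repository_path can be used without PyDeequ installed.
    from pydeequ.analyzers import AnalysisRunner, Size
    from pydeequ.repository import FileSystemMetricsRepository, ResultKey

    repository = FileSystemMetricsRepository(spark, path)
    result_key = ResultKey(spark, ResultKey.current_milli_time(), tags)

    return (
        AnalysisRunner(spark)
        .onData(df)
        .addAnalyzer(Size())  # placeholder analyzer, not from the report
        .useRepository(repository)
        .saveOrAppendResult(result_key)
        .run()
    )
```

Under these assumptions, Scenario 1 would call run_analysis with repository_path("bucket", 0) and Scenario 2 with repository_path("bucket", 0, "metrics.json"), once per tag.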

Expected behavior

In Scenario 1, I would expect PyDeequ to write 3 files under the run=0/ folder. In Scenario 2, I would expect PyDeequ to write one file under the run=0/ folder.

Screenshots

psyking841 commented 10 months ago

I got Scenario 2 working now: repository = FileSystemMetricsRepository(spark, "s3://bucket/run=0/metrics.json") appends metrics to a single file named metrics.json. I am still not sure why it did not work for me before.
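
One way to guard against the Scenario 1 outcome is to check that the repository path names a file rather than a folder before creating the repository. The sketch below is an illustration only: is_file_like_path is a hypothetical helper, and the ".json" suffix rule is an assumption based on the working path in this comment, not a documented PyDeequ requirement.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_file_like_path(path: str, suffix: str = ".json") -> bool:
    """Return True if path ends in a file name with the given suffix."""
    key = urlparse(path).path  # e.g. "/run=0/metrics.json" for an s3:// URL
    if key.endswith("/"):
        # A trailing slash denotes a folder-style prefix, as in Scenario 1.
        return False
    return PurePosixPath(key).suffix == suffix
```

A caller could raise an error or append a default file name when this check returns False for the path it is about to pass to FileSystemMetricsRepository.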