awslabs / python-deequ

Python API for Deequ
Apache License 2.0

PyDeequ on Azure Databricks: running the profiler raises KeyError #148

Open dilkushpatel opened 10 months ago

dilkushpatel commented 10 months ago

Code:

    from pydeequ.profiles import *

    result = ColumnProfilerRunner(spark) \
        .onData(df) \
        .run()

    for col, profile in result.profiles.items():
        print(profile)

Error: KeyError: 'StringColumnProfile'

    KeyError                                  Traceback (most recent call last)
    File :5
          1 from pydeequ.profiles import *
          3 result = ColumnProfilerRunner(spark) \
          4     .onData(df) \
    ----> 5     .run()
          7 for col, profile in result.profiles.items():
          8     print(profile)

The data types are mostly string and decimal.

iagochoa commented 10 months ago

Any update on this question? I'm facing the same problem here.

I'm using PyDeequ v1.1.0, Spark 3.3.2, and the com.amazon.deequ:deequ:2.0.4-spark-3.3 library.

The same issue has also been reported on StackOverflow.
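Since the reports in this thread depend on the exact version combination, here is a minimal sketch for collecting the Python-side versions to include in a report. The helper name `installed_versions` is hypothetical and not part of PyDeequ; it relies only on the standard library.

```python
from importlib import metadata
from typing import Dict, Iterable


def installed_versions(packages: Iterable[str] = ("pydeequ", "pyspark")) -> Dict[str, str]:
    """Return the installed version of each package, or 'not installed'.

    Hypothetical helper for bug reports; not part of PyDeequ.
    """
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

This only covers Python packages. The Deequ jar version (for example the 2.0.4-spark-3.3 coordinate above) is attached to the cluster or Spark session and has to be reported separately.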

drewshiki commented 10 months ago

Same here, please help.

chenliu0831 commented 10 months ago

Can we get a minimal reproducing dataset? Perhaps just the schema and data type.

Awes35 commented 9 months ago

> Can we get a minimal reproducing dataset? Perhaps just the schema and data type.

I am having the same issue. It only occurs for string columns; numeric ones are fine. Here is an example that reproduces the error:

    import pandas as pd

    pd_df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})
    tbl_df = spark.createDataFrame(pd_df)

    from pydeequ.profiles import *

    result = ColumnProfilerRunner(spark) \
        .onData(tbl_df) \
        .run()

    for col, profile in result.profiles.items():
        print(profile)

This yields the error: KeyError: 'StringColumnProfile'

I tried limiting it to certain columns using ColumnProfilerRunBuilder below:

    from pydeequ.profiles import *

    result2 = ColumnProfilerRunBuilder(spark, tbl_df) \
        .restrictToColumns(['col1', 'col2']) \
        .run()

    for col, profile in result2.profiles.items():
        print(profile)

If I only use col1, it performs fine, but col2 yields the same error as before.
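The column-by-column check described above can be automated. The following is a minimal sketch, not part of PyDeequ: `split_profilable_columns` is a hypothetical helper that calls a caller-supplied function once per column and records which columns raise `KeyError`. The PyDeequ call is left in a comment because it needs a live Spark session.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def split_profilable_columns(
    columns: Iterable[str],
    profile_one: Callable[[str], object],
) -> Tuple[List[str], Dict[str, str]]:
    """Profile each column on its own and separate successes from KeyError failures.

    `profile_one` is any callable that profiles a single column and raises on failure.
    """
    ok: List[str] = []
    failed: Dict[str, str] = {}
    for name in columns:
        try:
            profile_one(name)
        except KeyError as exc:
            # Keep the error text so the report shows which profile type was missing.
            failed[name] = repr(exc)
        else:
            ok.append(name)
    return ok, failed


# Usage with the names from the example above (requires a live Spark session):
# from pydeequ.profiles import ColumnProfilerRunBuilder
# ok, failed = split_profilable_columns(
#     tbl_df.columns,
#     lambda c: ColumnProfilerRunBuilder(spark, tbl_df).restrictToColumns([c]).run(),
# )
# print("profiled:", ok)
# print("failed:", failed)
```

Running one profile per column is slower than a single pass, so this is meant for narrowing down a failure rather than for regular profiling.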

See the full error log below: errorlog.txt

julet85 commented 8 months ago

Any updates on this issue?

chenliu0831 commented 8 months ago

Running into this in the Spark 3.4 upgrade PR - https://github.com/awslabs/python-deequ/actions/runs/6645918957/job/18058255456?pr=168. Not sure about the root cause yet.

RobertasPetrauskas1 commented 3 months ago

Any updates on this?

Update: Worked around this issue by downgrading from deequ 2.0.4 to 2.0.3.
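For anyone who creates the Spark session themselves, the downgrade can be sketched as below. This is a sketch under assumptions: the coordinate follows the `com.amazon.deequ:deequ:<version>-spark-<spark>` pattern quoted earlier in this thread, and it assumes a 2.0.3 build is published for Spark 3.3. The helper names `deequ_coordinate` and `build_session` are hypothetical.

```python
def deequ_coordinate(deequ_version: str, spark_version: str) -> str:
    """Build a Maven coordinate following the pattern quoted in this thread."""
    return f"com.amazon.deequ:deequ:{deequ_version}-spark-{spark_version}"


# Assumption: a 2.0.3 artifact exists for Spark 3.3; check the repository before relying on it.
PINNED_DEEQU = deequ_coordinate("2.0.3", "3.3")


def build_session(app_name: str = "deequ-profiling"):
    """Create a SparkSession that pulls the pinned Deequ jar."""
    # Imported lazily so the coordinate helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", PINNED_DEEQU)
        .getOrCreate()
    )


if __name__ == "__main__":
    print(PINNED_DEEQU)
```

On Databricks the session already exists when a notebook starts, so the equivalent step there would be to replace the 2.0.4 Maven library attached to the cluster with the 2.0.3 coordinate and restart the cluster.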