Open dilkushpatel opened 10 months ago
Any update on this question? I'm facing the same problem here.
I'm using PyDeequ v1.1.0, Spark 3.3.2, and the com.amazon.deequ:deequ:2.0.4-spark-3.3 library.
Same issue found here on this StackOverflow link.
Same here, please help.
Can we get a minimal reproducing dataset? Perhaps just the schema and data type.
I am having the same issue. It only occurs for string columns; numeric ones are fine. Here's an example that reproduces the error:
import pandas as pd
pd_df = pd.DataFrame({'col1':[1,2,3], 'col2':['a','b','c']})
tbl_df = spark.createDataFrame(pd_df)
from pydeequ.profiles import *
result = ColumnProfilerRunner(spark) \
    .onData(tbl_df) \
    .run()
for col, profile in result.profiles.items():
    print(profile)
Yields the error: KeyError: 'StringColumnProfile'
I tried limiting it to certain columns using ColumnProfilerRunBuilder below:
from pydeequ.profiles import *
result2 = ColumnProfilerRunBuilder(spark, tbl_df) \
    .restrictToColumns(['col1','col2']) \
    .run()
for col, profile in result2.profiles.items():
    print(profile)
If I only use col1, it performs fine, but col2 yields the same error as before.
See the full error log below: errorlog.txt
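Since restricting the run to col1 works, one stopgap is to profile only the non-string columns until the underlying bug is fixed. Below is a minimal sketch of that idea, not a fix for the root cause. The helper name `non_string_columns` is made up for illustration, and it assumes the `(name, type)` pairs come from a PySpark DataFrame's `.dtypes`, which for the example above should look like the hard-coded list.

```python
def non_string_columns(dtypes):
    """Return the column names whose Spark type string is not 'string'.

    `dtypes` is a list of (name, type) pairs, the shape returned by a
    PySpark DataFrame's `.dtypes` attribute.
    """
    return [name for name, dtype in dtypes if dtype != "string"]


# Assumed shape of tbl_df.dtypes for the reproduction above.
example_dtypes = [("col1", "bigint"), ("col2", "string")]
safe_cols = non_string_columns(example_dtypes)
print(safe_cols)

# Hypothetical usage against a live session (not run here):
# result = ColumnProfilerRunner(spark) \
#     .onData(tbl_df) \
#     .restrictToColumns(non_string_columns(tbl_df.dtypes)) \
#     .run()
```

This skips string columns entirely, so it only helps when profiles for the numeric columns are enough.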
Any updates on this issue?
Running into this in the Spark 3.4 upgrade PR - https://github.com/awslabs/python-deequ/actions/runs/6645918957/job/18058255456?pr=168. Not sure about the root cause yet.
Any updates on this?
Update: Worked around this issue by downgrading from deequ 2.0.4 to 2.0.3.
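For anyone trying the same downgrade, here is a rough sketch of pinning the Deequ jar. It assumes the jar is resolved through `spark.jars.packages` and that a `2.0.3-spark-3.3` build suits your Spark version. The helper `deequ_coord` is hypothetical, and the coordinate format simply mirrors the one quoted earlier in this thread.

```python
def deequ_coord(deequ_version: str, spark_minor: str) -> str:
    """Build a Maven coordinate in the form quoted earlier in this thread."""
    return f"com.amazon.deequ:deequ:{deequ_version}-spark-{spark_minor}"


# Pin to 2.0.3 instead of 2.0.4 (assumed to match a Spark 3.3.x cluster).
PINNED = deequ_coord("2.0.3", "3.3")
print(PINNED)

# Hypothetical session setup (requires pyspark; not run here):
# from pyspark.sql import SparkSession
# spark = (
#     SparkSession.builder
#     .config("spark.jars.packages", PINNED)
#     .getOrCreate()
# )
```

If the jar is installed some other way, for example as a cluster library, the version would need to be changed there instead.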
Code:
from pydeequ.profiles import *
result = ColumnProfilerRunner(spark) \
    .onData(df) \
    .run()
for col, profile in result.profiles.items():
    print(profile)
Error: KeyError: 'StringColumnProfile'
KeyError                                  Traceback (most recent call last)
File:5
      1 from pydeequ.profiles import *
      3 result = ColumnProfilerRunner(spark) \
      4     .onData(df) \
----> 5     .run()
      7 for col, profile in result.profiles.items():
      8     print(profile)
The data types are mostly string and decimal.