Closed · marrov closed this 2 weeks ago
Thanks @marrov for finding this shortcoming, and also for your engagement with this project. Appreciate it. A tuple with both dataframe instances feels like a good solution.

To simplify the validation and streamline the engine selection, I can think of an alternative approach that uses `str` instead of the class. The comparison would then not require importing `ConnectDataFrame`, and could instead simply operate with `type`, `str`, and potentially `re`.
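As a sketch of that idea (illustrative names only, not cuallee's actual API), the string representation of the dataframe's type can be inspected directly, with no engine-specific imports:

```python
import re

# Illustrative sketch: classify an object by the string form of its type,
# so no engine-specific class (e.g. ConnectDataFrame) needs to be imported.
def dataframe_kind(obj) -> str:
    """Return the dotted class path of obj's type, e.g. 'pandas.core.frame.DataFrame'."""
    return re.match(r".*'(.*)'", str(type(obj))).group(1)

class DataFrame:  # stand-in for pyspark.sql.dataframe.DataFrame
    pass

kind = dataframe_kind(DataFrame())
print(kind)  # a dotted path ending in 'DataFrame'
```

The check then becomes a substring or regex test on `kind`, rather than an `isinstance` call against imported classes.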
Because we are working on the new cuallee `v1.0`, which brings a refactor, less code, a higher maintainability score, fresh docs, videos, and machine-learning-enabled checks, you can find an example of what I mentioned above in the working branch `feature-dtype`, which removes all the type inference from the main `__init__.py` file.
As an example, the compute engines are now substantially less complex to understand via:

```python
self.dtype = first(re.match(r".*'(.*)'", str(type(dataframe))).groups())
match self.dtype:
    case self.dtype if "pyspark" in self.dtype:
        self.compute_engine = importlib.import_module("cuallee.pyspark_validation")
    case self.dtype if "pandas" in self.dtype:
        self.compute_engine = importlib.import_module("cuallee.pandas_validation")
    case self.dtype if "snowpark" in self.dtype:
        self.compute_engine = importlib.import_module("cuallee.snowpark_validation")
    case self.dtype if "polars" in self.dtype:
        self.compute_engine = importlib.import_module("cuallee.polars_validation")
    case self.dtype if "duckdb" in self.dtype:
        self.compute_engine = importlib.import_module("cuallee.duckdb_validation")
    case self.dtype if "bigquery" in self.dtype:
        self.compute_engine = importlib.import_module("cuallee.bigquery_validation")
    case self.dtype if "daft" in self.dtype:
        self.compute_engine = importlib.import_module("cuallee.daft_validation")
    case _:
        raise NotImplementedError(f"{self.dtype} is not yet implemented in cuallee")
```
What do you think?
Fixed in #315
Thank you for an amazing data quality library! It is amazing that you fixed this so quickly; I'll remove my fork and close this issue. I actually did not think of replacing `isinstance` checks with `str` and `re`, but that seems like a great lightweight approach!
Describe the bug
I have been using `cuallee`'s pyspark API for some time, but I came across an issue that is, admittedly, not `cuallee`'s fault, but it limits its applicability in my use case. The gist of it is that if you are running `databricks-connect` to access data in a Databricks environment through the pyspark API, the type of pyspark DataFrames is not the same as when using regular spark:

```
<class 'pyspark.sql.dataframe.DataFrame'>
<class 'pyspark.sql.connect.dataframe.DataFrame'>
```

This causes the following error when a custom function is run:
Note that this is actually not an issue with the built-in checks; I have not checked why. A simple solution to this would be to check against a tuple of types:
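A minimal sketch of that tuple-based check, with stub classes standing in for the two pyspark DataFrame types:

```python
# Hypothetical sketch: the stubs below stand in for
# pyspark.sql.dataframe.DataFrame and pyspark.sql.connect.dataframe.DataFrame.
class DataFrame:
    pass

class ConnectDataFrame:
    pass

def is_spark_dataframe(df) -> bool:
    # isinstance accepts a tuple of types, so both the classic and the
    # databricks-connect DataFrame classes pass the check.
    return isinstance(df, (DataFrame, ConnectDataFrame))

print(is_spark_dataframe(ConnectDataFrame()))  # True
```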
I have made a fork of the main branch and added this change [here](https://github.com/marrov/cuallee/tree/bug/connect-dataframe), as I really need this feature, but I will not open a PR until it is clear that this is functionality the devs are OK with supporting.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The validation should not fail for a `pyspark.sql.connect.dataframe.DataFrame`.
Desktop: