Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

dataframe equality #2590

Open universalmind303 opened 4 months ago

universalmind303 commented 4 months ago

Is your feature request related to a problem? Please describe.

I'd like to be able to check whether two dataframes are equal to one another.

import daft
import numpy as np
arr = np.arange(100)
df1 = daft.from_pydict({"a": arr})
df2 = daft.from_pydict({"a": arr})

assert df1 == df2
# AssertionError

Describe the solution you'd like

I think there are a few things to consider here. Since a dataframe can be either loaded or unloaded, we'd probably need some logic to check a few things before comparing the actual values.

  1. Are they both loaded, or both unloaded?
  2. Are the schemas equal?
  3. Is any other metadata different?
  4. Are the row counts the same?
  5. Finally, compare the values.
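
A rough sketch of that ordering, cheapest checks first. It is not Daft code: each dataframe is stood in for by a plain dict of column name to list of values, "schema" is reduced to column names and their order, and the loaded/unloaded and metadata checks are left out because they depend on engine internals. The name `compare_frames` is hypothetical.

```python
def compare_frames(left, right):
    """Return (equal, reason), running the cheapest checks first.

    `left` and `right` are plain dicts of column name -> list of values,
    used here as a stand-in for real dataframes (an assumption of this sketch).
    """
    # Schema check, simplified to column names in the same order.
    if list(left) != list(right):
        return False, "schema"

    # Row-count check: no need to look at values if the lengths differ.
    if [len(col) for col in left.values()] != [len(col) for col in right.values()]:
        return False, "row count"

    # Only now compare the values, column by column.
    for name in left:
        if left[name] != right[name]:
            return False, f"values in column {name!r}"

    return True, None
```

Returning the reason alongside the boolean is only a convenience for debugging; an `__eq__` implementation would return just the boolean.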

I think using the __eq__ method is fine, but a .equals method would allow for more configuration, such as null handling:

df1.equals(df2)
df1.equals(df2, null_eq=True)
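
One possible reading of `null_eq`, sketched on plain Python lists with `None` standing in for null. The semantics are an assumption, not something the proposal spells out: with `null_eq=False` a null never matches anything (as in SQL), and with `null_eq=True` two nulls in the same position count as equal. The helper name `column_equal` is hypothetical.

```python
def column_equal(left, right, null_eq=False):
    """Compare two columns (lists) element by element.

    Assumed semantics: a null (None) matches only another null,
    and only when null_eq is True.
    """
    if len(left) != len(right):
        return False
    for x, y in zip(left, right):
        if x is None or y is None:
            # At least one side is null: equal only if both are and null_eq is set.
            if not (null_eq and x is None and y is None):
                return False
        elif x != y:
            return False
    return True
```

Under this reading, a column containing a null is not even equal to itself unless `null_eq=True`, which is one reason to make the behavior configurable.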

Describe alternatives you've considered

Manually comparing dataframes.

jaychia commented 4 months ago

Any thoughts on partitioning as well? Two dataframes could contain the same data (in the same order) globally while being partitioned differently.
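
To make the partitioning question concrete, here is a toy illustration in which each dataframe is modeled as a list of partitions and each partition as a list of rows. This is only a stand-in for real partitioned data, and `same_global_rows` is a hypothetical helper name.

```python
from itertools import chain


def same_global_rows(left_parts, right_parts):
    """Compare two partitioned datasets by their flattened, in-order contents."""
    left_rows = list(chain.from_iterable(left_parts))
    right_rows = list(chain.from_iterable(right_parts))
    return left_rows == right_rows


# Same rows in the same global order, split at different points.
left = [[1, 2], [3, 4]]
right = [[1], [2, 3, 4]]

partitionwise_equal = left == right            # False: the partition boundaries differ
globally_equal = same_global_rows(left, right)  # True: the flattened rows match
```

Whether equality should follow `partitionwise_equal` or `globally_equal` is exactly the design decision being asked about.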

I feel like perhaps the safest option might just be to compare the logical plans...
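
A toy illustration of what comparing plans would and would not catch. The `Scan` and `Filter` classes below are hypothetical and are not Daft's plan representation; the file name is made up. Frozen dataclasses give structural equality for free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """Hypothetical plan node: read rows from a source."""
    path: str


@dataclass(frozen=True)
class Filter:
    """Hypothetical plan node: keep the child's rows matching a predicate."""
    child: object
    predicate: str


plan_a = Filter(Scan("data.parquet"), "a > 1")
plan_b = Filter(Scan("data.parquet"), "a > 1")
# Different structure, but it would produce the same rows as plan_a.
plan_c = Filter(Filter(Scan("data.parquet"), "a > 0"), "a > 1")

same_plan = plan_a == plan_b        # True: structurally identical
different_plan = plan_a == plan_c   # False, although the results would match
```

In this sketch, plan equality is conservative: identical plans over a deterministic source give identical results, but two dataframes with equal contents can still compare as unequal if they were built differently.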