Closed jstammers closed 3 months ago
Yup, looks like spark doesn't support uints of any kind.
I just switched from uint64s to int64s. Let me know if you still run into this problem on main. I imagine you might hit it in other places throughout mismo.
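The workaround above can be sketched as a dtype-mapping pass: downcast any unsigned integer dtypes to their signed equivalents before the schema reaches the pyspark backend. This is a minimal illustrative sketch, not mismo's actual code, and the helper name is hypothetical:

```python
# Hypothetical helper: replace unsigned integer dtypes (which Spark lacks)
# with their signed counterparts of the same width.
_UNSIGNED_TO_SIGNED = {
    "uint8": "int8",
    "uint16": "int16",
    "uint32": "int32",
    "uint64": "int64",  # e.g. record-id columns that were uint64
}

def signed_schema(schema: dict) -> dict:
    """Return a copy of a {column: dtype} schema with unsigned ints downcast."""
    return {name: _UNSIGNED_TO_SIGNED.get(dtype, dtype)
            for name, dtype in schema.items()}
```

Note that int64 can't represent the top half of the uint64 range, so this is safe for row ids generated from a counter but not for values that actually use the full unsigned width.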
Also curious, are you trying out pyspark just out of curiosity, or is duckdb not cutting it? What is the size of your task? If you wanted to add test infrastructure to test everything on spark I would really appreciate it, and it would improve the odds I won't break you in the future (as it is I'm only worrying about duckdb).
Thanks for resolving this, I'll test it out and let you know if I have any issues.
For the most part, I'm using duckdb as that can fairly comfortably handle 10Ms of blocked pairs in my experience. I have occasionally swapped to pyspark when trying to look for dupes in datasets of ~5M records while blocking on a fairly coarse feature (e.g. zipcode). This is mainly due to a lack of domain knowledge on a good set of blocking rules, which could probably be mitigated with some EDA.
I'd be happy to add some test infrastructure to support testing using a spark backend. I have some future ER projects in mind that could benefit from a distributed backend due to the number of records we have, so it would be worth ensuring that spark is fully supported by mismo.
I'm not able to cluster a pyspark dataframe due to an unsupported data type.
The error is
which makes me think that the conversion from an ibis datatype to a pyspark one is not correct.
The full traceback is
From looking into this, I think it's because pyspark doesn't support unsigned integers. Whether we'd need the full width is a different question.