NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0
14 stars 3 forks source link

Enable unit tests using pyspark backend #55

Open jstammers opened 3 months ago

jstammers commented 3 months ago

This adds an option --backend for pytest that can be used to configure the Backend for ibis. Currently, this only supports the use of pyspark And many of the tests appear to be failing.

To make the CI tests clearer, we may want to define a separate workflow to run the tests with a pyspark backend, similar to how this is done for ibis

NickCrews commented 3 months ago

Pip installing pyspark is waaaaay simpler so I think we should probably start with that and see if we run into problems, but just keep the docker idea in the back of our minds

jstammers commented 3 months ago

When you pip install a pinned version of pyspark, do you know if you are only getting a pinned python layer version, or you also are getting a pinned version of the underlying java engine?

pyspark versions mirror the underlying spark version, so pyspark 3.5.2 is a python layer on top of spark 3.5.2 etc.

I'm not sure how backwards-compatible we would want to be with this, as in my experience there can be a few changes to the API between minor versions. Wherever possible, we would expect those to be handled within the ibis backend. I know that supports back to 3.3, depending on the function (e.g. pyarrow UDFs were implemented in pyspark 3.5), but I would expect most if not all of the built-in ibis functions to be supported in pyspark 3.3.

jstammers commented 1 month ago

@NickCrews thanks for the review on this. I've implemented most of the changes you've suggested. I'd like to make use of the notyet, never, etc. marks from ibis, but struggling to think of a way to do this that doesn't directly copy the conftest.py module from there.

Will let you know when this is ready for a second review

NickCrews commented 1 month ago

sorry @jstammers, I just switched the project manager to be uv instead of PDM, you're gonna get a few merge conflicts when you rebase :(

Oof, just took a look at the conftest.py's of the various backends from ibis. Open to what you come up with, but at first glance I don't think we need to be as thorough as they are. I think we want something equivalent (but probably lighter weight) to their BackendTest class, which would

Then, we create the BackendTest instance per pytest session using pytest fixtures. Let's look at how ibis does this, I bet they have this fairly worked out.