NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

Testing: test_fs is too computationally and memory intensive #21

Closed: OlivierBinette closed this issue 8 months ago

OlivierBinette commented 9 months ago

Assigning this to myself @olivierbinette.

Pytest crashes on test_fs.py using my default Github Codespace.

I'd recommend:

Furthermore, the unit tests should be at a more granular level, e.g., testing train_ms_from_labels and train_us_using_sampling individually.

General notes (for myself, and to share how I'm trying to work): for tests to speed up development rather than get in its way, it's important to treat them as testing an API specification rather than implementation details. It's also a good idea to come up with tests before and during development, so that code ends up easily testable and development itself benefits from the tests.

It's common to do development interactively in a Python notebook before moving reusable code into a package. When doing that, we always define an example and then check results as we develop. These interactive examples/tests should be moved to unit tests as well. To help with that, the ipytest package formalizes the tests/examples we use during interactive development in a notebook.
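A minimal sketch of the spec-vs-implementation idea above: the test asserts only what the function promises (inputs to outputs), not how it works internally. `dedupe_names` is a hypothetical stand-in, not a mismo function; in a notebook, the same test function could be collected and run with ipytest (via ipytest.autoconfig() and the %%ipytest cell magic).

```python
def dedupe_names(names):
    """Return the unique names, preserving first-seen order."""
    seen = set()
    out = []
    for n in names:
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out


def test_dedupe_names_spec():
    # Spec-level assertions: observable behavior only. The test would
    # survive swapping the implementation for, say, dict.fromkeys().
    assert dedupe_names([]) == []
    assert dedupe_names(["a", "b", "a"]) == ["a", "b"]


test_dedupe_names_spec()
```

Because the test pins down behavior rather than internals, it can move unchanged from a notebook into the package's test suite.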

NickCrews commented 9 months ago

I agree all (default) tests should be fast and should work on a low-resource machine. I think it is fine to have some tests that are slow/intensive, but they should be gated behind pytest markers and not run by default: only in CI (I feel like if we only run them manually, they inevitably won't get run :) but maybe I'm a pessimist), or when explicitly requested on a dev machine.
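One way to gate a heavy test behind a pytest marker, as suggested above. The marker name "slow" and the test name are assumptions for illustration, not something mismo necessarily uses; the marker would also be registered in pyproject.toml (under [tool.pytest.ini_options] markers) and deselected by default with addopts = "-m 'not slow'", so CI can opt in with pytest -m slow.

```python
import pytest


@pytest.mark.slow  # skipped by default when addopts deselects "slow"
def test_fs_end_to_end():
    # The expensive Fellegi-Sunter end-to-end test would live here.
    ...
```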

I agree that there should be more granular tests, I would love to see those. I think an e2e test is good to have too though, perhaps it could be a different form factor that this though, like maybe we run through all of the example notebooks and check they don't crash.

I'm confused about why that test is crashing things, though. That PATSTAT dataset is only 2379 records, so it should only be generating:

  1. 10k pairs at most for sampling, and
  2. however many true pairs there are based on labels (shouldn't be that large???).

So if it's crashing, that makes me think my assumptions here are wrong, and this should get looked into.
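A back-of-the-envelope check of the numbers above, just arithmetic: with n = 2379 records, the space of candidate pairs is n*(n-1)/2, but sampling is capped at max_pairs, which is tiny by comparison. The variable names are illustrative, not mismo's API.

```python
n = 2379                       # records in the PATSTAT dataset
all_pairs = n * (n - 1) // 2   # every unordered pair of records
max_pairs = 10_000             # the sampling cap discussed above

print(all_pairs)                    # 2828631 possible pairs
print(min(all_pairs, max_pairs))    # 10000 pairs actually sampled
```

So even though the full cross-product is in the millions, sampling should only ever materialize 10k pairs, which is why a crash here is surprising.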
NickCrews commented 9 months ago

From debugging that test, it does look like only 10k pairs are generated. But maybe there are some transient memory spikes when materializing those 10k that run you out of memory?

I just bumped the max_pairs down to 1k, does that fix things for you?

OlivierBinette commented 9 months ago

@NickCrews On a default GitHub Codespace, the pytest process still crashes at the train_comparison() call in test_fs, even if I go down to 10 pairs.

Not too sure what's going on. My setup is:

  1. Launch a Codespace on this repo.
  2. Install pdm: curl -sSL https://pdm-project.org/install-pdm.py | python3 -
  3. pdm install
  4. pdm run pytest
NickCrews commented 9 months ago

Do you have any logs? I want to confirm that this is an out of memory error before assuming that and diving in.

NickCrews commented 9 months ago

Or, I guess, if it's that simple, I can try reproducing it myself.

OlivierBinette commented 9 months ago

@NickCrews I don't get any error log. The Python process just stops, and it could be something other than out of memory. I'll try to look at it more closely in a few days as well.

NickCrews commented 9 months ago

See the linked commit; if you now use main, which includes that commit, do you still see the problem?

I just tried your repro steps in my own codespace, and I see no crash/error.

OlivierBinette commented 8 months ago

@NickCrews I no longer get a crash. The test suite runs fully with no errors (except the xfail one):

[Screenshot of the passing test run, 2024-01-04]