NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0
14 stars 3 forks source link

Unable to sample when max_pairs is greater than or equal to n_possible_pairs #63

Closed jstammers closed 1 month ago

jstammers commented 1 month ago

I've been encountering errors when trying to sample pairs of blocked records if the number of pairs is not a 'nice' integer. This seems to be related to the default max_pairs value

import ibis
from mismo.block._util import sample_all_pairs
table = ibis.memtable({"record_id":[1,2,3]})
sample_all_pairs(table, table, max_pairs=1_000_000) # as used in mismo.fs.train_using_labels

Which gives the following error

File ~/Work/****/.venv/lib/python3.10/site-packages/mismo/block/_util.py:90, in sample_all_pairs.<locals>.make_pair_ids()
     89 def make_pair_ids():
---> 90     pair_ids = ibis.range(n_pairs).unnest().as_table()
     91     return pair_ids.select(
     92         __left_id=(ibis.random() * left.count()).floor().cast(\"int64\"),
     93         __right_id=(ibis.random() * right.count()).floor().cast(\"int64\"),
     94     )

File ~/.pyenv/versions/3.10.14/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

TypeError: range() missing 2 required positional arguments: 'stop' and 'step'"

when max_pairs This could be an upstream issue, as I'm not sure why stop and step would be required. I don't see an error if the default value of None is used

sample_all_pairs(table, table)

If I modify https://github.com/NickCrews/mismo/blob/a62f4fa5ccb91b2f51f035ddfa896a2cd5089b77/mismo/block/_util.py#L90 to explicitly define the range

        pair_ids = ibis.range(0, n_pairs, 1).unnest().as_table()

or cast n_possible_pairs as an int, that seems to resolve this issue

NickCrews commented 1 month ago

Fixed by #65 , thank you @jstammers !

NickCrews commented 1 month ago

CC @cpcloud, here is a little papercut in ibis where ibis.range(<float>) gives the cryptic TypeError: range() missing 2 required positional arguments: 'stop' and 'step'" error. Probably not worth changing anything, but if it's an easy fix I can file an issue with ibis.