Closed by chansooligans 1 year ago
I think I have a different idea on decoupling here: can we keep all implementation details in dedicated data storage/compute classes? That way, blocking or the other components don't need to know how data is stored or how to compute on it. Components just ask the compute class to do the computations.
That way, folks can more easily extend to other data stores/compute engines, e.g. Spark or completely in-memory pandas: we define an abstract base class that lists the functions each backend needs to implement (like size, join, sample, etc.). I can try coding up the framework today.
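To make the proposal concrete, here is a rough sketch of what such a base class could look like. Everything here is an assumption for illustration: `BaseCompute`, `InMemoryCompute`, `count_candidate_pairs` and the method signatures are hypothetical names, not existing oagdedupe code; only the method names `size`, `join` and `sample` come from the comment above.

```python
from abc import ABC, abstractmethod
from typing import Dict, List
import random

Row = Dict[str, object]


class BaseCompute(ABC):
    """Hypothetical interface: what every storage/compute backend must provide."""

    @abstractmethod
    def size(self, table: str) -> int:
        """Number of rows in a table."""

    @abstractmethod
    def sample(self, table: str, n: int, seed: int = 0) -> List[Row]:
        """Up to n rows drawn at random from a table."""

    @abstractmethod
    def join(self, left: str, right: str, on: str) -> List[Row]:
        """Inner join of two tables on a shared column."""


class InMemoryCompute(BaseCompute):
    """Toy backend holding tables as lists of dicts (stand-in for pandas, SQL, Spark)."""

    def __init__(self, tables: Dict[str, List[Row]]):
        self.tables = tables

    def size(self, table: str) -> int:
        return len(self.tables[table])

    def sample(self, table: str, n: int, seed: int = 0) -> List[Row]:
        rows = self.tables[table]
        return random.Random(seed).sample(rows, min(n, len(rows)))

    def join(self, left: str, right: str, on: str) -> List[Row]:
        # Index the right table by key, then probe it with each left row.
        index: Dict[object, List[Row]] = {}
        for r in self.tables[right]:
            index.setdefault(r[on], []).append(r)
        return [{**l, **r} for l in self.tables[left] for r in index.get(l[on], [])]


def count_candidate_pairs(compute: BaseCompute, left: str, right: str, on: str) -> int:
    """A component only talks to the interface, never to the storage itself."""
    return len(compute.join(left, right, on))
```

A component like blocking would then receive a `BaseCompute` instance and never touch the underlying storage, so swapping backends means writing one new subclass.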
yeah that's what i had in mind yesterday too with "decoupling"; i think for now, allowing sqlite compatibility will be fine and helpful for lighter implementation
But I agree with the value of extending to other data stores. I think the current implementation can be extended to Spark or pandas fairly easily. As an example, with Blocking, oagdedupe.block.blocking depends on the abstractions BaseForward, BaseConjunctions, BasePairs, etc. We can replace Forward with ForwardSQL, ForwardPandas, etc., and do the same for the other classes.
Couple things to keep in mind:
ah, remembered a problem with sqlite compatibility: some block schemes output arrays, e.g. ngrams, and sqlite has no native array type
Goal here is to allow sqlite compatibility, which will allow an in-memory setup with less overhead. Current abstractions may be able to accommodate this.
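One possible way around the array issue mentioned above, sketched under assumptions: instead of storing an array of ngrams per record, store one row per (record, ngram) in a long-format table, which needs no array type. The table name `blocks`, the column names and the helper functions are all hypothetical, not the project's actual schema.

```python
import sqlite3
from typing import Dict, List, Tuple


def ngrams(text: str, n: int = 3) -> List[str]:
    """Character ngrams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_blocks(records: Dict[int, str], n: int = 3) -> sqlite3.Connection:
    """Load one row per (record_id, ngram) into an in-memory sqlite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blocks (record_id INTEGER, signature TEXT)")
    rows = [
        (rid, gram)
        for rid, text in records.items()
        for gram in set(ngrams(text, n))  # de-duplicate ngrams within a record
    ]
    conn.executemany("INSERT INTO blocks VALUES (?, ?)", rows)
    return conn


def candidate_pairs(conn: sqlite3.Connection) -> List[Tuple[int, int]]:
    """Records sharing at least one ngram become a candidate pair."""
    sql = """
        SELECT DISTINCT a.record_id, b.record_id
        FROM blocks AS a
        JOIN blocks AS b
          ON a.signature = b.signature
         AND a.record_id < b.record_id
        ORDER BY 1, 2
    """
    return conn.execute(sql).fetchall()
```

The trade-off is a larger table (one row per ngram rather than one per record), but the self-join stays plain SQL that sqlite handles without extensions.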