sqlite compatibility - Githubissues

chansooligans / oagdedupe

Developed for use by the NY Office of the Attorney General: a Python library for scalable entity resolution that uses active learning to learn blocking configurations, generate comparison pairs, then classify matches

https://oagdedupe.readthedocs.io/en/latest/

MIT License


sqlite compatibility #103

Closed chansooligans closed 1 year ago

chansooligans commented 2 years ago

The goal here is sqlite compatibility, which would allow in-memory setup with less overhead. The current abstractions may be able to accommodate this.

block
- the array_agg and unnest functions, which would have caused problems under sqlite, are no longer needed
db
- all queries use sqlalchemy orm and should be sqlite compatible
distance
cluster

NYSAG-GS commented 2 years ago

I think I have a different idea on decoupling here: can we keep all implementation details in dedicated data storage/compute classes? That way, blocking or the other components don't need to know how data is stored or how to compute on it. Components just ask the compute class to do the computations.

NYSAG-GS commented 2 years ago

So that way, folks can more easily extend to other data stores/compute engines e.g. spark or completely-in-memory pandas: define an abstract base class that lists the functions that need to be implemented (like size, join, sample etc). I can try coding up the framework today
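
A minimal sketch of what such an abstract base class could look like, using only the standard library. The names `BaseCompute`, `SQLiteCompute`, and `InMemoryCompute` are hypothetical and not part of oagdedupe; only the method names (size, join, sample) come from the comment above, and the table layout is made up for illustration.

```python
from __future__ import annotations

import random
import sqlite3
from abc import ABC, abstractmethod


class BaseCompute(ABC):
    """Hypothetical interface: components call these and never touch storage."""

    @abstractmethod
    def size(self, table: str) -> int: ...

    @abstractmethod
    def sample(self, table: str, n: int) -> list[dict]: ...

    @abstractmethod
    def join(self, left: str, right: str, on: str) -> list[dict]: ...


class SQLiteCompute(BaseCompute):
    def __init__(self, conn: sqlite3.Connection):
        conn.row_factory = sqlite3.Row  # rows can be converted to dicts
        self.conn = conn

    # Table and column names cannot be bound as SQL parameters,
    # so only trusted identifiers should be passed in.
    def size(self, table):
        return self.conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    def sample(self, table, n):
        query = f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT ?"
        return [dict(r) for r in self.conn.execute(query, (n,))]

    def join(self, left, right, on):
        query = f"SELECT * FROM {left} JOIN {right} USING ({on})"
        return [dict(r) for r in self.conn.execute(query)]


class InMemoryCompute(BaseCompute):
    def __init__(self, tables: dict[str, list[dict]]):
        self.tables = tables

    def size(self, table):
        return len(self.tables[table])

    def sample(self, table, n):
        rows = self.tables[table]
        return random.sample(rows, min(n, len(rows)))

    def join(self, left, right, on):
        # Hash join: index the right table by key, then probe with the left.
        index: dict = {}
        for row in self.tables[right]:
            index.setdefault(row[on], []).append(row)
        return [
            {**l, **r} for l in self.tables[left] for r in index.get(l[on], [])
        ]


# Same toy data in both backends.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE blocks (id INTEGER, block TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)", [(1, "ann"), (2, "bob")])
conn.executemany("INSERT INTO blocks VALUES (?, ?)", [(1, "an"), (2, "bo")])

sql_compute = SQLiteCompute(conn)
mem_compute = InMemoryCompute(
    {
        "records": [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}],
        "blocks": [{"id": 1, "block": "an"}, {"id": 2, "block": "bo"}],
    }
)
```

A component written against `BaseCompute` would then work with either backend unchanged, which is the decoupling being proposed.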

chansooligans commented 2 years ago

yeah that's what i had in mind yesterday too with "decoupling"; i think for now, allowing sqlite compatibility will be fine and helpful for lighter implementation

But I agree with the value of extending to other data stores. I think the current implementation can be extended to spark, pandas fairly easily. As an example, with blocking, oagdedupe.block.blocking depends on the abstractions BaseForward, BaseConjunctions, BasePairs etc. We can replace Forward with ForwardSQL, ForwardPandas, etc., and do the same for the other classes.

A couple of things to keep in mind:

- the blocking functions are SQL functions, like "SELECT first_2_nchars(name)", which makes it difficult to separate implementation from storage, since the blocking logic itself is a SQL query
- keeping parallelization smooth
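
For the sqlite case specifically, one possible way around the first point is to define the blocking function in Python and register it on the connection with the standard library's `sqlite3.Connection.create_function`, so the same "SELECT first_2_nchars(name)" query still runs. This is only a sketch: the body of `first_2_nchars` below assumes it returns the first two characters of the value, which may differ from the real function, and the `df` table is made up.

```python
import sqlite3


def first_2_nchars(value):
    """Assumed behavior: first two characters of the value (NULL stays NULL)."""
    if value is None:
        return None
    return value[:2]


conn = sqlite3.connect(":memory:")
# Register the Python function under the name the SQL query expects (1 argument).
conn.create_function("first_2_nchars", 1, first_2_nchars)

conn.execute("CREATE TABLE df (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)", [(1, "alice"), (2, "alan"), (3, "bob")]
)

rows = conn.execute(
    "SELECT id, first_2_nchars(name) FROM df ORDER BY id"
).fetchall()
```

The function body stays plain Python, so in principle it could be reused by a non-SQL backend, though it would run row by row inside sqlite rather than in parallel.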

chansooligans commented 2 years ago

ah, remembered a problem with sqlite compatibility: some block schemes output arrays, e.g. ngrams, and sqlite has no native array type
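
One possible workaround, sketched here under assumptions rather than taken from the codebase: compute the array-valued scheme in Python and store one row per (id, signature) pair, which gives the same shape that unnest would produce. The `ngrams` helper, the `blocks` table, and the sample names are all hypothetical.

```python
import sqlite3


def ngrams(value, n=2):
    """Hypothetical helper: all contiguous character n-grams of a string."""
    return [value[i : i + n] for i in range(len(value) - n + 1)]


records = [(1, "anna"), (2, "ann"), (3, "bob")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blocks (id INTEGER, signature TEXT)")

# Long format: one row per n-gram instead of one array per record.
conn.executemany(
    "INSERT INTO blocks VALUES (?, ?)",
    [(rec_id, gram) for rec_id, name in records for gram in ngrams(name)],
)

# Candidate pairs are records that share at least one signature.
pairs = conn.execute(
    """
    SELECT DISTINCT a.id, b.id
    FROM blocks a
    JOIN blocks b ON a.signature = b.signature AND a.id < b.id
    ORDER BY a.id, b.id
    """
).fetchall()
```

The trade-off is that the expansion happens in Python before insertion, so it adds a round trip that the postgres array functions avoided.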