anhaidgroup / py_entitymatching

Allow different implementations of feature extraction. #148

Closed christiemj09 closed 3 years ago

christiemj09 commented 3 years ago

Currently, the function py_entitymatching.feature.extractfeatures.extract_feature_vecs() relies on a single implementation of feature extraction: it splits the candidate set into chunks and extracts features from each chunk in parallel. This works well in many situations, but on a candidate set with tens of millions of rows and tens of features, extraction can still take hours on 4-8 cores.
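For readers unfamiliar with the chunk-and-parallelize pattern described above, here is a minimal sketch of the general idea. It is not the py_entitymatching implementation: the helper names (split_into_chunks, extract_chunk, extract_in_parallel), the dict-of-callables feature format, and the use of a thread pool from the standard library are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(rows)))
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional row.
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def extract_chunk(chunk, features):
    """Compute one feature vector (a dict) per candidate pair in the chunk."""
    return [
        {name: fn(left, right) for name, fn in features.items()}
        for left, right in chunk
    ]


def extract_in_parallel(pairs, features, n_jobs=4):
    """Hypothetical driver: chunk the candidate pairs, extract per chunk, merge."""
    chunks = split_into_chunks(pairs, n_jobs)
    # Threads keep the sketch self-contained; CPU-bound feature functions
    # would normally call for worker processes instead.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map yields results in chunk order, so row order is preserved.
        results = pool.map(extract_chunk, chunks, [features] * len(chunks))
        return [vec for chunk_vecs in results for vec in chunk_vecs]


# Toy candidate set: (left value, right value) pairs.
pairs = [("apple inc", "apple"), ("ibm", "ibm"), ("acme co", "acme corp")]
features = {
    "exact_match": lambda left, right: int(left == right),
    "len_diff": lambda left, right: abs(len(left) - len(right)),
}
vecs = extract_in_parallel(pairs, features, n_jobs=2)
```

Because every chunk is processed independently, wall-clock time in this pattern is bounded by the number of workers on one machine, which is consistent with the multi-hour runs mentioned above for very large candidate sets.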

To give users some flexibility and control over how features are extracted, this PR does the following:

christiemj09 commented 3 years ago

Alternate implementations are left as future work and may be provided outside py_entitymatching.