anhaidgroup / py_entitymatching

BSD 3-Clause "New" or "Revised" License

efficiency of rule-based blocker #124

Open arunvellat opened 5 years ago

arunvellat commented 5 years ago

Hello,

It seems that the rule-based blocker iterates through all pairs of tuples from the two dataframes, which, in effect, would imply a run time of O(N * M), where N and M are the number of rows in each table. Isn't the purpose of the blocker to reduce the candidate set size, so as to avoid computations over all pairs in the Cartesian product space?

It would be great if you could clarify which of the blockers iterate through all possible pairs and which ones do not.

Thanks,

pjmartinkus commented 5 years ago

Yes, right now the rule-based blocker iterates through all tuple pairs, and one purpose of blocking is indeed to reduce the candidate set size. There are three blockers that do not compare every tuple pair in the Cartesian product space: the attribute equivalence blocker, the overlap blocker, and the sorted neighborhood blocker. There are two blockers, however, that will compare every tuple pair in the Cartesian product space: the rule-based blocker and the black box blocker. In light of your question, we will be sure to add this information to the documentation in later releases of the package.

I would also like to go into a little more detail about why we still include the rule-based and black box blockers in this package despite both of them making comparisons for all tuple pairs. While it is true that they iterate through the Cartesian product of tuple pairs, there are still two benefits to using these blockers. First, the rule-based blocker is likely still faster than matching would be, so it can still reduce overall run time. However, I would generally recommend using the rule-based blocker either on small data sets or to further block a candidate set that was created using a faster blocker. Second, the other purpose of blocking is to increase the proportion of matching tuple pairs in the candidate set. It is difficult to create a good training set when there are very few matches in any random sample you might take of the data. By using a rule-based blocker, even though the time benefits may be small or even negative, you still gain the benefit of removing many non-matching tuple pairs, which increases the proportion of matching pairs in the candidate set.
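For what it's worth, here is a minimal sketch (not from this thread) of that second workflow: run a fast blocker over the full tables, then apply a rule-based blocker only to the resulting candidate set via `block_candset`, so the rule is evaluated over |C| pairs rather than |A| x |B| pairs. The file names, attribute names, thresholds, and the auto-generated feature name used in the rule are assumptions for illustration and will differ for your data.

```python
import py_entitymatching as em

# Load the two tables and tell py_entitymatching which column is the key
# (file and attribute names here are placeholders).
A = em.read_csv_metadata('tableA.csv', key='id')
B = em.read_csv_metadata('tableB.csv', key='id')

# Step 1: a fast blocker (here the overlap blocker) avoids iterating
# over the full Cartesian product of A x B.
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, 'name', 'name', overlap_size=2,
                    l_output_attrs=['name', 'city'],
                    r_output_attrs=['name', 'city'])

# Step 2: a rule-based blocker applied only to the candidate set C,
# to further prune non-matching pairs.
block_f = em.get_features_for_blocking(A, B, validate_inferred_attr_types=False)
rb = em.RuleBasedBlocker()
# The feature name below is whichever one was auto-generated for your
# attributes; drop pairs whose name similarity is too low.
rb.add_rule(['name_name_lev_sim(ltuple, rtuple) < 0.4'], block_f)
D = rb.block_candset(C)
```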

arunvellat commented 5 years ago

Thanks for the clarification. Is the objective of increasing the proportion of matching pairs in the candidate set, and thereby making the training set more balanced, useful only in the development stage to select the optimal matcher? Would it be useful in the production stage as well, where, I assume, the optimal matcher selected in the development stage would be used?

On a related note, I saw in one of the recent papers from the group that there is a how-to guide to productionise the whole Python pipeline with Spark. However, I could not find it on the website. Could you please point me to it?

anhaid commented 5 years ago

Hi. Phil will answer the first part. For the second part, can I ask where you saw this mention of a how-to guide to productionise the whole Python pipeline with Spark? Let me know and I can follow up.

Sorry for the delay in replying. We have been very busy commercializing CloudMatcher, which is the cloud-based, closed-source part of this project. AnHai

pjmartinkus commented 5 years ago

To answer the first part: yes, the increased proportion of matching pairs is mostly useful during the development stage. It helps you judge the quality of the various matchers more accurately through their precision and recall. This is important because, if there are very few matches in the test set, then just one or two incorrect predictions can greatly affect the accuracy metrics, and they are less likely to reflect the true quality of the matcher. With a small proportion of matches, you would have to take a very large sample, requiring a lot of labeling, in order to get enough true matches to accurately judge the ML matchers.

For the production stage, a higher proportion of matches in the candidate set is mostly useful at the end, when you are evaluating the final precision and recall of the matcher. At that point, you again need to take samples from the candidate set and label them to evaluate the matcher's performance. For the same reasons as above, a higher proportion of matches means you have fewer pairs to label.
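As a rough sketch of that evaluation step (assuming a candidate set C that already has py_entitymatching metadata, and that the chosen matcher's predictions end up in a 'predicted' column; the sample size and column names are placeholders, not recommendations):

```python
import py_entitymatching as em

# Take a random sample of the candidate set and label it by hand.
# A candidate set with a higher proportion of matches needs a smaller
# sample to contain enough true matches for a reliable estimate.
S = em.sample_table(C, 450)
G = em.label_table(S, 'gold')   # opens a labeling GUI and adds a 'gold' column

# After running the chosen matcher over G to add a 'predicted' column,
# compute precision and recall against the hand labels.
eval_result = em.eval_matches(G, 'gold', 'predicted')
em.print_eval_summary(eval_result)
```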