I'm going to write up a more detailed issue about this on Friday, but I'm leaving this as a quick note that I plan to work on Dedupe core issues in roughly the following sequence:
1. Allowing 32-bit floats instead of 64-bit doubles in fastcluster
2. Improving the connected component search algorithm to make it less memory-intensive
3. Defining a test harness for testing different performance metrics
4. Using blocks as a feature for the classifier
5. Researching different approaches to sampling record pairs for active labelling
6. Researching different learning routines (connects #55)
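
For a rough sense of what the first item could buy, here is a back-of-the-envelope sketch. It assumes the pairwise distances for a cluster are held as a condensed distance matrix with one entry per record pair; the helper `condensed_nbytes` and the record count are hypothetical, chosen only for illustration.

```python
import numpy as np


def condensed_nbytes(n_records, dtype):
    """Bytes needed for a condensed pairwise distance matrix.

    Hypothetical helper: a condensed matrix stores one value for each
    unordered pair of records, i.e. n * (n - 1) / 2 entries.
    """
    n_pairs = n_records * (n_records - 1) // 2
    return n_pairs * np.dtype(dtype).itemsize


n = 10_000  # assumed size of one large connected component
bytes64 = condensed_nbytes(n, np.float64)
bytes32 = condensed_nbytes(n, np.float32)

print(f"float64: {bytes64 / 1e6:.1f} MB")
print(f"float32: {bytes32 / 1e6:.1f} MB")
```

Under these assumptions the storage halves when moving from doubles to floats; how much of that carries over in practice depends on how fastcluster allocates its working memory.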