Incremental Improvements to Dedupe Core
Background
There are a number of relatively small improvements that we would like to make to Dedupe, each of which requires a little bit of research. These improvements include:
Adjusting the clustering library to allow use of 32-bit floats instead of 64-bit floats
Replacing the connected components algorithm with one that uses less memory
Creating a performance testing suite
This represents the first and easier half of #60.
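To give a sense of what a lower-memory connected components step might look like, here is a minimal sketch based on union-find. It is not Dedupe's current implementation and not necessarily the approach that will be merged. The function name `connected_components`, the integer record labels, and the `(i, j)` edge format are all assumptions made for illustration. The idea is that memory grows with the number of records rather than the number of candidate pairs, because edges can be streamed from a generator and never stored.

```python
from array import array


def connected_components(n_nodes, edges):
    """Label each node with a component id using union-find.

    Hypothetical inputs, chosen for illustration:
      n_nodes: number of records, labelled 0 .. n_nodes - 1
      edges:   iterable of (i, j) pairs; may be a generator, so the
               full edge list never has to be held in memory
    """
    # One machine integer per node; initially every node is its own root.
    parent = array("l", range(n_nodes))

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Keep the smaller index as the root so labels are deterministic.
            if root_i < root_j:
                parent[root_j] = root_i
            else:
                parent[root_i] = root_j

    # Each node's label is the smallest node index in its component.
    return [find(x) for x in range(n_nodes)]


labels = connected_components(5, [(0, 1), (3, 4), (1, 2)])
print(labels)  # [0, 0, 0, 3, 3]
```

The only persistent state is the `parent` array, so the footprint is one integer per record regardless of how many scored pairs are fed in.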
Proposal
I propose to make incremental contributions to the Dedupe core library as a way of becoming more familiar with the library internals and developing my knowledge of C.
Deliverables
I plan to merge pull requests into Dedupe core, one for each item above.
Timeline
I expect to take roughly two months (four R&D days) to complete the issues above. The issues are ordered from least to most complex, and I expect each to take an amount of time roughly proportional to its complexity.