codeforamerica / dev

Discuss general civic development topics here. Watch this repo to follow the conversation.

[GSoC 2013] Interested in extending dedupe and working on Automated Data Matching #6

Open nilesh-c opened 11 years ago

nilesh-c commented 11 years ago

Hi Mick (@dthompson),

I am a 3rd-year undergraduate student of computer science, pursuing my B.Tech degree at RCC Institute of Information Technology. I am proficient in Java, PHP, and C#. I am familiar with Python and am brushing it up for this project.

Among the project ideas on the GSoC 2013 ideas page, the one that interests me most is "Automated Data Matching", and I want to work on it. I also cloned the dedupe repo and played around with the examples and the API - it seems really interesting, and the wiki documentation is excellent. I'm planning on extending it with more string metrics and matching algorithms too.

I am passionate about data mining, big data, and recommendation engines, so this idea naturally appeals to me a lot. I have experience building music and people recommendation systems and have worked with Myrrix and Apache Mahout. I recently designed and implemented such a recommendation system and deployed it on a live production site (where I am interning) to recommend/match Facebook users to each other based on their interests and statuses - it uses semantic matching by mining Wikipedia data dumps.

I think it's a good idea to build the automated matcher using dedupe (and other open-source libraries if needed), followed by simple REST-based and command-line APIs for it. The basic functionality of the matcher should be fairly easy to set up given the code that already exists in dedupe. I would like to ask a few questions - the clarifications will help me prepare for what needs to be done.

  1. Apart from building the automated data matcher, I also intend to work on Dedupe Issue #16 and implement new string similarity metrics like Levenshtein distance, Damerau-Levenshtein distance, cosine similarity, Soundex, etc. (a quick sketch of one such metric follows this list). Do you think this is within the scope of a GSoC project this summer? I think it will make the matcher more intelligent, with a wide array of options.
  2. As far as I understand, dedupe currently has no parallel-processing capabilities; there is even an issue for that at Dedupe Issue #111. I think I can try adding some parallel processing, maybe integrating the code with Hadoop and MapReduce. What are your thoughts on this?
  3. Given my interest in multiple closely related ideas, completing everything within the time constraints of the GSoC summer might not be feasible. Should I include all my plans in the proposal (since I intend to continue contributing beyond the summer), or should I confine the proposal to only what I intend to deliver during the summer?
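For item 1, here is the kind of thing I have in mind - a plain dynamic-programming Levenshtein distance. This is just a standalone sketch; the function name and how it would hook into dedupe's comparator API are hypothetical:

```python
def levenshtein(a, b):
    """Textbook dynamic-programming Levenshtein (edit) distance.

    Returns the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


assert levenshtein("kitten", "sitting") == 3
```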

Please share your views and ask me if you have any questions. Looking forward to hearing from you very soon. :-)

Cheers, Nilesh

mick commented 11 years ago

@nilesh-c

Really like your enthusiasm for this project; I think it could be very useful.

I think it's great that you are already thinking about how you might contribute to dedupe! For your proposal, you should be clear about what you plan to do this summer for GSoC, and make sure that it is feasible. Try to be specific about what you plan on working on.

fgregg commented 11 years ago

Wow! This would be very exciting for us, the dev team behind dedupe. Mark Huberty has implemented a bunch of edit distances in the dev branch, but we can definitely use more. Parallel processing would be really great and feasible to tackle over the summer (IMHO).

nilesh-c commented 11 years ago

Thank you @dthompson and @fgregg! I'm pretty excited. Sorry about the delay, I got caught up with university lab examinations and assignments. I'll submit my initial proposal via Melange in a day and let you know in this thread, so that you can give some feedback on how I can improve it.

nilesh-c commented 11 years ago

@dthompson, I have a question regarding the proposal template: is it preferred that I keep the template structure exactly as it is (as on this page: https://google-melange.appspot.com/gsoc/org/google/gsoc2013/codeforamerica) and just "fill in" my details? Or may I take the liberty of modifying it a bit here and there and tidying it up, while of course ensuring that all the requested information is provided?

nilesh-c commented 11 years ago

@fgregg Please check my post on the open-source-deduplication group here: https://groups.google.com/forum/?fromgroups=#!topic/open-source-deduplication/y_NUg3-k6jk

I'm facing some problems with the examples.

nilesh-c commented 11 years ago

@fgregg, I have a question about the different phases (seen from a high level) of de-duplication in dedupe. I need to be sure about them because parallelizing each phase will need a different approach. Am I correct in saying that the following are the basic phases (that can be processed in parallel)? A toy sketch of phases (iv) and (v) follows the list.

i) Pre-processing/cleaning (Easy to run in parallel)

ii) Active learning (can be run in parallel; a similar implementation is available. How expensive is this, memory- and CPU-wise? Do you think this phase could become a bottleneck if we let it run sequentially for, say, 10M records?)

iii) Learning good blocking rules (how expensive is the greedy algorithm for this, memory- and CPU-wise? Any ideas on how to do it with MapReduce? Could this phase become a bottleneck if run sequentially for, say, 10M records?)

iv) Blocking (Planned. Easy to run in parallel)

v) Applying string similarity functions on individual blocks (Planned. Can be run in parallel)

vi) Clustering (can be run in parallel; a similar implementation is available)

Did I miss anything in the de-duplication flow?
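To make (iv) and (v) concrete, here is a toy sketch of what I mean; the blocking key and the comparison function are made up for illustration and are not dedupe's actual API. Because each block is independent, these phases parallelize naturally:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Hypothetical predicate: first three characters of the name field.
    return record["name"][:3].lower()

def similarity(rec_a, rec_b):
    # Placeholder comparator; a real one would use Levenshtein, Jaro-Winkler, etc.
    words_a = set(rec_a["name"].split())
    words_b = set(rec_b["name"].split())
    return len(words_a & words_b) / max(len(words_a), len(words_b))

def block(records):
    """Phase (iv): group records by a cheap key so we only compare within groups."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks

def score_blocks(blocks, threshold=0.5):
    """Phase (v): compare all pairs inside each block; blocks are independent."""
    candidates = []
    for members in blocks.values():
        for rec_a, rec_b in combinations(members, 2):
            score = similarity(rec_a, rec_b)
            if score >= threshold:
                candidates.append((rec_a["id"], rec_b["id"], score))
    return candidates
```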

fgregg commented 11 years ago

Those are the big parts.

Right now, comparing strings is the biggest bottleneck, and it promises the largest benefit from parallelization. For the learning stages, it doesn't make a lot of sense to think about parallelization, because the bottleneck is user labeling, which won't allow the training set to ever get that big.

Just a note: although I would love to have more folks working on dedupe, you need to make sure your proposal responds to Code for America's needs, as they are the sponsoring org. Make sure @dthompson thinks this is the right track; it seems like CFA wants to build facilities that make dedupe easier to use, while the main benefit of parallelization would be to make dedupe more powerful.

nilesh-c commented 11 years ago

Thanks, Forest.

I have studied a few research papers and picked a method with which I can immediately start working on (iv) and (v); it's solid and I expect good results. Since string matching is the bottleneck, I think it should be the first focus. I have almost finished my initial proposal; I just need to make some changes once I confirm with @dthompson the important point that you mentioned.

In my proposal I have three objectives: 1) parallelizing single-dataset de-duplication with dedupe; 2) implementing de-duplication (also parallel) across two datasets, which requires a few changes to 1; and 3) building a REST-based API and a web interface/frontend for 1 and 2.

@dthompson, do you think this is okay? If you feel that the current focus for GSoC should only be making dedupe easier to use and adding support for dual-dataset de-duplication, I can omit the parallelization bit. I have already included all three in my proposal; I will modify and post it based on your reply.

nilesh-c commented 11 years ago

I also feel that being able to run the backend of the data matcher over multiple cores or machines would make it truly scalable - capable of processing huge datasets and making changes visible on the frontend in a reasonable amount of time.

mick commented 11 years ago

@nilesh-c if you want to include working on making dedupe more powerful, that sounds great. You should make sure to work with @fgregg on how that can best be accomplished.

nilesh-c commented 11 years ago

@dthompson - awesome. Thanks.

nilesh-c commented 11 years ago

@dthompson @fgregg - I have submitted my initial proposal. Please let me know if there are changes I can make to improve it. Just ask if you need any clarifications.

nilesh-c commented 11 years ago

I just updated my proposal - the link to the "Efficient Parallel Set-Similarity Joins Using MapReduce" paper was broken. I fixed it. :)

nilesh-c commented 11 years ago

@fgregg Please check out this page and scroll down to the section on performance comparison results: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/

The test map-reduce code works on the Google n-grams dataset, which is textual data. Java+Hadoop seems to be about 50% faster than Python+Hadoop Streaming. Dedupe's code would be pretty much CPU-bound and textual (there wouldn't be any need to add binary data types in the future, I suppose? Hadoop Streaming doesn't perform well with binary data). So we can assume that, all things being equal, Java may be about 50% faster than Python. Then again, mixing Java and Python has its own overheads. So the feasible options before us are: i) Python with Hadoop Streaming (which I think I might go with), or ii) Java. What do you think?
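For reference, option (i) would look roughly like this. Hadoop Streaming just pipes lines through stdin/stdout, so the mapper and reducer are plain Python; the blocking key, file layout, and job invocation below are only illustrative:

```python
#!/usr/bin/env python
"""Toy Hadoop Streaming job: block records by key, then emit candidate pairs.

Illustrative invocation (paths and jar location are not real):
  hadoop jar hadoop-streaming.jar \
      -input records.tsv -output candidates \
      -mapper "blocking_job.py map" -reducer "blocking_job.py reduce" \
      -file blocking_job.py
"""
import sys
from itertools import groupby


def mapper(stream):
    # Emit (blocking_key, record) pairs, tab-separated, one per line.
    for line in stream:
        record = line.rstrip("\n")
        name = record.split("\t")[0]       # assume the name is the first column
        print("%s\t%s" % (name[:3].lower(), record))


def reducer(stream):
    # Hadoop delivers lines sorted by key, so groupby recovers the blocks.
    keyed = (line.rstrip("\n").partition("\t") for line in stream)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        block = [record for _, _, record in group]
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                # A real reducer would score each pair with a string comparator.
                print("%s\t%s" % (block[i], block[j]))


if __name__ == "__main__":
    if sys.argv[1:] == ["map"]:
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)
```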

fgregg commented 11 years ago

First, I would want an implementation of parallelism that can take advantage of multiple cores on the same machine, probably using the multiprocessing library. Of course, we should design this so it's easy to extend to multiple machines.

The actual string comparators are written, and will continue to be written, in C (actually Cython that is compiled to C), so any solution will have to support that. I would need a very compelling reason to bring Java into the mix.

nilesh-c commented 11 years ago

I understand. Doing it with Apache Hadoop would let it take advantage of multiple cores on a single machine, and it could also be run in a distributed manner on multiple machines (clusters) seamlessly. I was just making sure which language to use with Hadoop. I will stick to Python.

Now, another question:

Would it be better to make the parallel dedupe (with a dependency on Apache Hadoop) a separate project/repo altogether? If someone just wants to use dedupe quickly on a small dataset on their laptop and doesn't feel like installing Hadoop, they can clone the normal dedupe; if parallelism is needed, they can clone the Hadoop-based repo, install the dependencies with a script I will provide, and proceed.

The problem with the above idea is that developers would have to maintain two separate codebases. Alternatively, we could keep both implementations in the current repo and let the user choose before running. This could get a bit complicated, I suppose, but it's probably worth it. What do you suggest?

fgregg commented 11 years ago

I would prefer to see how far we can go with Python's multiprocessing library, which is part of the standard library, before accepting that we need to use Hadoop.

If multiprocessing is acceptable, the same codebase can be used for one CPU or hundreds.

If multiprocessing alone is insufficient, then I would still prefer that we go with a Python solution: http://wiki.python.org/moin/ParallelProcessing

nilesh-c commented 11 years ago

Alright. If we set Hadoop aside for now (though even with Hadoop we wouldn't have needed to write Java code anyway) and look for pure Python options, we've got a few to choose from:

i) A raw standard-library map-reduce implementation using multiprocessing.Pool (useful for initial testing, but not good for much else; no cluster or distributed file system)

ii) octo.py or mincemeat.py - very simple map-reduce implementations built on the Python standard library. They just work, but they don't compare with high-quality implementations. The main problems are: a) there is no distributed file system; b) all data is stored in memory, so huge datasets won't fit; and c) all data is sent over the network between nodes, so these frameworks don't scale well for huge datasets.

iii) Disco is a much more complete implementation, which I'd be more inclined to choose. It has lots of documentation and is well developed. It combines Python (for writing jobs) with Erlang code driving the core and the distributed file system. It's a nice framework and, benchmarks suggest, is probably in the same performance ballpark as Hadoop.

Once we decide what to use, I may start with a bit of prototyping.
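To give a feel for option (iii), here is a rough sketch of a Disco job for the blocking/comparison step, adapted from Disco's standard word-count example; the input tag and the "comparison" in the reducer are placeholders:

```python
from disco.core import Job, result_iterator

def map(line, params):
    # Emit (blocking_key, record); toy key = first three characters of the line.
    record = line.strip()
    yield record[:3].lower(), record

def reduce(iter, params):
    from disco.util import kvgroup
    for key, records in kvgroup(sorted(iter)):
        block = list(records)
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                # A real reducer would score each pair with a string comparator.
                yield "%s | %s" % (block[i], block[j]), key

if __name__ == '__main__':
    job = Job().run(input=["tag://records"],   # placeholder DDFS tag
                    map=map,
                    reduce=reduce)
    for pair, key in result_iterator(job.wait(show=True)):
        print(pair, key)
```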

nilesh-c commented 11 years ago

@fgregg - Shall I go with Disco? I think it's in line with what you want. It's also on the Python ParallelProcessing page you mentioned.

nilesh-c commented 11 years ago

@fgregg - Please check my forked repo of dedupe here: https://github.com/nilesh-c/dedupe. I have added support for running core.scoreDuplicates in parallel using multiprocessing.Pool and Pool.map(). I just tried this out to measure the execution time difference; the benchmark is here: https://github.com/nilesh-c/dedupe/blob/master/parallel_report.md

The basic Pool.map() would be good for testing out prototype map-reduce code (probably not a good idea for serious work with big datasets). I can later integrate it with Disco to support distributed processing and the works.
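For anyone following along, the idea is essentially this - a simplified, standalone sketch rather than the actual code in my fork; score_pair stands in for dedupe's learned comparator and record_pairs for its candidate pairs:

```python
from multiprocessing import Pool

def score_pair(pair):
    # Stand-in for dedupe's learned comparator applied to one candidate pair.
    (id_a, rec_a), (id_b, rec_b) = pair
    score = 1.0 if rec_a == rec_b else 0.0   # placeholder comparison
    return (id_a, id_b), score

def score_duplicates_parallel(record_pairs, processes=4, chunksize=1000):
    """Fan the candidate pairs out over a worker pool and collect the scores.

    Pool.map keeps all results in memory, so this is fine for prototyping
    but not for very large candidate sets.
    """
    with Pool(processes=processes) as pool:
        return pool.map(score_pair, record_pairs, chunksize=chunksize)

if __name__ == "__main__":
    pairs = [((1, "john smith"), (2, "john smith")),
             ((1, "john smith"), (3, "jon smyth"))]
    print(score_duplicates_parallel(pairs, processes=2))
```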

fgregg commented 11 years ago

This is really cool! I'm waiting until Code for America decides who they want to go with. They should have their decision soon.

Best,

Forest

nilesh-c commented 11 years ago

Thanks Forest! OK, sounds good.

webmaven commented 10 years ago

Was this work ever completed?

fgregg commented 10 years ago

Which part?

webmaven commented 10 years ago

The parallelization, mostly.

But I am also curious as to whether any of his work on the fork at https://github.com/nilesh-c/dedupe was ever merged back.

fgregg commented 10 years ago

Yup, parallelization has been implemented. http://dedupe.readthedocs.org/en/v0.5.3.0/API-documentation.html#Dedupe

nilesh-c commented 10 years ago

Ah, I wanted to see it happen too. Pretty cool. :)