codeforamerica / dev

Discuss general civic development topics here. Watch this repo to follow the conversation.

GSOC 2013 Proposal - Automated Data Matching #3

Open nikitsaraf opened 11 years ago

nikitsaraf commented 11 years ago

Hello

I am Nikit Saraf, a sophomore Computer Science undergraduate at Dhirubhai Ambani Institute of Information and Communication Technology, India.

I was going through the ideas list and found "Automated Data Matching" particularly interesting. I have downloaded and installed dedupe and worked through a couple of the examples.

But I don't have a clear idea of the project's scope yet, and consequently have a few questions about it.

  1. What kind of data are we looking to deduplicate? Is it the same kind that dedupe uses in its examples? Could I get some sample data to play with?
  2. I believe the data source matters even more than the data itself. What kinds of data sources would we be dealing with?
  3. Do we just need to extend the dedupe project, build an interface over it, and release it as a standalone library that anyone can use?

Pardon me if the above questions seem too obvious.

nilesh-c commented 11 years ago

Hi, I'm a third-year Computer Science student pursuing my Bachelor's degree at RCC Institute of Information Technology, India. I'm interested in this too and have the same doubts. Some clarification of what types of data we are looking at would be very helpful, especially if someone could provide example datasets in CSV/XML format to play with and get a feel for the problem.

Cheers, Nilesh

mick commented 11 years ago

@nikitsaraf @nilesh-c

Let me help clarify.

  1. The focus of the project right now is matching records from two datasets based on an address field found in both. Many of the datasets we have been working on with cities do not match up cleanly with data from other sources, or with data from other departments in the same city. Matching datasets on address is just a place to start; the tool could also prompt the user to select which columns they would like to match on.
  2. For this project, let's assume the data sources are all CSV files. We could support other sources as well, but CSV files are the common case for government data.
  3. Building this as a tool that includes dedupe as a dependency makes the most sense, I think. Dedupe is a powerful tool, so making it easier to use would be great.
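
If it helps make the pieces concrete, here is a rough sketch of how a tool like this could wrap dedupe's record-linkage interface for the two-CSV, match-on-address case. The file names and field choices are placeholders, and dedupe's API has changed across releases, so treat the exact calls as illustrative rather than definitive.

```python
# Rough sketch: link two CSVs on an address column using dedupe.
# File names are hypothetical and the API details vary by dedupe version.
import csv
import dedupe

def read_csv(path):
    """Load a CSV into the {record_id: {column: value}} mapping dedupe expects."""
    with open(path) as f:
        return {i: row for i, row in enumerate(csv.DictReader(f))}

businesses = read_csv('businesses.csv')            # hypothetical input file
inspections = read_csv('restaurant_scores.csv')    # hypothetical input file

# Tell dedupe which column(s) to compare; a UI could let the user pick these.
fields = [{'field': 'address', 'type': 'String'}]

linker = dedupe.RecordLink(fields)
linker.prepare_training(businesses, inspections)

# Interactive labeling in the console; a friendlier UI would replace this step.
dedupe.console_label(linker)
linker.train()

# Pairs of records judged likely to refer to the same entity.
matches = linker.join(businesses, inspections, threshold=0.5)
print(len(matches), "candidate links found")
```

The column-selection and labeling steps are exactly the parts a friendlier interface on top of dedupe could expose to non-programmers.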

nikitsaraf commented 11 years ago

Hi Mick!

Thank you so much for your prompt reply and helping me clarify my doubts.

As I said before, I have dedupe installed and running on my system. I tried a couple of examples on its sample data, and it is fairly easy to use without any complications.

Can you provide some more details on the use case for this tool? Who will use it? (This would help me decide whether to build a web-based tool or a Python tool with a simpler user interface.) That way I can start thinking about the user interface and the level of abstraction the tool should provide.

Also, if you can provide me with some of your sample data, I can test it with dedupe and check whether it can serve our use cases.

mick commented 11 years ago

@nikitsaraf,

The use case we have been talking about is for the user to run this all in their browser, so they don't even need to install a tool. It should be flexible enough to allow the user to select which columns to match on, provide training, and work through manual matching if needed (that last part might make more sense as a separate tool).
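
As one possible shape for that, here is a minimal sketch of a browser-based upload and column-selection step. Flask is just an example choice of framework (the thread doesn't prescribe one), and the route and form-field names are assumptions for illustration.

```python
# Minimal sketch of a browser-based upload / column-selection step.
# Flask, the route, and the form-field names are illustrative assumptions.
import csv
import io
from flask import Flask, request, render_template_string

app = Flask(__name__)

UPLOAD_FORM = """
<form method="post" enctype="multipart/form-data">
  <input type="file" name="left">
  <input type="file" name="right">
  <button type="submit">Upload</button>
</form>
"""

@app.route('/', methods=['GET', 'POST'])
def upload():
    if request.method == 'POST':
        # Read only the header rows so the user can pick which columns to match on.
        columns = {}
        for name in ('left', 'right'):
            text = request.files[name].read().decode('utf-8')
            columns[name] = next(csv.reader(io.StringIO(text)))
        # A real tool would render checkboxes here and pass the chosen columns
        # (plus the uploaded data) on to dedupe's training and matching steps.
        return {'columns': columns}
    return render_template_string(UPLOAD_FORM)

if __name__ == '__main__':
    app.run(debug=True)
```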

Dedupe ships with some sample data you can get started with. But if you want something more advanced, I'd suggest grabbing two datasets off https://data.sfgov.org/ (or another city's open data portal) that include addresses that should match, like all businesses vs. restaurant inspection scores.
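
Whatever pair of datasets gets pulled from a portal like that, the address columns will usually need some light normalization before dedupe sees them, so trivial formatting differences don't drown out the real matches. A small sketch, with hypothetical file and column names:

```python
# Sketch of light address clean-up before matching.
# File and column names are placeholders for whatever portal datasets are used.
import csv
import re

def normalize_address(value):
    """Lowercase an address, strip punctuation, and collapse whitespace."""
    value = value.lower().strip()
    value = re.sub(r'[^\w\s]', '', value)
    return re.sub(r'\s+', ' ', value)

def load_records(path, address_column):
    """Return {record_id: row} with a cleaned 'address' field added to each row."""
    records = {}
    with open(path) as f:
        for i, row in enumerate(csv.DictReader(f)):
            row['address'] = normalize_address(row.get(address_column, ''))
            records[i] = row
    return records

businesses = load_records('registered_businesses.csv', 'Street Address')      # hypothetical
inspections = load_records('restaurant_inspections.csv', 'business_address')  # hypothetical
```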

nikitsaraf commented 11 years ago

@dthompson I have submitted my proposal. Please review it and let me know if anything needs clarification.