codeforamerica / Data-Wiki

A Drupal-based front end for merging and improving datasets
GNU General Public License v2.0

workflow: merging #33

Open chachasikes opened 12 years ago

chachasikes commented 12 years ago

Now that we have examples of data, the problem of merging is one we get to deal with.

In this case, we are mashing up contributed data -- from CSV files (existing data sets) as well as individual submissions of records.

The awesome thing is that we now have a bunch of data -- and when the same data is shared multiple times, that resource can be 'flagged' as one that is already off the ground and running, with extra recommendations behind it.

A future tool that can help -- having all data in spreadsheet format will allow it to be more quickly merged and sorted by humans in a tool like Google Refine.

In essence, people are contributing the location and topic information for existing communication channels in a city. The idea is to let citizens and neighborhoods recommend a website that is actually important to them -- something that would not necessarily be captured by web searches or marketing, because the power is in the referral.

So they should be able to recommend 1 channel - like 'this FB page is actually awesome.'

But what is happening is that 5 people recommend the same resource 5 times -- though sometimes they share the FB page, sometimes the website, sometimes a mailing list. Each of these is a 'communication channel' or 'group format.'

Thing is -- there are like 40 formats of channels, and more emerging all the time. So I made a taxonomy of those formats, and each record is stored with its associated information.

But it makes some sense to provide a 'master record': 1 record, 10 communication channels, each one labelled by format.

Translate that to a CSV... you have to make 10 columns and store the data for each channel like channel_facebook, channel_twitter...
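To make the awkwardness concrete, here is a minimal sketch (column and record names are made up for illustration) of melting that wide, one-column-per-channel CSV back into the simple (name, format, url) schema:

```python
import csv
import io

# Hypothetical wide-format CSV: one channel_* column per format,
# as described above. Column names are invented for illustration.
wide_csv = io.StringIO(
    "name,channel_facebook,channel_twitter,channel_mailing_list\n"
    "Oakland Gardeners,https://facebook.com/oakgard,,list@oakgard.org\n"
)

# Melt each row into long-format rows, one per non-empty channel
# column, so the schema stays just (url + type of communication).
long_rows = []
for row in csv.DictReader(wide_csv):
    for column, url in row.items():
        if column.startswith("channel_") and url:
            long_rows.append({
                "name": row["name"],
                "format": column[len("channel_"):],
                "url": url,
            })

for r in long_rows:
    print(r["name"], r["format"], r["url"])
```

With ~40 formats this means ~40 mostly-empty columns per row, which is why the wide layout feels doomed.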

That seems nutty and doomed to die, and it is kind of a horrible format for people who are being asked to translate an existing database into a relatively simple schema (url + type of communication).

A document database would be handy.

Parents and ancestors seem like the way to go -- parent + each format is a child. But you can't really work with CSV files like that.
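As a sketch of the parent/child idea in a document database, a 'master record' might look like the structure below (field names and values are hypothetical, not the actual schema):

```python
import json

# Hypothetical 'master record': one parent, each communication
# channel stored as a child entry labelled by its format term
# from the taxonomy.
master_record = {
    "title": "Oakland Community Gardeners",
    "location": "Oakland, CA",
    "channels": [
        {"format": "facebook", "url": "https://facebook.com/oakgard"},
        {"format": "twitter", "url": "https://twitter.com/oakgard"},
        {"format": "mailing_list", "url": "list@oakgard.org"},
    ],
}

# Unlike a wide CSV, a newly emerging format just becomes another
# child entry -- no new columns needed.
print(json.dumps(master_record, indent=2))
```

The trade-off is exactly the one noted above: this nests naturally in a document store but flattens badly back into CSV.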

I did happen to add a flag feature to all content, so it is possible to download all of the duplicates. (Simple computer merging won't work, because these are actually people's recommendations being shared -- the content needs to be read and then balanced out by a real person.)
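A small sketch of how the flagged duplicates could be grouped for that human review step -- clustering submissions by a crudely normalized URL so a person sees each resource's recommendations together (the data and the `normalize` helper are both hypothetical):

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical flagged submissions: the same resource recommended
# several times, sometimes via slightly different URLs.
submissions = [
    {"contributor": "alice", "url": "https://facebook.com/oakgard"},
    {"contributor": "bob",   "url": "http://facebook.com/oakgard/"},
    {"contributor": "carol", "url": "https://oakgard.org"},
]

def normalize(url):
    # Crude normalization for grouping only -- a real person still
    # reads and merges each cluster, as described above.
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parts.path.rstrip("/")

clusters = defaultdict(list)
for s in submissions:
    clusters[normalize(s["url"])].append(s)

for key, group in clusters.items():
    print(key, "->", [s["contributor"] for s in group])
```

This only prepares the clusters; the actual merge decision stays with the human reader.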

And I have UUIDs (hashes), and almost had revision hashes.
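For what a UUID plus revision hash could look like in practice, a minimal sketch (the record fields are made up; this is not the site's actual scheme): the UUID is the record's stable identity, while the revision hash changes whenever the content changes, so edits and merges can be traced.

```python
import hashlib
import json
import uuid

# Hypothetical record with the simple (url + type) schema.
record = {"url": "https://oakgard.org", "format": "website"}

# Stable identity, assigned once when the record is created.
record_uuid = str(uuid.uuid4())

# Content-derived 'revision hash': any edit to the record's
# fields produces a different hash.
revision_hash = hashlib.sha256(
    json.dumps(record, sort_keys=True).encode()
).hexdigest()

print(record_uuid, revision_hash[:12])
```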

Also we don't totally want to lose the value of the fact that the piece of data was contributed.

Not sure what a librarian would do here. One of these needs might have to budge, and/or we'll have to decide on short-term solutions. But before we get to the short-term solution, it's worth trying to think it through.