Census Similarity

A small set of commands for finding similarity between data sets

Installation with Docker

The simplest installation is to use the automatically-built Docker image. Docker can be thought of as a very fast virtual machine that carries all of its environment dependencies with it; it communicates with the outside world through stdin and stdout.

To install/run using Docker, you'll need to have Docker installed and then preface each command (see Usage) with

docker run -i --rm 18fgsa/census-similarity

That tells Docker to accept input from stdin (-i) and remove the container after execution (--rm). As that's a bit of a mouthful, we recommend storing it in a shell variable:

census="docker run -it --rm 18fgsa/census-similarity"

# e.g.
cat some.csv | $census group_by
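
If you'd prefer the unprefixed style, a shell alias (bash/zsh) works as well; this is just a convenience, not something the project requires:

alias census="docker run -i --rm 18fgsa/census-similarity"
cat some.csv | census group_by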

Installation via Python

We'll assume you've already cloned this repository, installed Python 3, and set up a virtualenv.
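
If you haven't, the steps look roughly like this (a sketch assuming a Unix-like shell and Python's built-in venv module, with the repository's GitHub URL):

git clone https://github.com/18F/census-similarity.git
cd census-similarity
python3 -m venv venv
source venv/bin/activate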

Next, try

pip install -r requirements.txt

If this fails, the error message will usually point to a missing C library. You may also try the installation instructions provided by SciPy.
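
On Debian-based systems, for example, the missing pieces can often be installed with something like the following (package names are an assumption and vary by platform):

sudo apt-get install build-essential gfortran python3-dev libopenblas-dev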

Usage

This application consists of several small utilities meant to be strung together into a useful result. We provide some examples below, but we can't list all of the relevant permutations: not only can the utilities be combined in different ways, but several also have additional parameters that users may want to tune to their dataset.

Commands

Execute a command with the --help flag for a more thorough description, but at a high level:

cluster_by_field: clusters rows by string similarity of a chosen field, adding each row's cluster id as a new field
group_by: collapses rows that share a field value into a single row, optionally dropping small groups (--min-group-size) and accumulating another field's values (--accumulation-field)
lookup: replaces values in a field with corresponding values from a second CSV file (--lookup-file)
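
For example, to see all of the options the first command accepts:

cluster_by_field --help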

Examples

For these examples, we'll assume two files:

dataset.csv: one row per dataset, including id and name fields
vars.csv: one row per field (variable) within a dataset, including vname (the field's name) and dsids (the id of the dataset it appears in)

We'll be trying to find similar datasets.
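
For concreteness, the two files might look something like this (illustrative rows only, not real Census metadata):

dataset.csv:

id,name
1,National Survey of Fishing
2,Natl. Survey of Fishing & Hunting
3,Survey of Business Owners

vars.csv:

vname,dsids
SSN,1
First Name,1
ssns,2
name_frst,2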

Clustering by dataset name

First, let's take a very simple approach: we'll cluster the datasets by looking for similar dataset names:

cat dataset.csv | cluster_by_field | group_by --min-group-size 2

Let's break that down:

cat dataset.csv streams the file into the pipeline
cluster_by_field clusters the rows by string similarity of the dataset name (its default field here), adding a cluster id to each row
group_by --min-group-size 2 collapses rows that share a cluster id into a single row, discarding any cluster with fewer than two members

The output of the above will include one row for each dataset cluster, listing the ids within that cluster. While very useful, a bare list of ids isn't a very nice visual. If we wanted to collect dataset names instead, we'd run:

cat dataset.csv \
  | cluster_by_field \
  | group_by --min-group-size 2 --accumulation-field name

A second method of achieving a similar result would be to "lookup" the dataset names by referencing their ids:

cat dataset.csv \
  | cluster_by_field \
  | group_by --min-group-size 2 \
  | lookup --lookup-file dataset.csv --source-field id

Finally, let's tweak our results: we'll try a slightly different distance metric and loosen our definition of similarity. Warning: this will run much more slowly:

cat dataset.csv \
  | cluster_by_field --eps 0.3 --min-samples 4 --distance-metric cosine \
    --field-split trigram \
  | group_by --min-group-size 2 --accumulation-field name

Clustering datasets by their columns

As a more complicated example, let's try finding similar fields within datasets, then cluster the datasets based on those shared fields. For example, if we had two datasets, one with fields "SSN" and "First Name" and another with "ssns" and "name_frst", we'd want to mark those two datasets as "related".

cat vars.csv \
  | cluster_by_field --field vname --group-field field_cluster \
  | group_by --group-field dsids --accumulation-field field_cluster \
  | cluster_by_field --field field_cluster --field-split comma \
    --group-field dataset_cluster \
  | group_by --min-group-size 2 --group-field dataset_cluster \
    --accumulation-field dsids \
  | lookup --lookup-file dataset.csv --source-field dsids

Let's break that down -- we'll skim over pieces explained in the previous example:

cluster_by_field --field vname --group-field field_cluster clusters the individual field names (vname), writing each one's cluster id to a new field_cluster column
group_by --group-field dsids --accumulation-field field_cluster collapses those rows by dataset (dsids), accumulating the set of field clusters each dataset contains
the second cluster_by_field then clusters the datasets themselves, comparing their comma-separated lists of field clusters (--field-split comma) and writing the result to dataset_cluster
the final group_by keeps only dataset clusters with at least two members, accumulating the dataset ids in each
lookup swaps those ids for the matching dataset rows in dataset.csv

Next steps

This project was put together as a proof of concept. While basic functionality exists, it is by no means complete.

From the functionality perspective, this application has focused on string similarity as the core metric. We can build layers on top of that (e.g. clustering datasets by their fields, per the example above), but other avenues of inspection might be more helpful. For example, the Census datasets have relations (datasets may have "parents"); we've ignored that structure altogether. Similarly, we've ignored field types and other metadata which may have been useful (when properly weighted). More importantly, we're only working with the metadata about these datasets at the moment; clustering on the data proper would likely prove more fruitful.

From the technical perspective, our young app has already picked up some baggage. Most notably, it is missing thorough code review (hopefully to be remedied soon) and automated tests. The existing code quality is fine for a quick pilot, but would be worth improving in a longer-term project. In particular, we'd recommend replacing much of the existing custom functionality with a framework like pandas, which provides well-tested, efficient libraries for many of these problems.
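
As a rough illustration of what that could look like, here's a pandas sketch of the group_by step from the first example. This is not part of the current codebase, and the "cluster" column name is hypothetical, standing in for whatever cluster_by_field writes:

import pandas as pd

# Hypothetical input: rows that cluster_by_field has already tagged with a
# cluster id (here, a "cluster" column) alongside the dataset "name" column.
df = pd.read_csv("clustered.csv")

# Rough equivalent of `group_by --min-group-size 2 --accumulation-field name`:
# drop clusters with fewer than two members, then collect the names in each.
sizes = df.groupby("cluster")["name"].transform("size")
groups = df[sizes >= 2].groupby("cluster")["name"].agg(", ".join)

print(groups)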

Contributing

See CONTRIBUTING for additional information.

Public domain

This project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.