Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

Agency Table Deduplication #17

Open mbodeantor opened 1 year ago

mbodeantor commented 1 year ago


Description

Duplicate, incorrect entries

The database seems to be populated with agencies from different states that share a town name, but the homepage URLs for each are the same. This may cause links to be matched to agencies in the wrong state. Listed below are a few of the ones I found while working:

- Knoxville TN, IL (Airtable link)
- Harrisburg PA, AR, IL, OH (Airtable link)
- East Hampton NY, CT (Airtable link)
- La Porte TX, IN (Airtable link)
- New Castle WA, NE (Airtable link)
- Greenville SC, IN, ME (Airtable link)
- Stockton CA, KS (Airtable link)
- Altoona PA, IA (Airtable link)
- Beaver County PA, OK (Airtable link)

There are also some entries for nearby towns that have separate police departments but share the same homepage URL. For example, Mattoon, Shelbyville, and Sullivan, IL (Airtable link) are all listed with the homepage mattoon.illinois.gov despite being different towns with their own police departments.

josh-chamberlain commented 8 months ago

I manually fixed these, along with a few other duplicates. According to Airtable, there are ~1000 other sets of duplicates.

maxachis commented 8 months ago

Importing the Agency table into a SQLite database, I was able to retrieve the duplicates using the following:

SELECT a1.name, a2.name, a1.state_iso AS state1, a2.state_iso AS state2, a1.homepage_url
FROM agencies AS a1
JOIN agencies AS a2 
    ON 
    a1.state_iso != a2.state_iso
    AND a1.homepage_url = a2.homepage_url
WHERE a1.rowid < a2.rowid; -- to avoid duplicate pairs in reverse order
maxachis commented 8 months ago

I can get to work on creating a view, but first things first, I'll need access to the Airtable.

josh-chamberlain commented 8 months ago

@maxachis since you're already making a SQLite database, could you output JSON? It could meet the list requirement without relying on external resources. I can still get you Airtable access when I'm next at a computer; it has a handy manual dedupe feature that found 1100 sets...

maxachis commented 8 months ago

@josh-chamberlain I'm a little unclear on what is meant here by "could you output JSON". If you mean, can I output the results of my SQL script as a JSON file, certainly! I've attached them here. csvjson.json
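
For reference, a minimal sketch of that kind of export, assuming the Agency table has been imported into a local SQLite file (the database and output filenames here are illustrative):

import json
import sqlite3

# Pairwise duplicate query from earlier in the thread, with columns aliased
# so the two name/state columns don't collide when converted to dicts
QUERY = """
SELECT a1.name AS name1, a2.name AS name2,
       a1.state_iso AS state1, a2.state_iso AS state2,
       a1.homepage_url AS homepage_url
FROM agencies AS a1
JOIN agencies AS a2
    ON a1.state_iso != a2.state_iso
    AND a1.homepage_url = a2.homepage_url
-- rowid comparison avoids listing the same pair twice in reverse order
WHERE a1.rowid < a2.rowid
"""

conn = sqlite3.connect("agencies.sqlite")  # hypothetical local import of the Agency table
conn.row_factory = sqlite3.Row             # rows can be converted to dicts

rows = [dict(row) for row in conn.execute(QUERY)]

with open("duplicate_pairs.json", "w") as f:
    json.dump(rows, f, indent=2)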

josh-chamberlain commented 8 months ago

@maxachis yes, that's what I meant! This suffices as a "view/list" without needing to use the Airtable API as part of the process.

It does look like these duplicates are captured as a bunch of pairs, as opposed to name 1, name 2, name 3, etc.; I'm not sure if it'd be better to group them into one set. For example, the Amtrak ones should probably just end up as one agency. That said, if this is still usable by downstream processes...we're good.
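
A minimal sketch of what grouping by URL could look like, assuming the same hypothetical SQLite import used for the pairwise query (one row per shared homepage rather than one row per pair):

import sqlite3

conn = sqlite3.connect("agencies.sqlite")  # hypothetical local import of the Agency table

# Collapse each set of agencies sharing a homepage_url into a single row
GROUPED_QUERY = """
SELECT homepage_url,
       COUNT(*) AS n_agencies,
       GROUP_CONCAT(name || ' (' || state_iso || ')', '; ') AS duplicate_set
FROM agencies
WHERE homepage_url IS NOT NULL AND homepage_url != ''
GROUP BY homepage_url
HAVING COUNT(*) > 1
"""

for url, n_agencies, duplicate_set in conn.execute(GROUPED_QUERY):
    print(f"{url} ({n_agencies} agencies): {duplicate_set}")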

I think it'd take 2–3 hours to manually dedupe them all in Airtable, and we're not adding Agencies as quickly as we once were. I would volunteer for this task if that ends up being the best way.

(screenshot: Screen Shot 2024-02-23 at 9.51.56 AM)
maxachis commented 8 months ago

It may be worth considering automating Google searches with the agency names to see if we can find homepages that way. The rough logic: search for each duplicate agency's name and state, and take the first result as the candidate homepage URL.

This would entail the use of the Custom Search JSON API, which allows 100 free searches per day.

Would this work? No idea. It's predicated on the idea that we can reliably link names with the first result of such a search, but it could be worthwhile as a point of experimentation. If successful, it may have implications for other work.
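
A rough sketch of what a single lookup might look like with that API (the endpoint and parameters are Google's Custom Search JSON API; the key, search engine ID, and query template are placeholders, and treating the first hit as the homepage is exactly the assumption being tested):

import requests

API_KEY = "YOUR_GOOGLE_API_KEY"    # placeholder
SEARCH_ENGINE_ID = "YOUR_CSE_ID"   # placeholder Programmable Search Engine ID

def find_homepage(agency_name: str, state: str) -> str | None:
    """Return the first search result URL for an agency, or None if there are no hits."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": SEARCH_ENGINE_ID,
            "q": f"{agency_name} {state} police department",
            "num": 1,
        },
        timeout=10,
    )
    response.raise_for_status()
    items = response.json().get("items", [])
    return items[0]["link"] if items else None

# Example: find_homepage("Knoxville Police Department", "TN")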

maxachis commented 8 months ago

Just tested this out on a small sample set. The results weren't bad! In a number of cases I was able to successfully generate the correct websites for both entries in a duplicate pair (I only tested duplicate pairs, but this can apply to larger groups of duplicates). There are several caveats.

Three questions I have for this (and for @josh-chamberlain) are:

  1. Do we consider this a useful enough partial solution for duplicate agency entries for me to explore developing it further?
  2. Is this something that could usefully be made part of the automation process (e.g., every so often, scan for URL duplicates, pull the recommended resolutions, and then have someone decide whether to manually approve or deny those resolutions)?
  3. Could this method of automated Google searches be useful in other domains of PDAP?
josh-chamberlain commented 8 months ago

Thank you!

  1. yes, this is useful! keep in mind that once we get through these ~1000 sets of duplicates, we won't be adding agencies/duplicates nearly as quickly. you may get diminishing returns.

  2. yes, manual confirmation is critical.

  3. it could help us find sources of data on request—maybe a google search for "pittsburgh arrest data" could generate 10 URLs to be checked by the data source identification pipeline.

Just to be clear, the Amtrak one would be considered a single federal agency. We don't need to have all the state-level agencies represented.

maxachis commented 8 months ago

With regard to 3, I wonder if it may be useful to consolidate Federal Agencies involving multiple states into one entity on this table and perhaps break them down in a separate table? That may help avoid confusion.

josh-chamberlain commented 8 months ago

@maxachis yep, federal agencies involving multiple states should still just be one agency. For example, those Amtrak ones should all be the same!

maxachis commented 8 months ago

So, good news and bad news. The good news: I've started creating a standalone repo which, when completed, will be able to process Airtable data, identify possible duplicates, and provide an interface for resolving them.

The bad news: if we also count agencies that did not have a URL in the first place, the number of duplicates increases substantially. Several thousand have no URLs whatsoever. We could ignore those, make them a lower priority, or handle them the same way we intend to handle the duplicates, but I will need @josh-chamberlain's thoughts on that.

maxachis commented 8 months ago

Here is a current view of the deduplication interface: (screenshot)

Every time you accept or reject a URL, it marks that entry as reviewed in a database and then goes on to the next unreviewed entry with a proposed URL, if one exists. I additionally added a handy-dandy feature that loads the page in an iframe, if the page we're trying to access allows for that (which is not always the case).
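
Roughly, the accept/reject loop behind the interface could be as simple as the following (the table and column names here are illustrative, not the tool's actual schema):

import sqlite3

conn = sqlite3.connect("dedupe_review.sqlite")  # hypothetical review database

def next_unreviewed(conn: sqlite3.Connection):
    """Fetch the next unreviewed entry that has a proposed URL, if one exists."""
    return conn.execute(
        "SELECT id, agency_name, proposed_url FROM proposed_urls "
        "WHERE reviewed = 0 AND proposed_url IS NOT NULL "
        "ORDER BY id LIMIT 1"
    ).fetchone()

def record_decision(conn: sqlite3.Connection, entry_id: int, approved: bool):
    """Mark an entry as reviewed and store whether its proposed URL was accepted."""
    conn.execute(
        "UPDATE proposed_urls SET reviewed = 1, approved = ? WHERE id = ?",
        (1 if approved else 0, entry_id),
    )
    conn.commit()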

While I can make a few more tweaks to the interface, this seems like a fairly decent workflow.

Right now, here are my main queries:

josh-chamberlain commented 8 months ago

@maxachis wow, I didn't realize you were making a manual in-browser deduping tool! I can see now that was always the case, but I misunderstood and thought you were trying to resolve these automatically with the Google searches strategy. Anyway, this is cool. Airtable has a decent de-duping app, so we probably don't need our own interface for it, but this one does have some benefits and I can see it being useful for a bunch of other types of data validation/cleaning.

From a user experience design perspective, we should probably clearly define the two use cases: deduping existing data, and identifying correct agency URLs.

  1. Let's hold until we know we need more. We can pay for a few thousand calls if needed.

  2. I tend to assume the user will be a volunteer without much context → public repo with enough context for a stranger. Then, if staff need to use it, it'll work. Probably, it will be me and anyone else I can rope into the tedium.

  3. Let's treat empty ones as a separate issue. Many of these agencies don't have web presence.

How about this: I created a data-cleaning repo where you can put this work. I don't think it's worth doing too much more refinement, but it's definitely worth making sure this is usable so we can use it and refine as needed. At the end of the process, the user should be able to dump the modified entries to CSV via console command. It should have at least these properties from the schema to make it so that a simple import will resolve everything:

I think a CSV makes sense here, rather than trying to update them via the API, for future-proofing and flexibility.
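
A minimal sketch of that console dump, assuming a review database like the one sketched above; the column names are placeholders for whichever schema properties end up being required for the import:

import csv
import sqlite3

conn = sqlite3.connect("dedupe_review.sqlite")  # hypothetical review database

def export_approved(path: str = "export.csv"):
    """Write approved entries to CSV; rejected and unreviewed entries are omitted."""
    # Placeholder columns -- swap in the actual schema properties needed for a clean import
    header = ["airtable_uid", "name", "state_iso", "homepage_url"]
    rows = conn.execute(
        "SELECT airtable_uid, name, state_iso, homepage_url "
        "FROM proposed_urls WHERE reviewed = 1 AND approved = 1"
    )
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

if __name__ == "__main__":
    export_approved()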

maxachis commented 8 months ago

@josh-chamberlain So, do we want something like this? I currently have it so that approved changes are exported to CSV. Anything rejected (or never considered in the first place) simply doesn't show up in the export.

export.csv

maxachis commented 8 months ago

https://github.com/Police-Data-Accessibility-Project/data-cleaning/pull/1

Created a pull request in the new repository, for your reviewing pleasure, @josh-chamberlain.