Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

Agency Table Deduplication #17

Open mbodeantor opened 1 year ago

mbodeantor commented 1 year ago


Description

Duplicate, incorrect entries

The database seems to be populated with agencies from different states that share a town name, but the homepage URLs for each are the same. This may cause links to be matched to agencies in the wrong state. Listed below are a few of the ones I found while working:

- Knoxville TN, IL (Airtable link)
- Harrisburg PA, AR, IL, OH (Airtable link)
- East Hampton NY, CT (Airtable link)
- La Porte TX, IN (Airtable link)
- New Castle WA, NE (Airtable link)
- Greenville SC, IN, ME (Airtable link)
- Stockton CA, KS (Airtable link)
- Altoona PA, IA (Airtable link)
- Beaver County PA, OK (Airtable link)

There are also some entries for nearby towns that have separate police departments but share the same homepage URL. For example, Mattoon, Shelbyville, and Sullivan, IL (Airtable link) are all listed with the homepage mattoon.illinois.gov despite being different towns with their own police departments.

josh-chamberlain commented 8 months ago

I manually fixed these, along with a few other duplicates. According to Airtable, there are ~1000 other sets of duplicates.

maxachis commented 8 months ago

Importing the Agency table into a SQLite database, I was able to retrieve the duplicates using the following:

SELECT a1.name, a2.name, a1.state_iso AS state1, a2.state_iso AS state2, a1.homepage_url
FROM agencies AS a1
JOIN agencies AS a2 
    ON 
    a1.state_iso != a2.state_iso
    AND a1.homepage_url = a2.homepage_url
WHERE a1.rowid < a2.rowid; -- to avoid duplicate pairs in reverse order
maxachis commented 8 months ago

I can get to work on creating a view, but first things first, I'll need access to the Airtable.

josh-chamberlain commented 8 months ago

@maxachis since you're already making a SQLite database, could you output JSON? It could meet the list requirement without relying on external resources. I can still get you Airtable access when I'm next at a computer; it has a handy manual dedupe feature that found 1100 sets...

maxachis commented 8 months ago

@josh-chamberlain I'm a little unclear on what is meant here by "could you output JSON". If you mean, can I output the results of my SQL script as a JSON file, certainly! I've attached them here. csvjson.json
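
For reference, a minimal sketch of that kind of export, assuming the Agency table has been imported into a local SQLite file (the database and output filenames here are illustrative):

import json
import sqlite3

# Pairwise duplicate query from earlier in the thread, with columns aliased
# so the two name/state columns don't collide when converted to dicts
QUERY = """
SELECT a1.name AS name1, a2.name AS name2,
       a1.state_iso AS state1, a2.state_iso AS state2,
       a1.homepage_url AS homepage_url
FROM agencies AS a1
JOIN agencies AS a2
    ON a1.state_iso != a2.state_iso
    AND a1.homepage_url = a2.homepage_url
-- rowid comparison avoids listing the same pair twice in reverse order
WHERE a1.rowid < a2.rowid
"""

conn = sqlite3.connect("agencies.sqlite")  # hypothetical local import of the Agency table
conn.row_factory = sqlite3.Row             # rows can be converted to dicts

rows = [dict(row) for row in conn.execute(QUERY)]

with open("duplicate_pairs.json", "w") as f:
    json.dump(rows, f, indent=2)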

josh-chamberlain commented 8 months ago

@maxachis yes, that's what I meant! This suffices as a "view/list" without needing to use the Airtable API as part of the process.

It does look like these duplicates are captured as a bunch of pairs, as opposed to name 1, name 2, name 3, etc.; I'm not sure if it'd be better to group them into one set. For example, the Amtrak ones should probably just end up as one agency. That said, if this is still usable by downstream processes...we're good.
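
A minimal sketch of what grouping by URL could look like, assuming the same hypothetical SQLite import used for the pairwise query (one row per shared homepage rather than one row per pair):

import sqlite3

conn = sqlite3.connect("agencies.sqlite")  # hypothetical local import of the Agency table

# Collapse each set of agencies sharing a homepage_url into a single row
GROUPED_QUERY = """
SELECT homepage_url,
       COUNT(*) AS n_agencies,
       GROUP_CONCAT(name || ' (' || state_iso || ')', '; ') AS duplicate_set
FROM agencies
WHERE homepage_url IS NOT NULL AND homepage_url != ''
GROUP BY homepage_url
HAVING COUNT(*) > 1
"""

for url, n_agencies, duplicate_set in conn.execute(GROUPED_QUERY):
    print(f"{url} ({n_agencies} agencies): {duplicate_set}")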

I think it'd take 2–3 hours to manually dedupe them all in Airtable, and we're not adding Agencies as quickly as we once were. I would volunteer for this task if that ends up being the best way.

(screenshot: Screen Shot 2024-02-23 at 9.51.56 AM)
maxachis commented 8 months ago

It may be worth considering automating Google searches with the agency names to see if we can find homepages that way. The rough logic: search for each duplicate agency's name and state, and take the first result as the candidate homepage URL.

This would entail the use of the Custom Search JSON API, which allows 100 free searches per day.

Would this work? No idea. It's predicated on the idea that we can reliably link names with the first result of such a search, but it could be worthwhile as a point of experimentation. If successful, it may have implications for other work.
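
A rough sketch of what a single lookup might look like with that API (the endpoint and parameters are Google's Custom Search JSON API; the key, search engine ID, and query template are placeholders, and treating the first hit as the homepage is exactly the assumption being tested):

import requests

API_KEY = "YOUR_GOOGLE_API_KEY"    # placeholder
SEARCH_ENGINE_ID = "YOUR_CSE_ID"   # placeholder Programmable Search Engine ID

def find_homepage(agency_name: str, state: str) -> str | None:
    """Return the first search result URL for an agency, or None if there are no hits."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": SEARCH_ENGINE_ID,
            "q": f"{agency_name} {state} police department",
            "num": 1,
        },
        timeout=10,
    )
    response.raise_for_status()
    items = response.json().get("items", [])
    return items[0]["link"] if items else None

# Example: find_homepage("Knoxville Police Department", "TN")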

maxachis commented 8 months ago

Just tested this out on a small sample set. The results weren't bad! In a number of cases I was able to successfully generate the correct websites for both entries in a duplicate pair (I only tested duplicate pairs, but this can apply to larger groups of duplicates). There are several caveats.

Three questions I have for this (and for @josh-chamberlain) are:

  1. Do we consider this a useful enough partial solution for duplicate agency entries for me to explore developing it further?
  2. Is this something that could usefully be made part of the automation process (e.g., every so often, scan for URL duplicates, pull the recommended resolutions, and then have someone decide whether to manually approve or deny those resolutions)?
  3. Could this method of automated Google searches be useful in other domains of PDAP?
josh-chamberlain commented 8 months ago

Thank you!

  1. yes, this is useful! keep in mind that once we get through these ~1000 sets of duplicates, we won't be adding agencies/duplicates nearly as quickly. you may get diminishing returns.

  2. yes, manual confirmation is critical.

  3. it could help us find sources of data on request—maybe a google search for "pittsburgh arrest data" could generate 10 URLs to be checked by the data source identification pipeline.

Just to be clear, the Amtrak one would be considered a single federal agency. We don't need to have all the state-level agencies represented.

maxachis commented 8 months ago

With regard to 3, I wonder if it may be useful to consolidate Federal Agencies involving multiple states into one entity on this table and perhaps break them down in a separate table? That may help avoid confusion.

josh-chamberlain commented 8 months ago

@maxachis yep, federal agencies involving multiple states should still just be one agency. For example, those Amtrak ones should all be the same!

maxachis commented 8 months ago

So, good news and bad news. The good news: I've started creating a standalone repo which, when completed, will be able to process Airtable data, identify possible duplicates, and provide an interface for resolving them.

The bad news: if we also count agencies that did not have a URL in the first place, the number of duplicates increases substantially. Several thousand have no URLs whatsoever. We could ignore those, make them a lower priority, or handle them the same way we intend to handle the duplicates, but I will need @josh-chamberlain's thoughts on that.

maxachis commented 8 months ago

Here is a current view of the deduplication interface: (screenshot)

Every time you accept or reject a URL, it marks that entry as reviewed in a database and then goes on to the next unreviewed entry with a proposed URL, if one exists. I additionally added a handy-dandy feature that loads the page in an iframe, if the page we're trying to access allows for that (which is not always the case).
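
Roughly, the accept/reject loop behind the interface could be as simple as the following (the table and column names here are illustrative, not the tool's actual schema):

import sqlite3

conn = sqlite3.connect("dedupe_review.sqlite")  # hypothetical review database

def next_unreviewed(conn: sqlite3.Connection):
    """Fetch the next unreviewed entry that has a proposed URL, if one exists."""
    return conn.execute(
        "SELECT id, agency_name, proposed_url FROM proposed_urls "
        "WHERE reviewed = 0 AND proposed_url IS NOT NULL "
        "ORDER BY id LIMIT 1"
    ).fetchone()

def record_decision(conn: sqlite3.Connection, entry_id: int, approved: bool):
    """Mark an entry as reviewed and store whether its proposed URL was accepted."""
    conn.execute(
        "UPDATE proposed_urls SET reviewed = 1, approved = ? WHERE id = ?",
        (1 if approved else 0, entry_id),
    )
    conn.commit()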

While I can make a few more tweaks to the interface, this seems like a fairly decent workflow.

Right now, here are my main queries:

josh-chamberlain commented 8 months ago

@maxachis wow, I didn't realize you were making a manual in-browser deduping tool! I can see now that was always the case, but I misunderstood and thought you were trying to resolve these automatically with the Google searches strategy. Anyway, this is cool. Airtable has a decent de-duping app, so we probably don't need our own interface for it, but this one does have some benefits and I can see it being useful for a bunch of other types of data validation/cleaning.

From a user experience design perspective, we should probably clearly define the two use cases: deduping existing data, and identifying correct agency URLs.

  1. Let's hold until we know we need more. We can pay for a few thousand calls if needed.

  2. I tend to assume the user will be a volunteer without much context → public repo with enough context for a stranger. Then, if staff need to use it, it'll work. Probably, it will be me and anyone else I can rope into the tedium.

  3. Let's treat empty ones as a separate issue. Many of these agencies don't have web presence.

How about this: I created a data-cleaning repo where you can put this work. I don't think it's worth doing too much more refinement, but it's definitely worth making sure this is usable so we can use it and refine as needed. At the end of the process, the user should be able to dump the modified entries to CSV via console command. It should have at least these properties from the schema to make it so that a simple import will resolve everything:

I think a CSV makes sense here, rather than trying to update them via the API, for future-proofing and flexibility.
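
A minimal sketch of that console dump, assuming a review database like the one sketched above; the column names are placeholders for whichever schema properties end up being required for the import:

import csv
import sqlite3

conn = sqlite3.connect("dedupe_review.sqlite")  # hypothetical review database

def export_approved(path: str = "export.csv"):
    """Write approved entries to CSV; rejected and unreviewed entries are omitted."""
    # Placeholder columns -- swap in the actual schema properties needed for a clean import
    header = ["airtable_uid", "name", "state_iso", "homepage_url"]
    rows = conn.execute(
        "SELECT airtable_uid, name, state_iso, homepage_url "
        "FROM proposed_urls WHERE reviewed = 1 AND approved = 1"
    )
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

if __name__ == "__main__":
    export_approved()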

maxachis commented 8 months ago

@josh-chamberlain So, do we want something like this? I currently have it so that approved changes are exported to CSV. Anything rejected (or never considered in the first place) simply doesn't show up in the export.

export.csv

maxachis commented 8 months ago

https://github.com/Police-Data-Accessibility-Project/data-cleaning/pull/1

Created a pull request in the new repository, for your reviewing pleasure, @josh-chamberlain.