Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

Automated search engine calls #53

Open maxachis opened 7 months ago

maxachis commented 7 months ago

This is related to the solution I proposed in https://github.com/Police-Data-Accessibility-Project/data-cleaning/pull/1, which leveraged Google Searches to find police agencies.

Given a list of police districts, cities, counties, etc., we could automate a call to the Google Search Engine API (or something similar) with the string "{Location} Police Agency Website" and pull the first result. We could start with the most populous cities/counties/etc. in the US and gradually work our way down, perhaps adding further searches such as one for all police districts in a location.

The API described above allows 100 free searches per day (or $5 per 1,000). If we create a GitHub Action, we could easily run it in the background each day and gradually add a trickle of results. We could include either only the first result or a set of results, depending on how confident we are that the first result will pull what we're looking for.
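
For illustration, here's a minimal sketch of what the per-location call might look like using the Google Custom Search JSON API. The environment variable names and the helper function are placeholders I'm inventing here, not anything that exists yet:

import os
import requests

GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]      # placeholder credential
SEARCH_ENGINE_ID = os.environ["SEARCH_ENGINE_ID"]  # "cx" for a Programmable Search Engine

def search_agency_homepage(location: str, num_results: int = 1) -> list[str]:
    """Return the top result URL(s) for '{Location} Police Agency Website'."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": GOOGLE_API_KEY,
            "cx": SEARCH_ENGINE_ID,
            "q": f"{location} Police Agency Website",
            # num=1 for first-result-only; up to 10 if we want a candidate set
            "num": num_results,
        },
        timeout=30,
    )
    response.raise_for_status()
    return [item["link"] for item in response.json().get("items", [])]

So something like search_agency_homepage("Allegheny County, PA") would return the first result URL for that query.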

This would not accrue URLs as quickly, but I suspect it would have a higher hit-to-miss ratio for relevant data.

josh-chamberlain commented 7 months ago

Great! Suggested strategy:

@mbodeantor feel free to refine / adjust any of this ↑

maxachis commented 7 months ago

The first step would be finding data detailing all the different police agencies. The results of the 2018 Census of State and Local Law Enforcement Agencies (CSLLEA) are a good start. I've attached the results of this census (for parsing in R) as a zip file:

ICPSR_38771-V1.zip

Additionally, here's the primary data as an Excel file:

2018 Police Agencies Census Data.xlsx

It has 17,721 rows, and there's quite a bit of detail in it: not just the location of each agency, but also:

  1. Their 2018 estimated budget
  2. What type of agency they are (Sheriff's Department, tribal, college, etc.)
  3. What functions they performed (for example, serving warrants; investigating homicide, arson, or cybercrime; crowd control)
  4. And more.

The full set of information is described in the codebook contained within the zip.

I don't want to claim this is the absolute best resource for creating a master table of all police agencies in the United States, but it's hard for me to imagine something better.

@josh-chamberlain I think this data, or at least parts of it, would be worth adding to our dataset, perhaps eventually merging it with our agencies table. From there, we could connect it with other relevant information (for example, the wealth/population of the associated city/county) and use that to prioritize which agencies to start with.
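
As a rough illustration of that prioritization idea (a sketch only; the real column names are defined in the codebook inside the zip, so the ones below are placeholders):

import pandas as pd

# Load the attached spreadsheet; column names referenced below are
# placeholders until checked against the codebook.
agencies = pd.read_excel("2018 Police Agencies Census Data.xlsx")
print(agencies.shape)  # expect 17,721 rows

# Example prioritization, once the real budget column name is known:
# top_agencies = agencies.sort_values("BUDGET_2018", ascending=False).head(100)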

mbodeantor commented 7 months ago

I agree with Josh that starting with the agencies we already have the most data sources for seems likely to yield the most relevant results. I'm not sure it makes sense to consider additional agencies until we've covered the ones we already have.

maxachis commented 7 months ago

So if we were to start with the agencies we have the most data sources for (and which don't currently have a homepage URL), that actually wouldn't take too much time. The query the script would run, as I'm envisioning it, is:

SELECT
    SUBMITTED_NAME,
    HOMEPAGE_URL,
    JURISDICTION_TYPE,
    STATE_ISO,
    MUNICIPALITY,
    COUNTY_NAME,
    AIRTABLE_UID,
    COUNT_DATA_SOURCES,
    ZIP_CODE,
    NO_WEB_PRESENCE -- Relevant
FROM
    PUBLIC.AGENCIES
WHERE
    APPROVED = TRUE
    AND HOMEPAGE_URL IS NULL
ORDER BY COUNT_DATA_SOURCES DESC;

That produces about 5,800 rows (with about 60 agencies having 1 or more data sources). So that'd be a good place to start.
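
For context, the daily job I'm imagining would consume that query roughly like this. This is just a sketch: it assumes a psycopg2 connection, caps itself at the 100-search free tier, and would hand each row to the kind of search call sketched further up.

import os
import psycopg2

DAILY_QUOTA = 100  # free tier of the search API

def agencies_to_search(conn, limit: int = DAILY_QUOTA):
    """Fetch approved agencies with no homepage_url, most data sources first."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT AIRTABLE_UID, SUBMITTED_NAME, STATE_ISO, MUNICIPALITY, COUNTY_NAME
            FROM PUBLIC.AGENCIES
            WHERE APPROVED = TRUE AND HOMEPAGE_URL IS NULL
            ORDER BY COUNT_DATA_SOURCES DESC
            LIMIT %s
            """,
            (limit,),
        )
        return cur.fetchall()

# Usage (connection string name is a placeholder):
# conn = psycopg2.connect(os.environ["DATABASE_URL"])
# for uid, name, state, city, county in agencies_to_search(conn):
#     ...run the search for that agency...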

maxachis commented 7 months ago

@mbodeantor @josh-chamberlain I may need my database permissions updated so that I have CREATE TABLE permission, as I'd like to create a small additional table to keep track of which agencies have already had a search performed (so I don't run the risk of repeating searches for them). Either that, or I would need the table created for me.
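
For reference, this is the kind of table I'm picturing. The table name is what I intend to use; the columns beyond agency_airtable_uid are just a first guess, not a final schema:

# Column layout beyond agency_airtable_uid is a guess at this point.
CREATE_CACHE_TABLE = """
CREATE TABLE IF NOT EXISTS public.agency_url_search_cache (
    id SERIAL PRIMARY KEY,
    agency_airtable_uid TEXT NOT NULL,  -- to be constrained to agencies(airtable_uid)
    search_result_url TEXT,             -- NULL if the search came back empty
    searched_at TIMESTAMP NOT NULL DEFAULT NOW()
);
"""

def create_cache_table(conn) -> None:
    """Run the DDL on an existing psycopg2 connection."""
    with conn.cursor() as cur:
        cur.execute(CREATE_CACHE_TABLE)
    conn.commit()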

mbodeantor commented 7 months ago

@maxachis You should still have create permissions on the public schema. What issue are you seeing?

maxachis commented 7 months ago

@mbodeantor I'm running into this error when attempting to create a table through pgAdmin:

[screenshot of the error message]

This is via the data_sources_app username.

mbodeantor commented 7 months ago

Yeah, you'll have to use the username and password in your DMs.

maxachis commented 7 months ago

@mbodeantor Ah, that did it. I was able to create the table, but I still got denied when adding a foreign key constraint to agencies (which will help me ensure that anything in the cache actually references a valid agency UID). I got the following error:

[screenshot of the error message]

I believe I need the REFERENCES and possibly the ALTER permissions.

mbodeantor commented 7 months ago

@maxachis I granted you all permissions on that schema, lmk if that doesn't do it.

maxachis commented 7 months ago

@mbodeantor Nope, still getting a "permission denied for table agencies" error. Here's the SQL query:

ALTER TABLE agency_url_search_cache
ADD CONSTRAINT FK_AGENCY_UID
FOREIGN KEY (agency_airtable_uid)
REFERENCES public.agencies(airtable_uid);

maxachis commented 7 months ago

Draft version of the PR updated. Remaining tasks:

  1. Create unit/integration tests for the code.
  2. Create a YAML workflow for GitHub Actions, setting the script to run once per day.

maxachis commented 5 months ago

@josh-chamberlain The current version of this involves direct database access -- specifically, it updates an in-database "cache", agency_url_search_cache, which indicates which agencies have already been searched, so that searches aren't duplicated. It directly references the agencies table via a foreign key.

There are other ways to do this. I could maintain the cache in the repository, or I could build an endpoint.

The repository provides the most immediate solution, but if we want to run this regularly, the cache will need to be committed regularly, creating commit clutter.* It's also less atomic, meaning we could lose a whole day of Google searches if something breaks. I don't think an in-repo cache is the best option.

Developing an endpoint means we'd have to build it as a v2 or post-v2 endpoint and couldn't deploy it until then, so its development would be contingent on the development of those endpoints as well.

The third option is to have the script continue to interface directly with the database. That's simplest, but it goes against our attempts to keep all database interactions behind the API.

Let me know your thoughts.

*Even more than my current granular commits add clutter!

josh-chamberlain commented 5 months ago

@maxachis maybe this is an opportunity to make an endpoint for caching searches, which is something we want to do broadly anyway. For basic context, it's worth knowing not only when we last updated an agency or jurisdiction but also when we last searched for each type of data (like homepage). For that reason, I think it's worth making an endpoint; later on we could abstract it for other types of caches, but it could be bespoke for now.

Can you clarify why it would need to be v2 or post-v2?

maxachis commented 5 months ago

> Can you clarify why it would need to be v2 or post-v2?

@josh-chamberlain No reason it has to be, per se. I assumed it would be since I figured we were avoiding significant changes to main unless they were contained within v2 or post-v2. However, if we're comfortable developing an endpoint in the primary data-sources-app repository prior to v2, then I'm happy to do that as well. Doing so would, however, either mean not incorporating some of the changes I'm setting up in v2 (like integration tests) or else duplicating those efforts.

josh-chamberlain commented 5 months ago

Gotcha, that makes sense... I think it'd be better to effectively make a "legacy" endpoint without those tests and soon upgrade it to v2, duplicating effort that way rather than bringing tests into v1. We can wait until our hand is forced; maintaining a cache in the repo isn't the end of the world as a stopgap. Using the repo feels simpler/less annoying, but I'm not most affected. Your call!

maxachis commented 5 months ago

@josh-chamberlain Got it. I will develop a legacy endpoint, probably called /agency-search-cache, and then have my code make a call to it.
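
Roughly, the call from my code would look something like this. The base URL, auth header, and request body here are all placeholders; only the endpoint name is decided so far:

import os
import requests

API_BASE_URL = os.environ["PDAP_API_BASE_URL"]  # placeholder for the data-sources-app deployment
API_KEY = os.environ["PDAP_API_KEY"]            # placeholder credential

def record_search(agency_airtable_uid: str, result_url: str | None) -> None:
    """Tell the API this agency has been searched so it isn't searched again."""
    response = requests.post(
        f"{API_BASE_URL}/agency-search-cache",
        headers={"Authorization": f"Bearer {API_KEY}"},  # header scheme is a guess
        json={"agency_airtable_uid": agency_airtable_uid, "result_url": result_url},
        timeout=30,
    )
    response.raise_for_status()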

> Using the repo feels simpler/less annoying, but I'm not most affected. Your call!

I've seen repos where identical commits were made nearly every day to refresh some data, and man: even if it seems simpler initially, once you start seeing 30, 60, 90 identical commits over the course of months, it makes things clunky.

maxachis commented 4 months ago

@josh-chamberlain For calling the endpoint, I required API key authorization, then realized I'm not sure I have an API key!

After doing a bit more research, I realized a few things:

  1. The GitBook doesn't appear to mention how one gets an API key to access the more sensitive endpoints. I'm sure I could just get admin rights, but that brings me to:
  2. An endpoint modifying the agencies cache should probably require elevated permissions, which would mean modifying security.py for the moment (and probably moving to something more extensible later).

How should we approach this?

josh-chamberlain commented 4 months ago

@maxachis

  1. You can get an API key by hitting the api_key endpoint. Looks like we currently only have one admin in the database.
  2. True! Is it time to take the first step towards ABAC/PBAC? https://github.com/Police-Data-Accessibility-Project/data-sources-app/issues/162