maxachis opened this issue 7 months ago
Great! Suggested strategy:
- List of Data Sources: use "data portal" / google_api_urls (see #40)
- agencies.count_data_sources: use it to pick which ones to try first, starting with those that have the most sources. They already skew toward populous cities, which are known to already have sources.
@mbodeantor feel free to refine / adjust any of this ↑
First step would be finding data detailing all the different police agencies. I found a good start with the results of the 2018 Census of State and Local Law Enforcement Agencies (CSLLEA). I've attached the results of this census (for parsing in R) as a zip file.
Additionally, here's the primary data as an Excel file:
2018 Police Agencies Census Data.xlsx
17,721 rows, and there's quite a bit of detail in this: not just the location of each agency, but a number of other attributes as well. The full set of information is described in the code book contained within the zip.
I don't want to say that this is the absolute best resource we could use for creating a master table for all police agencies in the United States, but it is hard for me to imagine something better.
@josh-chamberlain I think this data, or at least parts of it, would be worthwhile to add to our dataset, perhaps eventually merging with our agency table. From there, we could connect this information with other information of relevance (for example: the wealth/population of the associated city/county) and use that associated information to prioritize which agencies to start with.
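To make that concrete, here's a rough sketch of loading the census file and lining it up with an export of our agencies table. The census-side column names and the export file are guesses on my part; the real field names are in the code book.

```python
# Sketch only: load the CSLLEA spreadsheet and line it up with an export of our
# agencies table. The census-side column names (AGENCYNAME, STATE) and the
# agencies_export.csv file are guesses; the real field names are in the code book.
import pandas as pd

census = pd.read_excel("2018 Police Agencies Census Data.xlsx")
print(len(census))  # expect roughly 17,721 rows

agencies = pd.read_csv("agencies_export.csv")  # placeholder export of public.agencies

# Naive exact merge on name + state; in practice we'd want fuzzier matching.
merged = agencies.merge(
    census,
    how="left",
    left_on=["submitted_name", "state_iso"],
    right_on=["AGENCYNAME", "STATE"],
)
print(merged.head())
```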
I agree with Josh that starting with the agencies we already have the most data sources for seems likely to yield the most relevant results. Not sure it makes sense to start considering additional agencies until we've gotten through the ones we already have.
So if we were to start with the agencies we have the most data sources for (and which don't currently have a homepage url), that actually wouldn't take too much time. The query I'm envisioning would be:
SELECT
    submitted_name,
    homepage_url,
    jurisdiction_type,
    state_iso,
    municipality,
    county_name,
    airtable_uid,
    count_data_sources,
    zip_code,
    no_web_presence -- Relevant
FROM
    public.agencies
WHERE
    approved = true
    AND homepage_url IS NULL
ORDER BY count_data_sources DESC
That produces about 5,800 rows (with about 60 agencies having 1 or more data sources). So that'd be a good place to start.
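As a rough sketch of how the script could walk that result set (the connection string and the search step below are placeholders, not project code):

```python
# Sketch of how the script could consume the prioritized query.
# The connection string and the search step are placeholders, not project code.
import psycopg2

PRIORITY_QUERY = """
    SELECT submitted_name, state_iso, municipality, county_name,
           airtable_uid, count_data_sources
    FROM public.agencies
    WHERE approved = true AND homepage_url IS NULL
    ORDER BY count_data_sources DESC
    LIMIT 100  -- stay inside the search API's daily quota
"""

def search_for_homepage(agency_row):
    # Placeholder for the Google search step (see the search sketch further down).
    print("would search for:", agency_row)

with psycopg2.connect("postgresql://user:password@host/db") as conn:
    with conn.cursor() as cur:
        cur.execute(PRIORITY_QUERY)
        for agency_row in cur.fetchall():
            search_for_homepage(agency_row)
```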
@mbodeantor @josh-chamberlain I may need my database permissions updated so that I have CREATE TABLE permissions, as I'd like to create a small additional table to keep track of which agencies have had a search performed on them (so I don't run the risk of redoing searches for them). Either that, or I would need the table created for me.
@maxachis You should still have create permissions on the public schema. What issue are you seeing?
@mbodeantor I'm running into a permissions error when attempting to create a table through pgAdmin. This is via the data_sources_app username.
Yeah, you'll have to use the username and password in your DMs.
@mbodeantor Ah, that did it. I was able to create the table, but still got denied when adding a foreign key constraint to agencies (which will help me ensure that anything in the cache properly references an agency uid). Got a permission-denied error; I believe I need the REFERENCES and possibly the ALTER permissions.
@maxachis I granted you all permissions on that schema, lmk if that doesn't do it.
@mbodeantor Nope, still getting a "permission denied for table agencies" error. Here's the SQL query:
ALTER TABLE agency_url_search_cache
ADD CONSTRAINT FK_AGENCY_UID
FOREIGN KEY (agency_airtable_uid)
REFERENCES public.agencies(airtable_uid);
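For reference, adding a foreign key that points at agencies requires the REFERENCES privilege on agencies itself in Postgres, which a schema-level grant alone doesn't provide. A sketch of the grant an admin could run (role name and connection details are placeholders):

```python
# Sketch: grant the REFERENCES privilege on agencies so the FK can be added.
# Role name and connection details are placeholders; an admin would run this.
import psycopg2

conn = psycopg2.connect("postgresql://admin:password@host/db")
with conn.cursor() as cur:
    cur.execute("GRANT REFERENCES ON TABLE public.agencies TO target_role;")
conn.commit()
conn.close()
```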
Draft version of the PR is updated; a few tasks remain.
@josh-chamberlain The current version of this involves direct database access -- specifically, it updates an in-database "cache", agency_url_search_cache, that indicates what agencies have been searched for, so that it doesn't duplicate searches. It directly references the agencies table as a foreign key.
There are other ways to do this. I could maintain the cache in the repository, or I could build an endpoint.
The repository provides the most immediate solution, but if we want to run this regularly, the cache will need to be committed regularly, creating commit clutter*. It'll also be less atomic, meaning we could lose a whole day of Google searches if something breaks. I don't think an in-repo cache is the best option.
Developing an endpoint means we'd have to build a v2 or post-v2 endpoint, and couldn't deploy it until then. That means its development is contingent on the development of those endpoints as well.
The third option is to have it continue interfacing directly with the database. Simplest, but it goes against our attempts to keep all database interactions within the API endpoints.
Let me know your thoughts.
*Even more than my current granular commits add clutter!
@maxachis maybe this is an opportunity to make an endpoint for caching searches -- this is something we want to do, broadly, anyway. It's worth knowing, for basic context, not only the last time we updated but the last time we searched for a type of data (like homepage) in an agency or jurisdiction. For that reason, I think it's worth making an endpoint -- later on we could abstract it for other types of caches, but it could be bespoke for now.
Can you clarify why it would need to be v2 or post-v2?
> Can you clarify why it would need to be v2 or post-v2?
@josh-chamberlain No reason it has to be, per se. I assumed it would be since I figured we were avoiding significant changes to main if not contained within v2 or post-v2. However, if we're comfortable with developing an endpoint in the primary data-source-app repository prior to v2, then I'm happy to do that as well. To do so would, however, either necessitate not incorporating some of the changes I'm setting up in v2 (like integration tests) or else duplicating those efforts.
gotcha. that makes sense...I think it'd be better to effectively make a "legacy" endpoint without those tests and soon upgrade it to v2, duplicating effort that way rather than bringing tests into v1. We can wait until our hand is forced; maintaining a cache in the repo isn't the end of the world as a stopgap. Using the repo feels simpler/less annoying, but I'm not most affected. Your call!
@josh-chamberlain Got it. I will develop a legacy endpoint, probably called /agency-search-cache, and then have my code make a call to it.
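Not committing to a final shape yet, but roughly what I have in mind for that bespoke endpoint. The route, the search_type/last_searched columns, and the database wiring below are assumptions, not settled design:

```python
# Rough sketch of a bespoke /agency-search-cache endpoint. The route, the
# search_type/last_searched columns, and the connection handling are assumptions.
from flask import Blueprint, jsonify, request
import psycopg2

agency_search_cache = Blueprint("agency_search_cache", __name__)

@agency_search_cache.route("/agency-search-cache", methods=["GET", "POST"])
def search_cache():
    with psycopg2.connect("postgresql://user:password@host/db") as conn:
        with conn.cursor() as cur:
            if request.method == "POST":
                # Record that a search of a given type was run for an agency.
                cur.execute(
                    """
                    INSERT INTO agency_url_search_cache
                        (agency_airtable_uid, search_type, last_searched)
                    VALUES (%s, %s, NOW())
                    """,
                    (
                        request.json["agency_airtable_uid"],
                        request.json.get("search_type", "homepage"),
                    ),
                )
                return jsonify({"status": "ok"})
            # GET: list agencies already searched so the script can skip them.
            cur.execute(
                "SELECT agency_airtable_uid, search_type, last_searched "
                "FROM agency_url_search_cache"
            )
            rows = cur.fetchall()
    return jsonify(
        [
            {
                "agency_airtable_uid": uid,
                "search_type": search_type,
                "last_searched": last_searched.isoformat(),
            }
            for uid, search_type, last_searched in rows
        ]
    )
```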
> Using the repo feels simpler/less annoying, but I'm not most affected. Your call!
I've seen repos where identical commits were made nearly every day to refresh some data, and man: even if it seems simpler initially, once you start seeing 30, 60, 90 identical commits over the course of months, it makes things clunky.
@josh-chamberlain For calling the endpoint, I required API key authorization -- then realized I'm not sure I have an API key!
After doing a bit more research, I realized a few things -- among them, that API keys are handled in security.py at the moment (and probably should be modified to something more extensible later). How should we approach this?
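For context, a minimal illustration of an API-key check of the general shape being discussed. This is not the actual contents of security.py; the header format and the environment-variable lookup are stand-ins:

```python
# Illustration only -- not the actual contents of security.py. A minimal API-key
# decorator; the header format and the environment-variable lookup are stand-ins.
import os
from functools import wraps
from flask import jsonify, request

def api_required(func):
    """Reject requests that don't carry a valid API key in the Authorization header."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        provided = request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
        if not provided or not key_is_valid(provided):
            return jsonify({"message": "Please provide a valid API key"}), 403
        return func(*args, **kwargs)
    return wrapper

def key_is_valid(key: str) -> bool:
    # Stand-in check: in the app this would be a lookup against keys stored in the database.
    return key == os.environ.get("API_KEY", "")
```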
@maxachis [...] admin [...] in the database.
This is related to the solution I proposed in https://github.com/Police-Data-Accessibility-Project/data-cleaning/pull/1, which leveraged Google Searches to find police agencies.
Given a list of police districts, cities, counties, etc., we could automate a call to the Google Search Engine API or something similar with the string "{Location} Police Agency Website", and then pull the first result. We could start with the most populous cities/counties/etc. in the US and gradually work our way down, perhaps including additional searches such as searching for all police districts in a location.
The API described above allows 100 free searches per day (or $5 per 1,000). If we create a GitHub Action, we could easily run that in the background each day and gradually add a trickle of results. We could include either only the first result or a set of results, depending on how confident we are that the first result will pull what we're looking for.
This would not accrue as many URLs as quickly, but I suspect it will have a higher hit-to-miss ratio for relevant data.
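A minimal sketch of what that daily pass could look like against Google's Custom Search JSON API (the API key, the search engine ID, and the location list below are placeholders):

```python
# Sketch of the daily search pass using Google's Custom Search JSON API.
# The API key, the search engine ID (cx), and the location list are placeholders;
# quota handling is intentionally naive.
import requests

API_KEY = "..."           # Google API key (placeholder)
SEARCH_ENGINE_ID = "..."  # Programmable Search Engine "cx" value (placeholder)
DAILY_QUOTA = 100         # free tier: 100 searches per day

def first_result_url(location: str):
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": SEARCH_ENGINE_ID,
            "q": f"{location} Police Agency Website",
        },
        timeout=30,
    )
    response.raise_for_status()
    items = response.json().get("items", [])
    return items[0]["link"] if items else None

# Placeholder list; in practice this would come from the prioritized agency query.
locations = ["Pittsburgh, PA", "Allegheny County, PA"]
for location in locations[:DAILY_QUOTA]:
    print(location, first_result_url(location))
```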