Open josh-chamberlain opened 6 months ago
[ ] create a new table for
source-agency
- properties include
id
source_id
agency_id
- for convenience, we may want to normalize
geographic
properties into this database from agencies
@josh-chamberlain I'd argue against include geo
here because it'd constitute a duplication of information. After all, no matter what data source links to an agency, the geo would be the same, correct?
@maxachis yes—unless it makes a search faster it's not worth considering.
@josh-chamberlain So we already have an agency_source_link
table. I'm not sure when it was created or what scripts use it, or how we didn't notice it before, but it's populated with data.
Unfortunately, that data is not wholly accurate. Using the below script:
with a as (
SELECT
agency_described_linked_uid,
count(airtable_uid) cnt
FROM agency_source_link
GROUP BY agency_described_linked_uid
)
SELECT
a.agency_described_linked_uid,
a.cnt,
agencies.count_data_sources
from a
left join agencies on a.agency_described_linked_uid = agencies.airtable_uid
ORDER by cnt desc
There are discrepancies between how many data sources agencies.count_data_source
says it possesses, and how many are referenced in the linked table.
So a few things from this:
agencies.data_sources
). Then we'll need to figure out when/where/how to apply it.agency_source_link
or not, we probably need to delete agencies.data_sources
and agencies.count_data_sources
, as both will be determinable from the link table, agency_described
in data_sources
(which consists of both agency_described_submitted
and agency_described_not_in_database
)I'm not sure, either.
Some of these fields—like the ones counting agencies/sources associated with each other—come from Airtable.
@josh-chamberlain After doing some more digging, it seems that it's utilized by data_source_queries.py
and quick_search_query.py
, and in fact it's doing a lot of what we want it to do, pulling agencies and the associated data sources.
Which makes some of our existing logic more confusing! Properties such as agency_described_submitted
, or agencies.data_sources
, for example, aren't actually referenced in any SQL query in data-sources-app
. They appear to be completely superfluous to the app's functioning. If they're just included because Airtable includes them, I feel there's little reason to keep them on our end -- they just cause confusion and make it harder to know what ground truth is.
I feel I'll likely need to understand how agency_source_link
is working, which might be a @mbodeantor question. That will inform how to approach all this.
@maxachis Yeah this is all related to airtable. The link between agencies and data_sources in Airtable is not accessible from their API as table, so agency_source_link was created as a way to store that and updated by the mirror script. It would not make sense to maintain it exactly in its current form for a v2 without Airtable.
data_sources and count_data_sources are sourced directly from Airtable and are probably not ever updated as far as I know.
agencies.data_sources
and agencies.county_data_sources
are "magical airtable fields" which keep track of these links; they will be superfluous.
Airtable's "manage fields" feature is helpful to disambiguate some of this if I'm not around: https://airtable.com/app473MWXVJVaD7Es/tblx8XaKnFTphWNQM/viw9mmOR0fw8HFOje?blocks=hide&fieldManager=true
I have updated the parent issue after feedback from this thread and thinking about aggregated agencies in general. I renamed the issue to reflect the broader scope, and the general intent.
related:
Context
agency_described
property in the data sources db, which can accept multiple agenciesAllegheny County aggregated
which are not agencies but geographic descriptions of jurisdictionagency_type="aggregated"
to account for this in the database today; 56 sources affected as of 7/10/24Requirements
source-agency
id
source_id
agency_id
agency_described
propertygeography_described
property to data_sources (described in the submission workflow below)Front end experience
Submission workflow:
geography_described
and fetch them dynamically when people make a search matching that geography. it seems hard to maintain otherwise.↑ in any case, we end up with a bunch of
source-agency
rows, which can be used in searches. when we make this front-end selection of many agencies at once possible, we can do away withaggregated agencies
.Search workflow:
↑ the searcher doesn't care which db properties make sources show up