Police-Data-Accessibility-Project / data-sources-app

An API and UI for using and maintaining the Data Sources database
MIT License
3 stars 5 forks source link

Allow selection of many agencies for a data source #272

Open josh-chamberlain opened 6 months ago

josh-chamberlain commented 6 months ago

related:

Context

Requirements

Front end experience

Submission workflow:

↑ in any case, we end up with a bunch of source-agency rows, which can be used in searches. when we make this front-end selection of many agencies at once possible, we can do away with aggregated agencies.

Search workflow:

↑ the searcher doesn't care which db properties make sources show up

maxachis commented 5 months ago
  • [ ] create a new table for source-agency

    • properties include id source_id agency_id
    • for convenience, we may want to normalize geographic properties into this database from agencies

@josh-chamberlain I'd argue against include geo here because it'd constitute a duplication of information. After all, no matter what data source links to an agency, the geo would be the same, correct?

josh-chamberlain commented 5 months ago

@maxachis yes—unless it makes a search faster it's not worth considering.

maxachis commented 5 months ago

@josh-chamberlain So we already have an agency_source_link table. I'm not sure when it was created or what scripts use it, or how we didn't notice it before, but it's populated with data.

Unfortunately, that data is not wholly accurate. Using the below script:

with a as (
    SELECT 
    agency_described_linked_uid,
    count(airtable_uid) cnt
    FROM agency_source_link
    GROUP BY agency_described_linked_uid    
)
SELECT
    a.agency_described_linked_uid,
    a.cnt,
    agencies.count_data_sources
from a
left join agencies on a.agency_described_linked_uid = agencies.airtable_uid
ORDER by cnt desc

There are discrepancies between how many data sources agencies.count_data_source says it possesses, and how many are referenced in the linked table.

So a few things from this:

  1. Unclear what agency_source_link is currently used for, or its history
  2. We'll need to create an OTS that rebuilds the link table using whatever the ground truth is (I assume agencies.data_sources). Then we'll need to figure out when/where/how to apply it.
  3. Whether we keep agency_source_link or not, we probably need to delete agencies.data_sources and agencies.count_data_sources, as both will be determinable from the link table,
  4. Those deletions are in addition to agency_described in data_sources (which consists of both agency_described_submitted and agency_described_not_in_database)
josh-chamberlain commented 5 months ago

I'm not sure, either.

Some of these fields—like the ones counting agencies/sources associated with each other—come from Airtable.

maxachis commented 5 months ago

@josh-chamberlain After doing some more digging, it seems that it's utilized by data_source_queries.py and quick_search_query.py, and in fact it's doing a lot of what we want it to do, pulling agencies and the associated data sources.

Which makes some of our existing logic more confusing! Properties such as agency_described_submitted, or agencies.data_sources, for example, aren't actually referenced in any SQL query in data-sources-app. They appear to be completely superfluous to the app's functioning. If they're just included because Airtable includes them, I feel there's little reason to keep them on our end -- they just cause confusion and make it harder to know what ground truth is.

I feel I'll likely need to understand how agency_source_link is working, which might be a @mbodeantor question. That will inform how to approach all this.

mbodeantor commented 5 months ago

@maxachis Yeah this is all related to airtable. The link between agencies and data_sources in Airtable is not accessible from their API as table, so agency_source_link was created as a way to store that and updated by the mirror script. It would not make sense to maintain it exactly in its current form for a v2 without Airtable.

data_sources and count_data_sources are sourced directly from Airtable and are probably not ever updated as far as I know.

josh-chamberlain commented 5 months ago

agencies.data_sources and agencies.county_data_sources are "magical airtable fields" which keep track of these links; they will be superfluous.

Airtable's "manage fields" feature is helpful to disambiguate some of this if I'm not around: https://airtable.com/app473MWXVJVaD7Es/tblx8XaKnFTphWNQM/viw9mmOR0fw8HFOje?blocks=hide&fieldManager=true

josh-chamberlain commented 3 months ago

I have updated the parent issue after feedback from this thread and thinking about aggregated agencies in general. I renamed the issue to reflect the broader scope, and the general intent.