Police-Data-Accessibility-Project / data-sources-app

An API and UI for using and maintaining the Data Sources database
MIT License
3 stars 5 forks source link

Enhance `data-sources` CRUD endpoints #429

Closed maxachis closed 3 days ago

maxachis commented 2 months ago

Context

In keeping with the changes in #417 and #390, the /data-sources endpoints need to be updated to enforce consistency with new standards and enhancements.

Requirements

Related

Tests

Docs

Open questions

maxachis commented 2 months ago

@josh-chamberlain, as before I'll need your input on the data sources permissions:

column_name STANDARD FORM ADMIN NOTE
name READ NONE WRITE
submitted_name READ NONE WRITE
description READ NONE WRITE
record_type READ NONE WRITE
source_url READ NONE WRITE
agency_supplied READ NONE WRITE
supplying_entity READ NONE WRITE
agency_originated READ NONE WRITE
agency_aggregation READ NONE WRITE
coverage_start READ NONE WRITE
coverage_end READ NONE WRITE
source_last_updated READ NONE READ
detail_level READ NONE WRITE
number_of_records_available NONE NONE NONE Deprecate in v2
size NONE NONE NONE Deprecate in v2
access_type READ NONE WRITE
record_download_option_provided READ NONE WRITE
data_portal_type READ NONE WRITE
record_format READ NONE WRITE
update_method READ NONE WRITE
tags READ NONE WRITE
readme_url READ NONE WRITE
originating_entity READ NONE WRITE
retention_schedule READ NONE WRITE
airtable_uid READ NONE READ
scraper_url READ NONE WRITE
data_source_created READ NONE READ
airtable_source_last_modified READ NONE READ
url_broken NONE NONE NONE Should already be deprecated
submission_notes NONE NONE READ
rejection_note READ NONE WRITE
last_approval_editor READ NONE READ
submitter_contact_info NONE NONE READ
agency_described_submitted READ NONE WRITE
agency_described_not_in_database READ NONE WRITE
approved NONE NONE NONE Should already be deprecated
record_type_other READ NONE WRITE
data_portal_type_other READ NONE WRITE
private_access_instructions NONE NONE READ
records_not_online NONE NONE NONE Should already be deprecated
data_source_request READ NONE READ
url_button NONE NONE NONE Should not be synced from Airtable
tags_other READ NONE WRITE
broken_source_url_as_of READ NONE WRITE
access_notes READ NONE WRITE
url_status READ NONE WRITE
approval_status NONE NONE WRITE
record_type_id READ NONE WRITE
maxachis commented 2 months ago

@josh-chamberlain Additionally, I note a quirk in our current /data-sources-by-id endpoint which may need figuring out:

The GET method for this returns not just columns from data_sources, but columns from agencies and data_sources_archive_info as well. This adds a few interesting wrinkles:

  1. As is, this logic violates the standard I've set for /data-requests and /agencies, where the GET function returns columns only from one table, and doesn't add columns from other tables.
  2. This confuses the GET and POST/EDIT logic; some of the columns returned by GET will not be columns that can be edited (even if only theoretically, due to access permissionsinPOST/PUT`
  3. While I do see value in connecting these tables together (I would imagine many a person would want to see the names of agencies associated with data requests), these CRUD endpoints might not be the place for them.

My recommendation is to do away with the old method and maintain the one-table-per-endpoint standard I've set for /data-requests and /agencies; what do you think?

josh-chamberlain commented 2 months ago

@maxachis

Yes, it does—I would argue that agency-level properties are core to describing Data Sources, as they contain baseline information about location and jurisdiction, but that is mostly for searching, and we have an endpoint for that. You also need them on Data Source Details, and we can hit the Agencies endpoint for the info if needed. I am okay with taking your recommendation and keeping things uniform for now (i.e. only returning columns from the table).

maxachis commented 2 months ago
  • permission table updated! Again, many of these can come through on form submissions, so they are not STANDARD but neither or they admin-only. Do we need that information somewhere?

@josh-chamberlain Yes! I added a "FORM" column to indicate this, although I wonder if we should consider them to have similar logic to Data Requests owners instead and have the column be "OWNER" -- if it's the latter, that may allow me to copy more of the existing logic for data requests.

I suppose it relates to a larger series of questions to be had over data sources (and agencies, to which this also applies):

Essentially, in addition to having logic for users who are data requestors, do we want logic for users who are data providers? Maybe not for v2, but perhaps for post-v2?

joshuagraber commented 2 months ago

@josh-chamberlain @maxachis Some thoughts here:

As is, this logic violates the standard I've set for /data-requests and /agencies, where the GET function returns columns only from one table, and doesn't add columns from other tables.

Well, this may mean that the DB design needs rethinking. Can't we just associate the agency fields in the data source table with the associated agency? This would be the standard practice for dealing with such a problem, rather than having the client hit 2 endpoints and merge the responses in order to create 1 object.

This confuses the GET and POST/EDIT logic; some of the columns returned by GET will not be columns that can be edited (even if only theoretically, due to access permissionsinPOST/PUT`

I don't think this is a problem. I've worked with plenty of APIs that have different permission structures for different methods. We can bake these entitlements into the users table and encode them into the JWT.

While I do see value in connecting these tables together (I would imagine many a person would want to see the names of agencies associated with data requests), these CRUD endpoints might not be the place for them.

Exactly. Relating them in the DB is the most efficient and standard solution for this.

maxachis commented 2 months ago

@josh-chamberlain After chatting with @joshuagraber , I think he's right and it's worth pulling from a view for GET that includes agency-relevant info -- while clearly demarcating what are attributes from agencies and what are not. I'll think more about how that will work.

I'll think more about the implementation on this. I'll also make a note to include @joshuagraber earlier in these conversations, so there's less changing of horses mid-stream 🐎💦

maxachis commented 2 months ago

@josh-chamberlain @joshuagraber

Incidentally, this means the structure for GET responses will be a bit different. The outer wrapper will be the same, but the interior will change to include a nested agency which contains the agency-derived properties. Just to clearly demarcate what is derived from agencies:

{
    "id": 1
    "agency_id": 2
    "name": "cat"
    "agency": {
        "id": 2
        "name": "Cats Department"
    }
}

Note that agency_id remains at the first level because it is technically an attribute of data sources.

josh-chamberlain commented 2 months ago

thank you @maxachis and @joshuagraber for finding each other and resolving this harmoniously!

maxachis commented 2 months ago

My intent is to work on this once @EvilDrPurple finishes a round of her #303 work. The question of how to implement nested queries with JSON while also taking variable columns (and to ensure that can be extended to other endpoint which may need this in the future as well) is a nontrivial problem. From what I've seen, SQLAlchemy may make that easier.

EvilDrPurple commented 1 month ago

@maxachis

{
    "id": 1
    "agency_id": 2
    "name": "cat"
    "agency": {
        "id": 2
        "name": "Cats Department"
    }
}

The only potential issue with this exact implementation is there are instances (4 to be exact) where a data source has multiple agencies attached to it. You can view them by querying the database like this:

SELECT * FROM
(
  SELECT link_id, airtable_uid, agency_described_linked_uid,
  COUNT(1)OVER(partition by airtable_uid) AS cnt
  FROM agency_source_link
)a
WHERE cnt > 1
maxachis commented 1 month ago

@EvilDrPurple @josh-chamberlain

Oooooh, that's an interesting dynamic. One possible solution is to instead have it be agencies, which returns the results as a list, like:

{
    "id": 1
    "agency_ids": [2, 3]
    "name": "cat"
    "agencies": [
        {
          "id": 2
          "name": "Cats Department"
        },
        {
          "id": 3
          "name": "Felines Forensics"
        },    
     ]
}

At the same time, maybe we don't want that if there are only 4 instances in the database where this occurs. What do you think, @josh-chamberlain?

EvilDrPurple commented 1 month ago

At the same time, maybe we don't want that if there are only 4 instances in the database where this occurs.

It is likely we will add more in the future, as we find more data sources that span multiple departments, such as multi-agency investigations or regional data that lumps in many agencies

josh-chamberlain commented 1 month ago

I can confirm that this will happen more in the future. In fact, the number would be bigger if we stopped using the aggregated agencies. Discussion here: https://github.com/Police-Data-Accessibility-Project/data-sources-app/issues/272

joshuagraber commented 1 month ago

I think @maxachis's suggestion to return an array of associated agencies makes sense in this case. For web app purposes, we can just loop through it and render UI for each of them. And I think for anyone else using the API, the structure of the data is clear that the data is associated with / aggregated from multiple agencies.

maxachis commented 1 month ago
joshuagraber commented 1 month ago

@maxachis FYI, I'm starting on the data source view, which will require agencies returned from this endpoint. Do you think you have space to prioritize this soon, or should I start with some placeholder data for now?

Also, it looks like @vikramrojo is planning (for the future) a link from this view to the data source map, centered on the source, so I'll eventually need the geolocation fields here as well. Not urgent, but while we're here I thought I'd mention it.

joshuagraber commented 1 month ago

In addition to the above format, we'll likely also want to include the "count" of the number of agencies returned, much as we do in typical GET MANY requests.

Yes, I think so.

maxachis commented 1 month ago

@maxachis FYI, I'm starting on the data source view, which will require agencies returned from this endpoint. Do you think you have space to prioritize this soon, or should I start with some placeholder data for now?

Also, it looks like @vikramrojo is planning (for the future) a link from this view to the data source map, centered on the source, so I'll eventually need the geolocation fields here as well. Not urgent, but while we're here I thought I'd mention it.

Oh it's already underway: I've my main fish @EvilDrPurple working on developing it. But it may be germane to use some placeholder data for at least a little bit while we get the final kinks worked out.

joshuagraber commented 1 month ago

Sounds good @maxachis thanks for the update!

EvilDrPurple commented 1 month ago

@joshuagraber @maxachis

@maxachis FYI, I'm starting on the data source view, which will require agencies returned from this endpoint. Do you think you have space to prioritize this soon, or should I start with some placeholder data for now?

Optimistically I should have it submitted by tomorrow, so long as there are no unforeseen problems I run into

maxachis commented 1 month ago

@joshuagraber With @EvilDrPurple's changes merged in, I can start working on this! I'm going to work into incorporating the changes described in #430 at the same time, since I'm in the area and otherwise I'll be repeating my efforts not long after this, so that will extend the time it takes to complete this, but hopefully only by a day or two.

Edit: I already took care of this part, so I'll be working on implementing the enhancements as requested.