Enhance `data-sources` CRUD endpoints

maxachis commented 2 months ago

Context

In keeping with the changes in #417 and #390, the /data-sources endpoints need to be updated to enforce consistency with new standards and enhancements.

Requirements

Add DELETE endpoint to `data-source
Update /data-sources/GET` endpoint to include pagination and sorting
Refactor other endpoints to bring them in line with new standards and enhancements

Tests

Try to maintain continuity with existing tests for these endpoints

Docs

Swagger documentation will need to be enhanced with latest updates.

Open questions

What do we want to do with /data-sources-needs-identification and /data-sources-map? These can likely be consolidated with filter and column selection logic into the main endpoints (albeit at a performance cost (though perhaps not a substantial one))
For consistency, I'd like to modify the endpoint nomenclature to bring them in line with the other endpoints -- in other words, /data-sources-by-id/{data_source_id} will become /data-sources/id/{data_source_id} and /data-sources will become /data-sources/page/{page_number}. Is this acceptable?

maxachis commented 2 months ago

@josh-chamberlain, as before I'll need your input on the data sources permissions:

column_name	STANDARD	FORM	ADMIN	NOTE
name	READ	NONE	WRITE
submitted_name	READ	NONE	WRITE
description	READ	NONE	WRITE
record_type	READ	NONE	WRITE
source_url	READ	NONE	WRITE
agency_supplied	READ	NONE	WRITE
supplying_entity	READ	NONE	WRITE
agency_originated	READ	NONE	WRITE
agency_aggregation	READ	NONE	WRITE
coverage_start	READ	NONE	WRITE
coverage_end	READ	NONE	WRITE
source_last_updated	READ	NONE	READ
detail_level	READ	NONE	WRITE
number_of_records_available	NONE	NONE	NONE	Deprecate in v2
size	NONE	NONE	NONE	Deprecate in v2
access_type	READ	NONE	WRITE
record_download_option_provided	READ	NONE	WRITE
data_portal_type	READ	NONE	WRITE
record_format	READ	NONE	WRITE
update_method	READ	NONE	WRITE
tags	READ	NONE	WRITE
readme_url	READ	NONE	WRITE
originating_entity	READ	NONE	WRITE
retention_schedule	READ	NONE	WRITE
airtable_uid	READ	NONE	READ
scraper_url	READ	NONE	WRITE
data_source_created	READ	NONE	READ
airtable_source_last_modified	READ	NONE	READ
url_broken	NONE	NONE	NONE	Should already be deprecated
submission_notes	NONE	NONE	READ
rejection_note	READ	NONE	WRITE
last_approval_editor	READ	NONE	READ
submitter_contact_info	NONE	NONE	READ
agency_described_submitted	READ	NONE	WRITE
agency_described_not_in_database	READ	NONE	WRITE
approved	NONE	NONE	NONE	Should already be deprecated
record_type_other	READ	NONE	WRITE
data_portal_type_other	READ	NONE	WRITE
private_access_instructions	NONE	NONE	READ
records_not_online	NONE	NONE	NONE	Should already be deprecated
data_source_request	READ	NONE	READ
url_button	NONE	NONE	NONE	Should not be synced from Airtable
tags_other	READ	NONE	WRITE
broken_source_url_as_of	READ	NONE	WRITE
access_notes	READ	NONE	WRITE
url_status	READ	NONE	WRITE
approval_status	NONE	NONE	WRITE
record_type_id	READ	NONE	WRITE

maxachis commented 2 months ago

@josh-chamberlain Additionally, I note a quirk in our current /data-sources-by-id endpoint which may need figuring out:

The GET method for this returns not just columns from data_sources, but columns from agencies and data_sources_archive_info as well. This adds a few interesting wrinkles:

As is, this logic violates the standard I've set for /data-requests and /agencies, where the GET function returns columns only from one table, and doesn't add columns from other tables.
This confuses the GET and POST/EDIT logic; some of the columns returned by GET will not be columns that can be edited (even if only theoretically, due to access permissionsinPOST/PUT`
While I do see value in connecting these tables together (I would imagine many a person would want to see the names of agencies associated with data requests), these CRUD endpoints might not be the place for them.

My recommendation is to do away with the old method and maintain the one-table-per-endpoint standard I've set for /data-requests and /agencies; what do you think?

josh-chamberlain commented 2 months ago

@maxachis

permission table updated! Again, many of these can come through on form submissions, so they are not STANDARD but neither or they admin-only. Do we need that information somewhere?
I added two properties that can be deprecated in v2, or now—we don't use them.
for url_broken, approved, records_not_online; these should not exist at all, and are relics of old concepts that should probably have been deprecated. Those are not in Airtable or the Data Dictionary.
- url_broken was replaced by url_status
- approved was replaced by approval_status
- records_not_online was replaced by...nothing, or maybe url status
does record_type_id mean we don't need record_type?

Yes, it does—I would argue that agency-level properties are core to describing Data Sources, as they contain baseline information about location and jurisdiction, but that is mostly for searching, and we have an endpoint for that. You also need them on Data Source Details, and we can hit the Agencies endpoint for the info if needed. I am okay with taking your recommendation and keeping things uniform for now (i.e. only returning columns from the table).

maxachis commented 2 months ago

permission table updated! Again, many of these can come through on form submissions, so they are not STANDARD but neither or they admin-only. Do we need that information somewhere?

@josh-chamberlain Yes! I added a "FORM" column to indicate this, although I wonder if we should consider them to have similar logic to Data Requests owners instead and have the column be "OWNER" -- if it's the latter, that may allow me to copy more of the existing logic for data requests.

I suppose it relates to a larger series of questions to be had over data sources (and agencies, to which this also applies):

Does the original submitter of these entries have any special privileges, such as the ability to edit entries they created in the future?
Do we want to require only signed-in users to provide these entries?
Do we want to track which users provide which sources?

Essentially, in addition to having logic for users who are data requestors, do we want logic for users who are data providers? Maybe not for v2, but perhaps for post-v2?

joshuagraber commented 2 months ago

@josh-chamberlain @maxachis Some thoughts here:

As is, this logic violates the standard I've set for /data-requests and /agencies, where the GET function returns columns only from one table, and doesn't add columns from other tables.

Well, this may mean that the DB design needs rethinking. Can't we just associate the agency fields in the data source table with the associated agency? This would be the standard practice for dealing with such a problem, rather than having the client hit 2 endpoints and merge the responses in order to create 1 object.

This confuses the GET and POST/EDIT logic; some of the columns returned by GET will not be columns that can be edited (even if only theoretically, due to access permissionsinPOST/PUT`

I don't think this is a problem. I've worked with plenty of APIs that have different permission structures for different methods. We can bake these entitlements into the users table and encode them into the JWT.

While I do see value in connecting these tables together (I would imagine many a person would want to see the names of agencies associated with data requests), these CRUD endpoints might not be the place for them.

Exactly. Relating them in the DB is the most efficient and standard solution for this.

maxachis commented 2 months ago

@josh-chamberlain After chatting with @joshuagraber , I think he's right and it's worth pulling from a view for GET that includes agency-relevant info -- while clearly demarcating what are attributes from agencies and what are not. I'll think more about how that will work.

I'll think more about the implementation on this. I'll also make a note to include @joshuagraber earlier in these conversations, so there's less changing of horses mid-stream 🐎💦

maxachis commented 2 months ago

@josh-chamberlain @joshuagraber

Incidentally, this means the structure for GET responses will be a bit different. The outer wrapper will be the same, but the interior will change to include a nested agency which contains the agency-derived properties. Just to clearly demarcate what is derived from agencies:

{
    "id": 1
    "agency_id": 2
    "name": "cat"
    "agency": {
        "id": 2
        "name": "Cats Department"
    }
}

Note that agency_id remains at the first level because it is technically an attribute of data sources.

josh-chamberlain commented 2 months ago

thank you @maxachis and @joshuagraber for finding each other and resolving this harmoniously!

maxachis commented 2 months ago

My intent is to work on this once @EvilDrPurple finishes a round of her #303 work. The question of how to implement nested queries with JSON while also taking variable columns (and to ensure that can be extended to other endpoint which may need this in the future as well) is a nontrivial problem. From what I've seen, SQLAlchemy may make that easier.

EvilDrPurple commented 1 month ago

@maxachis

{
    "id": 1
    "agency_id": 2
    "name": "cat"
    "agency": {
        "id": 2
        "name": "Cats Department"
    }
}

The only potential issue with this exact implementation is there are instances (4 to be exact) where a data source has multiple agencies attached to it. You can view them by querying the database like this:

SELECT * FROM
(
  SELECT link_id, airtable_uid, agency_described_linked_uid,
  COUNT(1)OVER(partition by airtable_uid) AS cnt
  FROM agency_source_link
)a
WHERE cnt > 1

maxachis commented 1 month ago

@EvilDrPurple @josh-chamberlain

Oooooh, that's an interesting dynamic. One possible solution is to instead have it be agencies, which returns the results as a list, like:

{
    "id": 1
    "agency_ids": [2, 3]
    "name": "cat"
    "agencies": [
        {
          "id": 2
          "name": "Cats Department"
        },
        {
          "id": 3
          "name": "Felines Forensics"
        },    
     ]
}

At the same time, maybe we don't want that if there are only 4 instances in the database where this occurs. What do you think, @josh-chamberlain?

EvilDrPurple commented 1 month ago

At the same time, maybe we don't want that if there are only 4 instances in the database where this occurs.

It is likely we will add more in the future, as we find more data sources that span multiple departments, such as multi-agency investigations or regional data that lumps in many agencies

josh-chamberlain commented 1 month ago

I can confirm that this will happen more in the future. In fact, the number would be bigger if we stopped using the aggregated agencies. Discussion here: https://github.com/Police-Data-Accessibility-Project/data-sources-app/issues/272

joshuagraber commented 1 month ago

I think @maxachis's suggestion to return an array of associated agencies makes sense in this case. For web app purposes, we can just loop through it and render UI for each of them. And I think for anyone else using the API, the structure of the data is clear that the data is associated with / aggregated from multiple agencies.

maxachis commented 1 month ago

[ ] In addition to the above format, we'll likely also want to include the "count" of the number of agencies returned, much as we do in typical GET MANY requests.

joshuagraber commented 1 month ago

@maxachis FYI, I'm starting on the data source view, which will require agencies returned from this endpoint. Do you think you have space to prioritize this soon, or should I start with some placeholder data for now?

Also, it looks like @vikramrojo is planning (for the future) a link from this view to the data source map, centered on the source, so I'll eventually need the geolocation fields here as well. Not urgent, but while we're here I thought I'd mention it.

joshuagraber commented 1 month ago

In addition to the above format, we'll likely also want to include the "count" of the number of agencies returned, much as we do in typical GET MANY requests.

Yes, I think so.

maxachis commented 1 month ago

@maxachis FYI, I'm starting on the data source view, which will require agencies returned from this endpoint. Do you think you have space to prioritize this soon, or should I start with some placeholder data for now?

Also, it looks like @vikramrojo is planning (for the future) a link from this view to the data source map, centered on the source, so I'll eventually need the geolocation fields here as well. Not urgent, but while we're here I thought I'd mention it.

Oh it's already underway: I've my main fish @EvilDrPurple working on developing it. But it may be germane to use some placeholder data for at least a little bit while we get the final kinks worked out.

joshuagraber commented 1 month ago

Sounds good @maxachis thanks for the update!

EvilDrPurple commented 1 month ago

@joshuagraber @maxachis

@maxachis FYI, I'm starting on the data source view, which will require agencies returned from this endpoint. Do you think you have space to prioritize this soon, or should I start with some placeholder data for now?

Optimistically I should have it submitted by tomorrow, so long as there are no unforeseen problems I run into

maxachis commented 1 month ago

@joshuagraber With @EvilDrPurple's changes merged in, I can start working on this! I'm going to work into incorporating the changes described in #430 at the same time, since I'm in the area and otherwise I'll be repeating my efforts not long after this, so that will extend the time it takes to complete this, but hopefully only by a day or two.

Edit: I already took care of this part, so I'll be working on implementing the enhancements as requested.

Police-Data-Accessibility-Project / data-sources-app