Police-Data-Accessibility-Project / data-sources-app

An API and UI for using and maintaining the Data Sources database
MIT License

Create/modify endpoint methods that create entries in database to give all columns that can be created. #365

Open maxachis opened 2 months ago

maxachis commented 2 months ago

For methods that create entries in the database (currently only the POST method of /data-sources), it may be useful to do the following:

  1. Break up endpoints more explicitly into sub-endpoints -- one which retrieves entries (i.e., the GET method of /data-sources, which would become /data-sources/all) and one which adds entries (i.e., the POST method of /data-sources, which would become /data-sources/add).
  2. Add a GET method to those designed to add entries so that they retrieve all columns which can be created. For example, GET for /data-sources/add would retrieve all columns that can be added via the method, along with a description and their data type.
  3. For all columns which can be added, add comments to those columns within the database; these comments would be returned along with the other information and would provide the most up-to-date description of what each column does (a rough sketch of this follows the list).
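
As a rough sketch of step 3 (referenced above), a column comment could be applied as shown below. This assumes psycopg2 as the driver; the connection string, column name, and description text are purely illustrative.

import psycopg2

# Connection details are placeholders; the real app presumably has its own helper.
conn = psycopg2.connect("dbname=data_sources user=postgres")
with conn, conn.cursor() as cur:
    # Attach the documentation directly to the column so the database
    # itself holds the single source of truth for its description.
    cur.execute(
        "COMMENT ON COLUMN data_sources.record_type IS "
        "'Illustrative description of what this column stores'"
    )
conn.close()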

The rationale is that this would help keep documentation up-to-date and dynamic, and give us a single source of truth for it.

The SQL script involved could be, per ChatGPT:

SELECT 
    cols.table_schema,
    cols.table_name,
    cols.column_name,
    cols.data_type,
    pgd.description AS column_comment
FROM 
    information_schema.columns cols
LEFT JOIN 
    pg_catalog.pg_statio_all_tables AS st 
    ON cols.table_schema = st.schemaname AND cols.table_name = st.relname
LEFT JOIN 
    pg_catalog.pg_description pgd 
    ON pgd.objoid = st.relid AND pgd.objsubid = cols.ordinal_position
WHERE 
    cols.table_schema NOT IN ('information_schema', 'pg_catalog')
    AND cols.table_name = 'your_table_name';  -- Replace 'your_table_name' with the actual table name
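
To illustrate step 2, here is a minimal sketch of how a slightly trimmed version of that query could back a GET that lists every creatable column with its type and comment. psycopg2 and the helper name get_creatable_columns are assumptions, not existing code.

import psycopg2

COLUMN_INFO_SQL = """
SELECT cols.column_name, cols.data_type, pgd.description AS column_comment
FROM information_schema.columns cols
LEFT JOIN pg_catalog.pg_statio_all_tables AS st
    ON cols.table_schema = st.schemaname AND cols.table_name = st.relname
LEFT JOIN pg_catalog.pg_description pgd
    ON pgd.objoid = st.relid AND pgd.objsubid = cols.ordinal_position
WHERE cols.table_schema NOT IN ('information_schema', 'pg_catalog')
  AND cols.table_name = %s;
"""

def get_creatable_columns(conn, table_name="data_sources"):
    """Return name, data type, and comment for every column of the given table."""
    with conn.cursor() as cur:
        cur.execute(COLUMN_INFO_SQL, (table_name,))
        return [
            {"column_name": name, "data_type": data_type, "description": comment}
            for name, data_type, comment in cur.fetchall()
        ]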
maxachis commented 2 months ago

@josh-chamberlain This is one where I'm especially curious about your thinking on the matter. From my vantage point, it should help us keep documentation up-to-date and dynamic, and ensure we have a single source of truth for it.

josh-chamberlain commented 2 months ago

@maxachis I responded in Discord, but I don't know that we resolved it to your satisfaction.

I think best practice is for each CRUD function to be an HTTP request on the same endpoint, data-sources GET and data-sources POST. I don’t think /add or /all would be needed.

From what I understand, POST usually creates one record, and GET often fetches many; maybe we should have the GET fetch one by ID (like for the data source details page) and use the search endpoint for returning bulk sources. We're always going to want to attach params to a bulk search, so that endpoint might serve all our needs.

The priority is to make a fairly standard and nicely working API for our users/front end, and accept whatever documentation overhead Swagger leaves us with.

So, I think I disagree with your step 1, but would steps 2 and 3 still work?

maxachis commented 2 months ago

@josh-chamberlain We might need to come up with an alternative endpoint to retrieve detailed column information if we don't want to use the /add or /all approach. Maybe something like data_sources/fields or data_sources/data_dictionary, which could retrieve all the fields per steps 2 and 3. If we assume that everything in the data_sources namespace works with the same table and the same set of columns, that'd be doable. We'd just need to make sure that the fields returned by each endpoint share the same names, so we don't end up with the same underlying field called one thing in one endpoint and something else in another.

That way, the information currently conveyed in the data dictionary documentation would be served from an endpoint, driven directly by the data in the relevant columns.
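
One possible shape for such an endpoint, reusing the hypothetical get_creatable_columns helper sketched earlier in the thread; the route name and connection handling are illustrative rather than settled.

import psycopg2
from flask_restx import Namespace, Resource

# The namespace would be registered with api.add_namespace(ns) in the app setup.
ns = Namespace("data-sources", description="Data source operations")

def get_connection():
    # Placeholder; the real app presumably has its own connection handling.
    return psycopg2.connect("dbname=data_sources user=postgres")

@ns.route("/data_dictionary")
class DataDictionary(Resource):
    def get(self):
        """List every data_sources column with its type and description."""
        # get_creatable_columns() is the hypothetical helper from the earlier
        # sketch, wrapping the information_schema query.
        with get_connection() as conn:
            return get_creatable_columns(conn), 200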

josh-chamberlain commented 2 months ago

The idea of using an endpoint to let people get columns would mean that only people who can use the API have access to our schema, right?

If we're using swagger/openAPI with flask-restx, don't we get request and response bodies for free when we use annotations in the code? The schema for each endpoint would be in those samples, which can be auto-generated from the model.

I'm sorry to be stubborn—I think your goals for automation and clarity and source-of-truthness make sense, but they seem achievable without any special endpoints provided we annotate properly.

From our GPT friend

Define Your API and Models:

Use Flask-RESTX to define your API endpoints and data models with annotations. This ensures that your Swagger documentation is generated based on your code.

  1. Setting Up Flask-RESTX:
from flask import Flask
from flask_restx import Api, Resource, fields

app = Flask(__name__)
api = Api(app, version='1.0', title='Sample API',
          description='A sample API with automatic documentation')

ns = api.namespace('data-sources', description='Data source operations')

data_source = api.model('DataSource', {
    'id': fields.Integer(readOnly=True, description='The unique identifier of a data source'),
    'name': fields.String(required=True, description='The name of the data source'),
    'description': fields.String(description='A brief description of the data source'),
    'created_at': fields.DateTime(description='The creation timestamp')
})

@ns.route('/')
class DataSourceList(Resource):
    @ns.doc('list_data_sources')
    @ns.marshal_list_with(data_source)
    def get(self):
        '''List all data sources'''
        # Logic to list data sources
        return []

    @ns.doc('create_data_source')
    @ns.expect(data_source)
    @ns.marshal_with(data_source, code=201)
    def post(self):
        '''Create a new data source'''
        # Logic to create a data source
        return {}, 201

@ns.route('/<int:id>')
@ns.response(404, 'Data source not found')
@ns.param('id', 'The data source identifier')
class DataSource(Resource):
    @ns.doc('get_data_source')
    @ns.marshal_with(data_source)
    def get(self, id):
        '''Fetch a data source given its identifier'''
        # Logic to fetch a specific data source
        return {}

    @ns.doc('delete_data_source')
    @ns.response(204, 'Data source deleted')
    def delete(self, id):
        '''Delete a data source given its identifier'''
        # Logic to delete a specific data source
        return '', 204

    @ns.expect(data_source)
    @ns.marshal_with(data_source)
    def put(self, id):
        '''Update a data source given its identifier'''
        # Logic to update a specific data source
        return {}

if __name__ == '__main__':
    app.run(debug=True)

Key Points:

When you run your Flask application, Flask-RESTX automatically generates the Swagger UI, which you can access (by default) at http://localhost:5000/. This UI will display all your documented endpoints, their expected inputs, outputs, and other details.

Keeping Documentation Up-to-Date:

maxachis commented 2 months ago

> If we're using swagger/openAPI with flask-restx, don't we get request and response bodies for free when we use annotations in the code? The schema for each endpoint would be in those samples, which can be auto-generated from the model.
>
> I'm sorry to be stubborn—I think your goals for automation and clarity and source-of-truthness make sense, but they seem achievable without any special endpoints provided we annotate properly.

Please don't apologize for stubbornness! You are the primary restraint on me going crazy and trying to convert the entire codebase to Rust or something like that.

I have been using models quite liberally in the documentation -- the only issue is that those models have to be declared and described explicitly. If the fields we are returning change, whether because the underlying database is modified or we change how we present the fields, we have to remember to update those models as well. And when the models contain upwards of 30 (or 60) possible fields, that can be clunky. I have not, as of yet, found a way to auto-generate the models from the response content alone (although that would be a nifty idea).
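
For what it's worth, that "nifty idea" could in principle look something like the following: build the Flask-RESTX model programmatically from the column metadata returned by the information_schema query sketched earlier. This is a hypothetical approach, not something the codebase does today, and the type mapping is deliberately incomplete.

from flask_restx import fields

# Illustrative mapping from Postgres data types to Flask-RESTX field types;
# anything unmapped falls back to a plain string field.
PG_TO_RESTX = {
    "integer": fields.Integer,
    "bigint": fields.Integer,
    "boolean": fields.Boolean,
    "date": fields.Date,
    "timestamp with time zone": fields.DateTime,
}

def build_model_from_columns(api, name, columns):
    """Build an api.model() from rows shaped like the earlier column-info sketch:
    dicts with column_name, data_type, and description keys."""
    spec = {
        col["column_name"]: PG_TO_RESTX.get(col["data_type"], fields.String)(
            description=col["description"]
        )
        for col in columns
    }
    return api.model(name, spec)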

marshal_with can definitely help with simpler models, but it has some issues with more complex data models, such as dictionaries with other objects nested within them. There are other ways we can do marshalling to bypass those issues, but that may require additional third-party components such as Marshmallow.
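
For reference, flask-restx does provide fields.Nested and fields.List for dictionary- and list-valued fields; a minimal sketch with made-up field names is below, though it may still fall short for the messier cases mentioned above.

from flask import Flask
from flask_restx import Api, fields

app = Flask(__name__)
api = Api(app)

# Hypothetical nested object attached to a data source entry.
agency = api.model("Agency", {
    "name": fields.String(description="Agency name"),
    "state": fields.String(description="Two-letter state code"),
})

data_source_detail = api.model("DataSourceDetail", {
    "name": fields.String(required=True, description="Name of the data source"),
    "tags": fields.List(fields.String, description="Free-form tags"),
    # A dictionary-valued field is expressed as a nested model rather than a flat field.
    "agency": fields.Nested(agency, description="Agency the source belongs to"),
})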

Regardless, I think it's a minor issue on my part that I don't have a strong attachment to. If there's a lot of hesitation (for very good reasons, I hasten to add), I'm content to just drop it for now and see how things play out with documenting the models in the existing API documentation, rather than adding an additional endpoint.