maxachis commented 1 week ago

Currently, middleware code includes functions which interface directly with the database as well as functions which deal with the outputs from these database functions
Database-interfacing objects, such as cursors or connections, often pass through these intermediate functions before finally being used at the very end of a chain of functions.
When type hinting and documentation is involved, any change to these objects must then be reflected in the type hints and docstrings of every function they pass through.
This also adds clutter in function definitions, which in turn leads to clutter in testing
Finally, testing becomes more difficult because different kinds of functionality are located within the same module; the testing needs of something which interfaces with the database is distinct from that of middleware functions which call it. Having these located within the same files can cause confusion.

The solution to this is to create a Database Client which can encapsulate the logic for database interactions in a single class, which can be modified, mocked, and tested much more easily than a wide variety of disparate functions.

For the moment, this can still be passed through all the middleware functions, but those functions will likely be targeted for refactoring in the future soon as well.

TODO:

[x] Develop Database Client with existing middleware database logic.
[ ] Integrate DatabaseClient into existing methods and tests
[ ] Merge into dev.
[ ] Create new issue for refactoring some existing tests, so that only integration and db_client tests interact with the live database, and everything else uses mocks.
[ ] Similarly, create new issue for adding middleware tests to fill in gaps in existing coverage that will result from this separation.
[ ] Create new issue for improving some existing database client tests which, due to their nature, may require more complex logic than their current tests but should be handled separate to avoid bloat in this issue. See "Tests To Update", below.
[ ] Consider creating a separate issue for breaking up the query text itself into a separate file. See "Module for Query Text?"

josh-chamberlain commented 1 week ago

@maxachis

I think I'm following most of what you're saying here. I like the idea of separation of concerns, and this makes conceptual sense to me as a way to more cleanly govern logic about privacy and which data we return when.

I'm going to attempt to restate this to check if I understand:

in defining an API endpoint we're both fetching and parsing data with middleware functions
- then, we're using resources to define the actual endpoint
the idea is to add a database client (probably in utilities or something)
- i'm guessing certain common, generic data fetching tasks would happen here, filtered through our opinions about how data should be parsed (datetime uniformity, sources of truth for properties, those nice things)
then, our middleware's only responsible for parsing and business logic
documentation is simpler, and needs to be updated less frequently

Anything you'd change about that? If so, do you think you could provide an example to help illustrate?

Some example cases for discussion:

We have a definition of "public properties" and "hidden properties"—can we define that in one place in the code so endpoints are less likely to expose stuff we don't want exposed?
Same for "public data" vs "hidden data", i.e. almost 100% of the time we only show anyone data sources/agencies with approval_status is approved and url_status is ok. The only times we want sources without those properties are for workflows meant to approve submissions or check status automatically.

Would we handle that logic in the db client? middleware?

maxachis commented 1 week ago

@josh-chamberlain Let me tell you about what I've been developing, captain! This should help give an idea of how I'm thinking about this.

class DatabaseClient:

    def __init__(self, cursor: psycopg2.extensions.cursor):
        self.cursor = cursor

    def add_new_user(self, email: str, password_digest: str):
        self.cursor.execute(
            f"insert into users (email, password_digest) values (%s, %s)",
            (email, password_digest),
        )

    def get_user_id(self, email: str):
        self.cursor.execute(
            f"select id from users where email = %s", (email,)
        )
        return self.cursor.fetchone()[0]

The DatabaseClient will be a class devoted to performing one type of action -- interacting with the database. It will contain methods which take as parameters only what is needed to perform a particular database action, and will have as attributes only what is needed to interface with the database. Tests on the Database Client will also be streamlined -- each one will test only that it performs its intended action as expected with the live database.

Here is how it's currently being implemented in my working branch:


def user_post_results(db_client: DatabaseClient, email: str, password: str) -> None:
    """
    Creates a new user with the provided email and password.

    :param db_client: A DatabaseClient object.
    :param email: The email address of the new user.
    :param password: The password for the new user.
    """
    password_digest = generate_password_hash(password)
    db_client.add_new_user(email, password_digest)

The DBClient object is passed to a given function. It abstracts the details of how it works with the database -- meaning we can change those details (for example as in #303) without worrying about the other logic being affected. Logic not requiring interaction with the database, such as password generation logic, is kept outside of the DBClient. I'm not sure if I would ultimately want to have the DBClient be passed into the functions, as opposed to it being a dependency injected into a class, but that's for a later discussion.

We have a definition of "public properties" and "hidden properties"—can we define that in one place in the code so endpoints are less likely to expose stuff we don't want exposed? Same for "public data" vs "hidden data", i.e. almost 100% of the time we only show anyone data sources/agencies with approval_status is approved and url_status is ok. The only times we want sources without those properties are for workflows meant to approve submissions or check status automatically.

The Database Client would definitely be a step towards being able to define these more consistently. If/when we migrate to an ORM as in #303, we'd be able to be more explicit within our code about what would constitute public versus hidden properties of certain tables (and I can explain more about what I mean by that if you'd like).

However, in some cases, some of that may make more sense being done on the database side in a separate issue. For example, we could apply column masking to ensure that certain roles cannot view certain columns in certain tables. We could break up certain tables so that the hidden attributes are in a separate linked table with different access permissions. We could create views which conceal certain columns. We could implement database-side encryption. And so on.

josh-chamberlain commented 1 week ago

@maxachis nice, thanks for talking it through. this makes sense!

maxachis commented 1 week ago

@EvilDrPurple Will be teaming up with me on this one!

maxachis commented 2 days ago

@josh-chamberlain I know you wanted me to prioritize #249 but since this was in the process of being worked on, and this work will necessarily overlap with the work in that issue, I think it makes sense to complete this before going more deeply into that issue. It should only take me a few days, aided considerably by @EvilDrPurple 's help on the matter. If, however, you would like me to put that issue on ultra-priority, I'll happily redirect attention.

josh-chamberlain commented 2 days ago

@maxachis I think this is important for building new endpoints anyway, all good! Thank you

maxachis commented 2 days ago

Tests To Update

test_get_approved_data_sources: Currently tests that it returns a nonzero amount of data and that it returns the expected number of columns. However, test doesn't currently test if the data sources returned are approved or not. Considering this query can return upwards of 2000 data sources, it may be impractical to insert fake approved/un-approved data and confirm that only the expected data sources are returned, particularly as the number of data sources grow. May require testing on an empty set of tables, but unsure.

maxachis commented 2 days ago

Module For Query Text?

The database client is quite a large beast, currently possessing over 700 lines of code. A lot of this is the text for each of the SQL queries. It may be prudent to break this up into another separate module which calls these queries, so as to further separate the logic between producing the SQL text and the boilerplate code to actually execute it.

However, while I am leaning towards this, I'm not 100% sold on the necessity of doing this for the following reasons:

I don't want to have an already substantial pull request take longer than it currently is.
For some smaller queries, or those that involve parameters, this may make the code less clear than simply having it in one place. For example, it may be more trouble than its worth to extract the following query text:
```
self.cursor.execute(
        f"select id, token from access_tokens where token = %s", (api_key,)
    )
```

Note, however, that I did create a DynamicQueryConstructor class for handling the cases where queries are dynamically generated, as in the case of get_approved_data_sources, because that logic I think merited separation at this stage.

josh-chamberlain commented 2 days ago

Module for Query Text?

Some files are just long, and that's ok; it seems there's a benefit to working in one file. I don't think it's worth optimizing until we've used it a bit, at which point it will be more clear what to do. Agreed about the dynamic query constructor.

Approval test

We could include the approval status in the properties returned by the endpoint, and throw a flag if we get any that aren't approved?

maxachis commented 1 day ago

Approval test

We could include the approval status in the properties returned by the endpoint, and throw a flag if we get any that aren't approved?

@josh-chamberlain That's definitely one of the quicker ways to do it in terms of development, but it would mean we'd have to check each one individually, which would impact performance more as the number of data sources increase. On the other hand, if we can test it outside of the function, we'd be able to do so in a way that is less variable depending on how many results we have.

The main challenge is that testing to see that queries returning large amounts of data perform as expected in a database that mirrors production, and testing to see that those queries are correct in terms of the results they return, are tests that are somewhat at odds with each other in terms of the environments that work best for them. This challenge will hold true for all our queries which can return a large number of results.

One solution (among several) is to have two versions of the databases in stage and sandbox:

One which contains all the data from the production database (exempting sensitive tables in sandbox)
One which contains the schemas but no data, and where data is expected to be removed at the end of every test.

1 enables us to test that queries perform as expected in production-similar environment, while 2 enables us to test for query correctness using carefully-crafted test cases without worrying about interference or performance loads.

This wouldn't require additional droplets -- just additional databases in the same servers. Because they're meant to be empty, they wouldn't take much additional space and therefore are unlikely to substantially impact performance. And because I already have experience setting these up, they wouldn't take too much time to modify the migration schemas.

But it is development time and overhead and would require modified testing logic. Hence I'm not advocating this as something we need to do right away or anything. But if we want to cover certain test cases without having the performance of the tests be impacted by the number of data in the database, this is an option that would do that.

I should additionally add that right now, the time to execute a test that iterates through results and checks if certain results are/aren't present isn't all that onerous -- at most a second or two, and often not even that. So while I think this or something like it is warranted as our data sources grow, it's not an imminent concern.

Police-Data-Accessibility-Project / data-sources-app

Create Database Client to Encapsulate Database Interactions #337

TODO:

Tests To Update

Module For Query Text?