Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

Create database simulation for testing #65

Closed: maxachis closed this issue 3 months ago

maxachis commented 5 months ago

Context

We'd probably benefit from simulating our database in at least some of our integration tests that interface with the PDAP database. As the pipeline becomes more sophisticated (and more hands get on it), we'll want to make sure we can verify proper database interfacing in a replicable manner. Otherwise, we run the risk of seemingly innocuous changes doing unpleasant and sinful things to our database.

There's probably no way around that which doesn't involve a docker container that our integration tests then connect to. It'd likely be impractical to run these as often as unit tests, but they should probably be run before making any substantive changes.
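
As a rough sketch of the "not as often as unit tests" point, one way to arrange this (assuming pytest; nothing below exists in the repo yet, and the marker and flag names are placeholders) would be to mark the database tests and skip them unless a flag is passed:

# conftest.py -- hypothetical sketch; marker and flag names are placeholders
import pytest

def pytest_addoption(parser):
    # Adds an opt-in flag so the dockerized-database tests don't run by default.
    parser.addoption(
        "--run-integration",
        action="store_true",
        default=False,
        help="run tests that require the database container",
    )

def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-integration"):
        return
    skip_marker = pytest.mark.skip(reason="needs --run-integration")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip_marker)

Plain pytest would then skip the database tests, while pytest --run-integration runs them against the container.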

Requirements

Tests

Docs

Open questions

maxachis commented 5 months ago

@mbodeantor To properly import the schema (using the pg_dump command), I'll need SELECT permissions for all tables and sequences in the schema, granted through a script such as:

GRANT SELECT ON ALL TABLES IN SCHEMA public TO max;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO max;

Let me know if that's doable.
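
For reference, the schema-only export I have in mind would look roughly like this (a sketch only; it assumes pg_dump is installed locally and that connection details come from environment variables, none of which are the real values):

# export_schema.py -- hypothetical helper, not part of the repo
import os
import subprocess

cmd = [
    "pg_dump",
    "--schema-only",   # structure only, no rows
    "--no-owner",      # skip owner-specific statements in the dump
    "--host", os.environ["PDAP_DB_HOST"],
    "--username", os.environ["PDAP_DB_USER"],
    "--dbname", os.environ["PDAP_DB_NAME"],
    "--file", "schema_dump.sql",
]
subprocess.run(cmd, check=True)  # assumes PGPASSWORD is set in the environment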

mbodeantor commented 5 months ago

@maxachis should be g2g now

maxachis commented 5 months ago

@mbodeantor @josh-chamberlain I've set up a PR for a proof-of-concept for this form of database testing.

Setting this up was quite the PITA, due to the combination of PostgreSQL, some less-than-well-documented python database testing modules, a remote database, and a docker container, to say nothing of having to create a docker-in-docker container to make sure it worked on a rig that wasn't my own. I've done my best to provide ample documentation and streamline where possible.

josh-chamberlain commented 5 months ago

@maxachis this is mainly coming from a place of ignorance, but—does this introduce new variables which don't exist in production? For example, I'm not sure why it has to be a Docker container as opposed to being a clone in all respects. In the past, I've seen a streamlined strategy involving a follower database where the API is pointed there instead of production for the purposes of testing. That's also where the dev environment was sourcing its data.

Also: I think we should be interacting with the database using the API middle layer, rarely (if ever) directly modifying the database. Is this what you had in mind?

maxachis commented 5 months ago

@josh-chamberlain The docker container is intended to be a simple way to simulate the database locally. The container in question simulates a PostgreSQL server, and the related scripts populate it with the production schema. It's effectively a clone except that it's not hosted on Digital Ocean and the Port/DB/Host/etc., naturally, are different. It wouldn't be populated with data -- it would instead be a blank database with the required schema, and any data populated would be limited to the test in question.
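
For illustration, the general shape of the setup (a sketch only, not the actual PR code; it assumes psycopg2, a Postgres container already listening on localhost:5432, and a schema_dump.sql produced by pg_dump, with placeholder credentials throughout) looks something like:

# conftest.py -- hypothetical sketch of the fixture arrangement
import psycopg2
import pytest

@pytest.fixture(scope="session")
def db_connection():
    # Connect to the locally running Postgres container (placeholder credentials).
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="test_db",
        user="test_user",
        password="test_password",
    )
    # Load the production schema (structure only) into the blank database.
    with conn.cursor() as cur, open("schema_dump.sql") as schema:
        cur.execute(schema.read())
    conn.commit()
    yield conn
    conn.close()

@pytest.fixture
def db_cursor(db_connection):
    # Each test gets its own cursor; roll back afterwards so any test data
    # stays limited to the test in question.
    with db_connection.cursor() as cur:
        yield cur
    db_connection.rollback()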

A dev database could also be used, although I may need clarification on the preferred database strategy. Some of my recent changes do involve direct updates to the database, and I'm more familiar with standard CRUD operations in SQL than with the API, so there would be a learning curve in figuring that out.

mbodeantor commented 5 months ago

@maxachis This is currently accomplished with a SQLite database and a script that loads test data to query against.

maxachis commented 5 months ago

@mbodeantor This exists for testing data-sources-app, but nothing comparable exists in data-source-identification.

Additionally, if done with a SQLite database, relevant questions are:

  1. Does the SQLite database sufficiently emulate the PostgreSQL environment? Some scripts that work in one will not work in the other, particularly ones developed in the future (a toy example follows this list). We do not want something to pass in development only for it to fail in prod.
  2. Does the SQLite database import the most recent production schema? If not, it may become outdated. Additionally, it is not a straightforward process to import a PostgreSQL schema into SQLite -- data types, syntax, etc. differ between the two. Looking at the current version of the script in data-sources-app, the creation file do_db_ddl_clean.sql is static, creating another variable that must be separately maintained as the database develops.
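
A toy example of point 1 (assuming psycopg2 on the PostgreSQL side and the standard-library sqlite3 module on the SQLite side; the table and data are made up):

# Valid PostgreSQL, but SQLite rejects the same query text.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agencies (name TEXT)")
conn.execute("INSERT INTO agencies VALUES ('Pittsburgh Bureau of Police')")

try:
    # ILIKE and psycopg2-style %s placeholders both work against PostgreSQL...
    conn.execute("SELECT name FROM agencies WHERE name ILIKE %s", ("pittsburgh%",))
except sqlite3.OperationalError as error:
    # ...but SQLite raises a syntax error, so a query that passes local tests
    # could still fail in production (or vice versa).
    print("SQLite rejected the query:", error)

# sqlite3 also expects '?' placeholders rather than '%s'.
conn.execute("SELECT name FROM agencies WHERE name LIKE ?", ("Pittsburgh%",))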

mbodeantor commented 5 months ago

@maxachis I would say these are not really issues right now and it would make sense to hold off until we see more regular database changes

maxachis commented 5 months ago

@mbodeantor @josh-chamberlain In which case, this problem becomes intertwined with other issues I'm working on, which as designed would involve regular database changes.

My current plan in #76 is one such example. However, there are alternatives to directly interfacing with the production database, such as maintaining repo-level data stores or using Hugging Face.

These come with their own downsides, such as commit bloat and an extra variable to manage for repo-level data, or having to develop a system to manage batch-level uploads to Hugging Face when batch processing fails partway through (without having to restart the whole batch). And in both cases, querying the data on a more atomic basis becomes more fraught unless we end up importing the data into a database anyway.

Regardless, the system, as designed in #76, can produce 1000 rows of data per day. It will require:

  1. A means to avoid repeating searches
  2. A means to queue up queries so that each new set of google searches desired does not require changes to the repository
  3. A means to store the resultant data and be able to sift through it, especially at high volume.

That system can be altered, and the data storage needs correspondingly modified, but as is, I do think cloud database storage, whether through the API or via direct SQL calls, provides the simplest and most robust solution to these requirements.

mbodeantor commented 5 months ago

@maxachis I think creating the new tables in the same database is fine. I would just create a version of the do_db_ddl_clean.sql file for this repo and its tests, with the relevant table creation scripts.

maxachis commented 5 months ago

@mbodeantor @josh-chamberlain I will proceed with creating a version of do_db_ddl_clean.sql and testing it on SQLite.

However, I will reiterate that if any of the table structures change on the Digital Ocean server, such scripts will need to be modified as well, or functionality which passes testing in dev will fail in production. There is considerable risk that someone will not realize that the setup SQL script and the production environment are out of alignment until the code is pushed to production or is operating on production databases. Additionally, SQLite will not perfectly emulate the PostgreSQL database.

If we don't anticipate table structures changing much, that's not a sizeable risk, and with backups any error should be recoverable by rolling back. But the risk will remain.
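
One way to surface that drift early (a sketch only; it assumes the dev/test database is also PostgreSQL, the DSNs are placeholders, and nothing like this exists in the repo yet) would be a periodic check comparing column definitions between the two environments:

# schema_drift_check.py -- hypothetical guard, assuming read-only access to both databases
import psycopg2

COLUMNS_QUERY = """
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, column_name
"""

def fetch_columns(dsn):
    # Collect (table, column, type) triples for the public schema.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(COLUMNS_QUERY)
        return set(cur.fetchall())

def test_schemas_match():
    prod = fetch_columns("postgresql://readonly@prod-host/pdap")     # placeholder DSN
    dev = fetch_columns("postgresql://test_user@localhost/test_db")  # placeholder DSN
    assert prod == dev, f"schema drift detected: {prod ^ dev}"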

josh-chamberlain commented 4 months ago

@maxachis @mbodeantor I think many of these concerns are aided by following this principle:

...interacting with the database using the API middle layer, rarely (if ever) directly modifying the database.

The endpoints are designed for security, repeatability, data integrity, scalability, and to make sure results are cleaner and more usable (we strip out rejected sources and agencies, etc). This might also help keep our testing in line with already-established patterns.

I left a comment to that effect on #76 with an idea about how to proceed in that specific case—adding a table to the database and a couple endpoints to use the table.

Either way, a test database seems to make sense, as does using PostgreSQL and other configuration matching production...personally, if we're going to make changes here, I'm more a fan of a deployed testing/dev database that matches our production configuration than of docker, which introduces a new variable.

Either way, we will need to have a process which encourages us to mirror the production and testing database conditions. There's a procedure for schema changes here, which mostly flow downstream from Airtable.

Thoughts? @mbodeantor @maxachis if we keep going back and forth on this, let's just schedule a call to figure it out in one go

maxachis commented 4 months ago

@josh-chamberlain I've no objections to any of this! The primary curveball is that I'll need to familiarize myself with the API before I can put some of this into action, which will slow me down temporarily as I acclimate to that.

In terms of the procedure for schema changes, it may make sense to set up a GitHub Action to which we can pass SQL schema changes and have them run automatically on approval, or something comparable. This would help make the changes a bit more concrete by including some self-documenting code.
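
As a very rough sketch of what such an action could call (assuming migrations live as plain .sql files in a migrations/ directory and the connection string comes from a repository secret; none of this exists yet):

# apply_migrations.py -- hypothetical script a GitHub Action could run on approval
import os
import pathlib
import psycopg2

def apply_migrations(dsn, migrations_dir="migrations"):
    # Apply each .sql file in lexical order; the filenames then double as a
    # self-documenting record of what changed. A real version would also track
    # which migrations have already been applied.
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            for path in sorted(pathlib.Path(migrations_dir).glob("*.sql")):
                cur.execute(path.read_text())
                print(f"applied {path.name}")
    finally:
        conn.close()

if __name__ == "__main__":
    apply_migrations(os.environ["PDAP_DB_DSN"])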

mbodeantor commented 4 months ago

@maxachis @josh-chamberlain I agree the simplest way to create a dev database would be to replicate the prod one and then have the Airtable mirror also update dev. This won't capture any direct changes to the prod db, but it should be sufficient for test data, and the process should be to update dev as well.

For the API endpoints, we should be able to repurpose the /data-sources GET and POST endpoints for the join to agencies.
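
For what it's worth, using that layer from the identification scripts could look roughly like this (a sketch; the base URL, auth header, and payload fields are placeholders, and only the /data-sources path comes from the comment above):

# Hypothetical client-side use of the API middle layer; field names are illustrative.
import os
import requests

BASE_URL = "https://example-pdap-api.invalid"  # placeholder, not the real host
HEADERS = {"Authorization": f"Bearer {os.environ['PDAP_API_KEY']}"}  # assumed auth scheme

# Read data sources through the API instead of querying the database directly.
response = requests.get(f"{BASE_URL}/data-sources", headers=HEADERS, timeout=30)
response.raise_for_status()
sources = response.json()

# Submit a newly identified source through the same layer.
new_source = {"name": "Example PD crime log", "source_url": "https://example.invalid/log"}
requests.post(f"{BASE_URL}/data-sources", json=new_source, headers=HEADERS, timeout=30).raise_for_status()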

maxachis commented 4 months ago

This is now being addressed in https://github.com/Police-Data-Accessibility-Project/data-sources-app/issues/271 and hence can be closed here.

josh-chamberlain commented 3 months ago

just reopening until the PR is merged; I'll adjust the PR so that it'll auto-close the issue

josh-chamberlain commented 3 months ago

oh, I see.