Repository for HFP (high-frequency positioning) analytic tools. More info about HFP: https://digitransit.fi/en/developers/apis/4-realtime-api/vehicle-positions/.
Currently, we have no test automation to ensure that old and new features of this tool work as expected when new commits or PRs are introduced. I think we should start creating automated tests soon, before we advance too far without them and accumulate technical debt. This is particularly important if the tool takes on a larger role than small-scale data quality monitoring in the near future.
At this point, I'm not suggesting detailed unit tests for the Python code, although we should recognize if they become useful later. As for the database model, it luckily "tests itself" at this level, in a way: when started from scratch without an existing db volume, the `db` service fails if any DDL SQL statement fails, e.g. due to an incorrectly named foreign key field or a syntax error in a view definition.
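If we ever want to exercise that implicit check in CI, it could be wrapped in a tiny smoke test. The sketch below is only an illustration of the idea: it assumes a docker compose setup where the `db` service runs its DDL on first startup (e.g. the official Postgres image entrypoint) and therefore never becomes ready if a statement fails; the flags, timings and `pg_isready` polling are assumptions, not our actual configuration.

```python
"""Sketch: smoke-check that the database schema builds from scratch."""
import subprocess
import time


def test_db_schema_builds_from_scratch():
    # Recreate the db service so the DDL runs again. --renew-anon-volumes only
    # covers anonymous volumes; a named volume may need an explicit
    # `docker compose down -v` first (assumption about our compose setup).
    subprocess.run(
        ["docker", "compose", "up", "-d", "--force-recreate", "--renew-anon-volumes", "db"],
        check=True,
    )
    # Poll pg_isready inside the container. If any DDL statement failed, the
    # service never becomes ready and this test times out instead of passing.
    for _ in range(30):
        ready = subprocess.run(["docker", "compose", "exec", "-T", "db", "pg_isready"])
        if ready.returncode == 0:
            return
        time.sleep(2)
    raise AssertionError("db service did not become ready; check its logs for DDL errors")
```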
Rather, I would first automate the most obvious cases that we currently test manually: importing raw data into the database, transforming it (stop correspondence analysis for now), and checking that the results look as expected. I'd split the tests into two categories so that we can run either one or both of them:
1) Integration from Azure Blob Storage raw data files to database import -> this involves an external dependency and network I/O, and the external data can (in theory) change or be deleted over time. Not something I would like to run every minute as a developer, but it could be run by GitHub Actions as a PR check, for example. Purpose: to ensure that the Blob Storage specific parts of our tool work correctly. (A rough sketch follows after this list.)
2) Integration from raw .zst files to database import and, further, to normalized and analyzed data inside the database, finally queried from the API. Something I would like to run frequently -> it should be quick and easy to run locally. It could be based on a minimal set of .zst files in `testdata/`, also included in version control, and run by GitHub Actions as a PR check as well. Purpose: to ensure that the data flow inside the tool works correctly, all the way from raw data to the analysis results exposed through the API.
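For case 1, a first sketch could look roughly like the pytest module below. The environment variable, container name and blob prefix are placeholders for illustration, not our actual configuration; the test skips itself when no credentials are set, so it only runs where they are configured (e.g. in the GitHub Actions PR check).

```python
"""Sketch of a case 1 test: raw data is reachable in Blob Storage and downloads cleanly."""
import os

import pytest
from azure.storage.blob import ContainerClient  # pip install azure-storage-blob

# Hypothetical names: env var, container and prefix would be replaced with the real ones.
CONN_STR = os.environ.get("HFP_BLOB_CONNECTION_STRING")
CONTAINER = "hfp-raw"
BLOB_PREFIX = "2023-10-01/"

# Skip locally when credentials are not configured.
pytestmark = pytest.mark.skipif(CONN_STR is None, reason="Blob Storage credentials not set")


def test_raw_blobs_can_be_listed_and_downloaded():
    client = ContainerClient.from_connection_string(CONN_STR, CONTAINER)
    blob_names = [b.name for b in client.list_blobs(name_starts_with=BLOB_PREFIX)]
    assert blob_names, "expected at least one raw data blob under the test prefix"

    # Download one blob and check it is non-empty; the real test would go on to
    # feed it to our importer and assert on row counts in the database.
    payload = client.download_blob(blob_names[0]).readall()
    assert len(payload) > 0
```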
At least case 2 means multiple test cases and files, of course, but we can start small now and preferably grow the test suite in sync with the actual code whenever new features are introduced. Perhaps even in a TDD way.
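For case 2, a minimal first test could look something like the sketch below. The importer entry point, table name, API endpoint, connection details and test file name are all hypothetical placeholders to show the shape of the test, not the tool's real interfaces.

```python
"""Sketch of a case 2 test: minimal .zst file -> database -> analysis results via the API."""
from pathlib import Path

import psycopg2   # assumed DB driver; adjust to whatever the project actually uses
import requests

# Hypothetical importer entry point; replace with the tool's real import function or CLI.
from importer import import_raw_file

TESTDATA = Path(__file__).parent / "testdata"
DB_DSN = "postgresql://postgres:postgres@localhost:5432/hfp"  # assumed local dev database
API_URL = "http://localhost:8000"                             # assumed local API address


def test_zst_import_produces_analyzed_results():
    # 1) Import a minimal, version-controlled raw file into the database.
    import_raw_file(TESTDATA / "minimal_sample.zst", dsn=DB_DSN)

    # 2) Check that normalized rows appeared (table name is an assumption).
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM observed_journey;")
        assert cur.fetchone()[0] > 0

    # 3) Query the analysis results through the API and sanity-check their shape.
    resp = requests.get(f"{API_URL}/stop_correspondence", timeout=10)
    resp.raise_for_status()
    assert isinstance(resp.json(), list)
```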