DurhamARC / ManyFEWS

The Many Flood Early Warning System
GNU General Public License v3.0

Batch database queries using bulk_create_manager.py #92

Closed sjmf closed 1 year ago

sjmf commented 1 year ago

Database transaction efficiency can be greatly improved by batching database queries instead of running individual INSERT statements for every model created by the Django ORM. This is a supported feature in Django, and I have added a helper class to support this (bulk_create_manager.py). I implemented this in the CSV reader process of the "Load parameters" task, and the speed-up was approximately 5x (~07:20 before, ~02:12 for batches of 10k queries).
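
For anyone following along, a minimal sketch of the pattern the helper uses is below. This is illustrative only; the real bulk_create_manager.py may differ in naming and detail.

from collections import defaultdict

from django.apps import apps


class BulkCreateManager:
    # Queue unsaved model instances and flush each queue with a single
    # bulk_create() call once it reaches chunk_size, instead of issuing
    # one INSERT per instance.
    def __init__(self, chunk_size=10_000):
        self._create_queues = defaultdict(list)
        self.chunk_size = chunk_size

    def _commit(self, model_class):
        model_key = model_class._meta.label
        model_class.objects.bulk_create(self._create_queues[model_key])
        self._create_queues[model_key] = []

    def add(self, obj):
        # Queue an unsaved instance; flush its model's queue when full.
        model_class = type(obj)
        model_key = model_class._meta.label
        self._create_queues[model_key].append(obj)
        if len(self._create_queues[model_key]) >= self.chunk_size:
            self._commit(model_class)

    def done(self):
        # Flush any partially filled queues at the end of the run.
        for model_name, objs in self._create_queues.items():
            if objs:
                self._commit(apps.get_model(model_name))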

The next obvious gain from this optimisation work is in the "Run Flood Model" task, where every prediction is currently inserted into the database individually (see flood_risk.py#95). These inserts could be done in batch sizes of 10k, where I would predict a similar speed-up, potentially reducing a task which currently takes around an hour to about 12 minutes.
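
Applied to that loop, the change would look roughly like this; the model and field names here are placeholders rather than the actual flood_risk.py code:

# Before: one INSERT per prediction
# DepthPrediction(date=date, parameters_id=params_id, depth=depth).save()

# After: queue instances and let the manager flush them in chunks of 10k
bulk_mgr = BulkCreateManager(chunk_size=10_000)
for date, params_id, depth in predicted_depths:
    bulk_mgr.add(DepthPrediction(date=date, parameters_id=params_id, depth=depth))
bulk_mgr.done()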

The aggregate_flood_models_by_size task could be similarly optimised, per flood_risk.py#182.

gernathlub commented 1 year ago

Hi, I have experience with Django optimizations and I'd like to start working on this issue if possible, thanks.

sjmf commented 1 year ago

Hi @gernathlub ! It would be amazing and very welcome if you'd like to contribute, and your Django expertise sounds relevant to what we need. We don't have a CONTRIBUTING.MD yet as you're the first to offer to help on this project. I'll try and put one together.

We'll work via forks and pull requests, as these will run the CI unit tests against your changes. We follow the black coding style which runs on a pre-commit hook in git: you can set that up by running pre-commit install. I will review your changes and we can have a conversation about them.

Our DEVELOPMENT.md guide may be a good place to get you set up and started with the project. Alternatively, you can build and run the Docker containers to test any changes.

My working hours are 09:00-17:00 GMT on weekdays so I will respond to issues during these times.

sjmf commented 1 year ago

@gernathlub: Regarding timescales, I will be able to work on this issue in a couple of weeks' time. Do you have an idea of how much time you'd like to dedicate to this and when you plan to work on it? No pressure at all as any contributions are appreciated– if I do get to working on it first I'll ask you for feedback. Please keep an eye on this issue as I'll let you know.

I also wondered if the optimisation strategy I have outlined above sounds like a good approach to you?

gernathlub commented 1 year ago

Hi @sjmf, thank you for the welcome and the brief introduction. I am familiar with the type of workflow you mentioned, thanks. As I usually work 08:00-16:00 GMT on weekdays, I would like to work on this issue during evenings and weekends, and I'll try to finish it as soon as possible (ideally within the next week).

Could I ask what naming convention you prefer for new branches and pull requests?

I didn't have time to fully inspect the whole flow, but in my opinion bulk operations are a good approach to optimising performance. It would also be good to use a bulk delete (just the delete() method called on the whole queryset) in the function we are trying to optimise in this issue (flood_risk.py#105). We could also experiment a bit with the batch size, which might give a smaller additional improvement.
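
Roughly what I mean, with placeholder model and field names:

# Before: one DELETE statement per object
# for obj in DepthPrediction.objects.filter(date=date, parameters_id=params_id):
#     obj.delete()

# After: a single DELETE ... WHERE statement over the whole queryset
DepthPrediction.objects.filter(date=date, parameters_id=params_id).delete()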

sjmf commented 1 year ago

@gernathlub That's really great, thank you so much.

We don't currently have a set naming structure for branches, but in general feel free to create a feature branch for any work. Something like django-optimisation and so on.

I've just pushed some previously uncommitted local changes as commit 41a4649 on the devel branch, which extend the BulkCreateManager class (bulk_create_manager.py) with the ability to do UPDATE operations (batch_update()). From a quick "eyeball" profiling of database operations*, it looks like SELECT operations now take up the most time of any operation, but it's not immediately apparent to me how these would be batched.
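
For illustration, Django's built-in bulk_update() covers similar ground; a rough sketch is below (the field names and the recompute_depth() helper are hypothetical, not code from this repository):

# Collect modified instances, then write them back in chunks: each chunk
# becomes a single UPDATE ... CASE WHEN statement rather than N UPDATEs.
to_update = []
for prediction in predictions:
    prediction.depth = recompute_depth(prediction)  # hypothetical helper
    to_update.append(prediction)

DepthPrediction.objects.bulk_update(to_update, ["depth"], batch_size=10_000)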

I completely agree with your comment on bulk delete: that is another good area for optimisation. I don't know how often DELETE operations will be made on the database in production, though, as the line you mention in flood_risk.py (now at line #L117) will only be called when the filter function just above it returns existing objects.

* The "eyeball" profiling alias, for reference:

alias pqmon='while :; do clear && date && docker compose exec -e "PGPASSWORD=manyfews" postgres psql -U manyfews manyfews -c "SELECT datid, pid, LEFT(query,101) AS query, usename, application_name FROM pg_stat_activity;"; sleep 5; done;'

gernathlub commented 1 year ago

@sjmf I see we had a similar idea. In the meantime I created a very similar change, but with a slightly different approach, so I opened a PR (marked as WIP, as other features will be added) so you can decide which solution you like better 😃. Tomorrow I can add some tests, as I noticed this method isn't covered yet.

Quick question regarding tests: have you considered adding some fixtures or a factory for creating DB objects while testing?

sjmf commented 1 year ago

I like your change; it is less verbose than mine but, yes, very similar.

Regarding tests, that would be very much appreciated. It is definitely possible to export some data into a fixtures file to run tests of these functions against. I will look into that as soon as I have time over the coming week. I have already been using fixtures locally as a way of populating the database with some default data in my Docker development environment, so it should be easy to replicate that approach.
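
For reference, the standard Django approach would be something like the sketch below; the app path and file name are placeholders, so don't take them literally:

# Export fixture data once, e.g.:
#   python manage.py dumpdata calculations.FloodModelParameters --indent 2 \
#       > calculations/fixtures/parameters.json
from django.test import TestCase

from calculations.models import FloodModelParameters  # placeholder app path


class FloodModelParametersTests(TestCase):
    # Django loads these fixture files into the test database for each test
    fixtures = ["parameters.json"]

    def test_parameters_loaded(self):
        self.assertTrue(FloodModelParameters.objects.exists())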

gernathlub commented 1 year ago

@sjmf Thanks. Sorry, I haven't had much time lately to add the promised tests (I will try to do it on Wednesday). In the meantime, could you please check the PR I mentioned before and let me know if you want me to add or change anything? Thanks.

sjmf commented 1 year ago

@gernathlub I had a look at the PR, and it looks fine. I have not merged it yet, as I assume you would like me to wait for feature completion since you marked it as WIP.

I have not yet got around to adding fixture data, my apologies. I have been working on our continuous deployment in Azure and have a deadline coming up for that at the end of the week.

Do you have any suggestions for how we could also optimise SELECT statements?

gernathlub commented 1 year ago

@sjmf Thanks, and sorry that adding the tests is taking me so long. Regarding optimising SELECT statements, there are many ways to do it, depending on the situation. Do you have a specific statement in mind, or something more general?

sjmf commented 1 year ago

@gernathlub Not a worry at all. I also plan to look into code-coverage metrics for our tests, as I don't think we currently have a good idea of what our coverage is.

In particular, the predict_depths function performs a database lookup (line 120 of flood_risk.py). This Celery task is particularly heavy, as predict_depths runs for each member of a large matrix of geolocated 1x1 metre cells (or 'pixels') within the configured area. It is not uncommon to see 300k calls to this function while running the flood model.

The SELECT operation here is a filter on the date and parameters_id fields to check for a duplicate in the database, so that we can choose whether to run an UPDATE or an INSERT in the statement below.
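
In rough pseudocode (variable and field names are illustrative, not the exact flood_risk.py code), the per-pixel pattern is:

# One SELECT per pixel to check for an existing row...
existing = DepthPrediction.objects.filter(date=date, parameters_id=params_id).first()
if existing is not None:
    existing.depth = depth
    existing.save()  # ...then an UPDATE for that single row,
else:
    DepthPrediction(date=date, parameters_id=params_id, depth=depth).save()  # ...or an INSERT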

With the UPDATE/INSERT optimisation we have done so far, the database load has been halved, meaning we can now run in 30 minutes a task that would have taken an hour.

In terms of approach, I am wondering whether there are two options here:

sjmf commented 1 year ago

Test fixtures for models added in commit 4b960b0.

gernathlub commented 1 year ago

@sjmf There are the very useful methods update_or_create() and get_or_create(), but they cannot be performed in bulk. To do a single SELECT outside of predict_depths, we could select all predictions by date, but as Django queries are lazy it would be necessary either to load them into a list or to iterate the queryset directly; this would mean loading a lot of unnecessary data, or it would end up doing single SELECTs anyway. Another way is to load only the predictions you need by filtering on both parameters (date and param_id), but that would require constantly searching for a value in a huge list of parameter IDs (using __in).

So probably the best way is to use update_or_create().
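
In code, that would collapse the SELECT-then-UPDATE/INSERT pair into a single call (field names again only for illustration):

# update_or_create() runs the duplicate check and the write inside one
# atomic block, so the caller no longer does the filter-and-branch itself.
prediction, created = DepthPrediction.objects.update_or_create(
    date=date,
    parameters_id=params_id,
    defaults={"depth": depth},
)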

gernathlub commented 1 year ago

@sjmf I made some final modifications to PR #94. Loading FloodModelParameters data in small batches performs reasonably well (it may be worth considering loading DepthPrediction in batches as well). The rest of the changes don't dramatically reduce the number of queries, but they do reduce the amount of transmitted data. I also did some tests focused on performance and database changes. Could you please check the pull request and let me know if you would like me to change anything?