centreformicrosimulation / SimPaths

SimPaths is an open-source microsimulation framework for life course analysis, developed and maintained by CeMPA at the University of Essex

Parallel multiruns still locking each other out of input database #24

andrewbaxter439 closed this issue 10 months ago

andrewbaxter439 commented 11 months ago

Hi @pbronka @justin-ven. I've been working on this new repository to see if I can get parallel multiruns working again, but I have re-introduced the old errors I thought I had overcome (my mistake!). I have two suggested solutions (not mutually exclusive) that I will put in bugfix/ pull requests, but I'd suggest a deeper review/discussion of which would work best. I've created this issue so that the pull requests can reference an issue number.

Description of issue

When running a multirun, the input folder is copied to output\yyyymmddhhmmss_seed\input and read from there:

Country: UK. Running simulation from: 2017 to 2020
Reading from database at ./output\20231110101250_101\input\input... Success!
Creating population structures
Will expand the initial population to 20000 individuals, each of whom has an equal weight.

On subsequent runs, in order to not re-copy folder, JAS-mine-core sets copyInputFolderStructure to false and these lines then set the input folder to the root folder. So:

Country: UK. Running simulation from: 2017 to 2020
Reading from database at ./input\input... Success!
Creating population structures
Will expand the initial population to 20000 individuals, each of whom has an equal weight.
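The path switch described above can be sketched roughly as follows. This is an illustrative reconstruction, not the JAS-mine-core code: the class `InputPathSketch` and the method `resolveInputDir` are hypothetical, and only the flag name copyInputFolderStructure and the folder layout are taken from the description and logs.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: class and method names are hypothetical, not JAS-mine-core API.
final class InputPathSketch {

    // Returns the folder a run would read its input database from.
    static Path resolveInputDir(Path runOutputDir, boolean copyInputFolderStructure) {
        if (copyInputFolderStructure) {
            // First run: inputs were copied into the run's own output folder.
            return runOutputDir.resolve("input");
        }
        // Later runs: fall back to the shared root input folder,
        // which every other parallel multirun also points at.
        return Paths.get("input");
    }

    public static void main(String[] args) {
        Path runDir = Paths.get("output", "20231110101250_101");
        System.out.println(resolveInputDir(runDir, true));
        System.out.println(resolveInputDir(runDir, false));
    }
}
```

Under this assumed logic, only the first run of each multirun reads a private copy; every later run shares one database file with all other multiruns.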

However, this interferes with other runs which may attempt to access the database simultaneously, giving:

Random seed 301
Loading model parameters
Country: UK. Running simulation from: 2017 to 2018
Reading from database at ./input/input... failed! Retrying in 2s
Creating population structures
Run Run 0 failed

Likely problem

We had previously aimed to fix this with "AUTO_SERVER=TRUE" here: https://github.com/centreformicrosimulation/SimPaths/blob/2909099624042b85b1723d044ada3328bc6cc00b/src/main/java/simpaths/model/SimPathsModel.java#L2246

This allows two or more runs to send database requests at the same time to 'SimPaths/input/input.mv.db'. BUT as part of the process of re-accessing the database, each run is trying to drop/recreate tables: https://github.com/centreformicrosimulation/SimPaths/blob/2909099624042b85b1723d044ada3328bc6cc00b/src/main/java/simpaths/model/SimPathsModel.java#L2268-L2275
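For reference, AUTO_SERVER is a setting appended to the H2 connection URL (H2 calls this automatic mixed mode). A minimal sketch of how such a URL is assembled is below; the helper `buildUrl` and the class name are hypothetical, and the exact URL built in SimPathsModel may carry other settings.

```java
// Illustrative only: a hypothetical helper, not the code in SimPathsModel.
final class H2UrlSketch {

    // dbPath is the database file path without the ".mv.db" extension,
    // which H2 appends itself.
    static String buildUrl(String dbPath, boolean autoServer) {
        return "jdbc:h2:file:" + dbPath
                + ";AUTO_SERVER=" + (autoServer ? "TRUE" : "FALSE");
    }

    public static void main(String[] args) {
        // With AUTO_SERVER=TRUE, several processes may open the same file.
        System.out.println(buildUrl("./input/input", true));
        // With AUTO_SERVER=FALSE, the first process holds the file exclusively.
        System.out.println(buildUrl("./input/input", false));
    }
}
```

Sharing the file this way only makes concurrent connections possible; it does not coordinate schema changes, so one process can still drop a table that another is about to read.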

Hibernate on run 1 creates extra columns in these tables:

Hibernate: alter table if exists Person add column covidModuleGrossLabourIncomeBaseline_Xt5 smallint
Hibernate: alter table if exists Person add column flagAlignEntry boolean
Hibernate: alter table if exists Person add column flagAlignExit boolean

But the database re-setting drops these columns and doesn't re-add the new ones. The other runs then cannot find them.

Some potential solutions:

Quick fix: change to "AUTO_SERVER=FALSE"

This will block other runs from interfering with what the first run is doing until it has finished. Does this mean, though, that run 1 changes 'SimPaths/input/input.mv.db' to suit its second run and then doesn't interact with it again, whilst runs 2+ step in and repeat the same process as soon as run 1 has released it? It also means that the 'output/yyyymmddhhmmss_seed/input/input.mv.db' database which Hibernate is using remains unchanged from run to run.

Previous solution: change the JAS-mine-core lines to not reset database to 'SimPaths/input/input.mv.db' and change SimPathsModel to not drop tables

I got this working before, by adding a condition to lines 2268-2313 to only do the DROP/CREATE/ALTER TABLES commands on the first run. Subsequent runs seem to be able to read the same database file with no problems. But does the dropping and re-creating need to happen on every run to 'reset' the population?
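The kind of condition meant here can be sketched as below. This is a minimal illustration under assumptions, not the actual change: `FirstRunGuardSketch` and `prepareTables` are hypothetical names, and the real DROP/CREATE/ALTER statements are stood in for by a `Runnable`.

```java
// Illustrative only: hypothetical guard, not the condition used in SimPathsModel.
final class FirstRunGuardSketch {

    private boolean tablesPrepared = false;

    // Runs the table reset only once per multirun instance.
    // Returns true if the reset was executed on this call.
    boolean prepareTables(Runnable dropCreateAlter) {
        if (tablesPrepared) {
            // Later runs reuse the tables left by the first run.
            return false;
        }
        dropCreateAlter.run();
        tablesPrepared = true;
        return true;
    }

    public static void main(String[] args) {
        FirstRunGuardSketch guard = new FirstRunGuardSketch();
        for (int run = 0; run < 3; run++) {
            boolean reset = guard.prepareTables(
                    () -> System.out.println("DROP/CREATE/ALTER tables"));
            System.out.println("run " + run + " reset=" + reset);
        }
    }
}
```

The trade-off is the one raised in the question: with a guard like this, later runs start from whatever state the first run left in the tables rather than from a fresh copy.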

Solution in works: change JAS-mine-core to not reset and change entityManagerFactory back to null somehow?

These lines in JAS-mine-core re-create the Hibernate database entry, but are only run on the first run: https://github.com/jasmineRepo/JAS-mine-core/blob/f0a411798bdbd66adf2627e44ecd12df13d47481/microsim-core/src/main/java/microsim/data/db/DatabaseUtils.java#L246-L252

Changing the population tables as above seems to drop expected tables from the Hibernate database connection, which aren't recreated (as the lines don't run). Is it possible to force these lines to re-run every time?

Will propose changes, though would appreciate oversight as to what these are exactly accomplishing!

pbronka commented 11 months ago

Thanks @andrewbaxter439 for a very clear exposition of the problem.

Two initial comments:

  1. Is this only about the three columns covidModuleGrossLabourIncomeBaseline_Xt5, flagAlignEntry, flagAlignExit missing on subsequent runs? Perhaps we don't persist the database after they have been created, but we could?
  2. "On subsequent runs, in order to not re-copy folder":
     2.1. What are the subsequent runs in this context exactly? Is a subsequent run a new instance of a multirun that runs multiple simulations, or does this refer to runs of a single instance of the multirun?
     2.2. Could (should?) these subsequent runs access the data from e.g. ./output\20231110101250_101\input\input... instead of ./input\input...?

andrewbaxter439 commented 11 months ago

Thanks for this @pbronka. To clarify:

  1. There are quite a few columns, as I recall, that Hibernate creates in the Person table on run 1 of each multirun. The database normally seems to persist, I think, and when populateTaxdbReferences() doesn't do the DROP/CREATE TABLE Person cycle on ./output\20231110101250_101\input\input it has no trouble finding it again. What puzzles me is whether it is correct to drop/recreate the Person table in one database on run 2 whilst persisting covidModuleGrossLabourIncomeBaseline_Xt5 etc. in another database between runs of each multirun? Or is it best to re-create all columns every time?
  2. Subsequent runs:
     2.1. Sorry for not being clear. I mean that within, for example, multirun=100, run 2 of this single instance does not re-copy folder (which should be correct?).
     2.2. I instinctively think that should be the case. Changing JAS-mine-core to a) let each multirun read from its copied database each time and b) refresh the tables at the start of run=2 within the multirun could be a solution. I'm trialling a branch at https://github.com/jasmineRepo/JAS-mine-core/compare/master...andrewbaxter439:JAS-mine-core:no_db_reset

andrewbaxter439 commented 11 months ago

Update: the JAS-mine-core edit seems to do the trick. Database input URLs stay consistent across all runs of the multirun and the database is reset every time, i.e. the PERSON table is recreated from PERSON_UK_2017 and Hibernate re-adds all the columns it's expecting:

Hibernate: alter table if exists Person add column parent_socare_hrs float(53)
Hibernate: alter table if exists Person add column careProvidedTo_lag1 varchar(255)
Hibernate: alter table if exists Person add column covidModuleBaselinePayXt5 smallint
Hibernate: alter table if exists Person add column flag_align_entry boolean
Hibernate: alter table if exists Person add column flag_align_exit boolean
Hibernate: alter table if exists Person add column flag_dies boolean
Hibernate: alter table if exists Person add column flag_emigrate boolean
Hibernate: alter table if exists Person add column flag_immigrate boolean
Hibernate: alter table if exists Person add column household_status varchar(255)
Hibernate: alter table if exists Person add column original_id_person bigint
Hibernate: alter table if exists Person add column labour_supply_weekly smallint
Hibernate: alter table if exists Person add column les_c7_covid varchar(255)
Hibernate: alter table if exists Person add column s_index float(53)
Hibernate: alter table if exists Person add column s_index_normalised float(53)
Hibernate: alter table if exists Person add column equivalised_consumption_yearly float(53)

I think we had previously discussed whether all these were still used/needed and whether they should be created in the first run of the multirun anyway? This solution of consistently applying the same setup routine to the second run onwards of each multirun might be the most foolproof way forward if it seems to be doing what's expected?
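The per-run reset described in this update can be sketched as the pair of SQL statements below. This is an assumption-laden illustration: the class and method are hypothetical, the statements are generic H2-compatible SQL rather than the ones in SimPathsModel, and the `PERSON_<country>_<year>` naming pattern is inferred only from the PERSON_UK_2017 example and the "Country: UK ... from: 2017" log line.

```java
import java.util.List;

// Illustrative only: hypothetical helper, not the SQL issued by SimPathsModel.
final class PersonResetSketch {

    // Builds statements that rebuild PERSON from the country/year source table.
    static List<String> resetStatements(String country, int startYear) {
        // Assumed naming pattern, e.g. PERSON_UK_2017.
        String source = "PERSON_" + country + "_" + startYear;
        return List.of(
                "DROP TABLE IF EXISTS PERSON",
                "CREATE TABLE PERSON AS SELECT * FROM " + source);
    }

    public static void main(String[] args) {
        resetStatements("UK", 2017).forEach(System.out::println);
    }
}
```

After a reset like this the PERSON table holds only the source columns, which is why Hibernate then issues the alter table statements listed above on every run.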

andrewbaxter439 commented 10 months ago

PRs #41 and #27 seem to have fixed all these problems, and everything is now running smoothly!