UCL / TLOmodel

Epidemiology modelling framework for the Thanzi la Onse project
https://www.tlomodel.org/
MIT License

Out-of-memory jobs killed by Azure #1163

Open marghe-molaro opened 1 year ago

marghe-molaro commented 1 year ago

The following job was killed by Azure due to a RAM issue, despite a relatively small population size (150k): long_run_all_diseases-2023-10-09T081418Z

Many runs also failed for this job in mode 2 with population size 100k: effect_of_policy-2023-10-03T165843Z

marghe-molaro commented 11 months ago

Hi @tamuri, just checking whether we have any new insights into what may have been causing the excess RAM demand for these jobs? If not, I think I will try to resubmit with the contraception logger switched off (it is not required) and tlo.methods.healthsystem.summary disabled (which, as suggested in PR 1157, may not be behaving as expected) to see if that's enough to take these jobs past the finish line.
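For reference, switching off specific loggers can be done through the custom_levels entry of the log_config passed to the simulation. A minimal sketch, assuming the log_config mechanism used by the framework's run scripts (check the exact keys against a current run script):

```python
import logging

from tlo import Date, Simulation

# Sketch only: assumes the log_config / custom_levels mechanism used by the
# run scripts; module names are the two loggers mentioned above.
log_config = {
    "filename": "long_run_all_diseases",
    "custom_levels": {
        "*": logging.INFO,
        # Silence the loggers suspected of contributing to the problem:
        "tlo.methods.contraception": logging.CRITICAL,
        "tlo.methods.healthsystem.summary": logging.CRITICAL,
    },
}

sim = Simulation(start_date=Date(2010, 1, 1), seed=0, log_config=log_config)
```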

tamuri commented 11 months ago

I haven't investigated but I don't think it's the logs because it runs out of memory during the simulation run, not the post-processing of the log file (where a large log file has caused problems in the past). I'm guessing it's the population size. I'll take a look tomorrow.

tamuri commented 11 months ago

Quick test. This is a little under 9 months of simulation time. There are events causing big spikes in memory usage. The Azure VMs have 7.5GB of memory, so this run would already have been killed.

[Figure: Rplot — process memory over the test run, with spikes exceeding the 7.5GB limit]

Need to do some more digging, but top offenders are:

   total_gb increase_gb                                                                   event
1  4.784164   4.0286446  <tlo.methods.healthsystem.HealthSystemScheduler object at 0x133317ad0>
2  5.274666   3.7423286  <tlo.methods.healthsystem.HealthSystemScheduler object at 0x133317ad0>
3  4.246059   3.7283936     <tlo.methods.contraception.ContraceptionPoll object at 0x13354ae10>
4  4.491314   3.5014229           <tlo.methods.contraception.DirectBirth object at 0x137beeb10>
5  3.891785   2.8981476  <tlo.methods.healthsystem.HealthSystemScheduler object at 0x133317ad0>
6  3.465004   2.6420860               <tlo.methods.alri.AlriPollingEvent object at 0x132d35b10>
7  2.942112   2.2692070               <tlo.methods.alri.AlriIncidentCase object at 0x1354afd90>
8  2.839962   1.9623909           <tlo.methods.contraception.DirectBirth object at 0x137bd9fd0>
9  2.612389   1.8963928  <tlo.methods.healthsystem.HealthSystemScheduler object at 0x133317ad0>
10 2.350819   1.6135025  <tlo.methods.healthsystem.HealthSystemScheduler object at 0x133317ad0>
11 2.113945   1.5335045  <tlo.methods.healthsystem.HealthSystemScheduler object at 0x133317ad0>
12 1.469296   0.8763275  <tlo.methods.healthsystem.HealthSystemScheduler object at 0x133317ad0>
13 1.947548   0.7416763           <tlo.methods.contraception.DirectBirth object at 0x137870250>
14 1.451740   0.6443176       <tlo.methods.alri.AlriNaturalRecoveryEvent object at 0x18586ed50>
15 1.247070   0.6224289  <tlo.methods.healthsystem.HealthSystemScheduler object at 0x133317ad0>

i.e. in a single execution of the health system scheduler, memory increased by 4.03GB.

The population dataframe is steady at 300MB (at least during this short run).
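For reference, the bookkeeping behind a table like the one above can be reproduced with psutil by sampling the process RSS either side of each event. A rough sketch (the fire_event wrapper is hypothetical, not TLOmodel API):

```python
import psutil

_process = psutil.Process()
records = []

def fire_event(event):
    """Run one simulation event, recording the change in process RSS."""
    before = _process.memory_info().rss
    event.run()  # hypothetical: however the event loop actually fires an event
    after = _process.memory_info().rss
    records.append({
        "event": repr(event),
        "total_gb": after / 1e9,
        "increase_gb": (after - before) / 1e9,
    })

# After the run, the top offenders are simply:
# sorted(records, key=lambda r: r["increase_gb"], reverse=True)[:15]
```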

tamuri commented 11 months ago

Left it running overnight (~30 months of simulation time). Looks like something is being collected and then purged (every 6 months??)

[Figure: Rplot01 — process memory over the overnight (~30 month) run, showing a periodic build-up and purge]
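One way to check whether that periodic purge lines up with Python's cyclic garbage collector, rather than (say) pandas reallocations, would be to log the GC counters next to RSS. A purely diagnostic sketch, not something tried here:

```python
import gc

import psutil

_process = psutil.Process()

def log_memory_and_gc(label: str) -> None:
    """Print process RSS alongside the garbage collector's per-generation counts."""
    rss_mb = _process.memory_info().rss / 1e6
    print(f"{label}: rss={rss_mb:.1f}MB gc_counts={gc.get_count()}")

# Calling gc.collect() at a known point and seeing whether RSS actually drops
# would help distinguish objects awaiting collection from memory retained by
# pandas/numpy buffers or the allocator itself.
```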

tbhallett commented 11 months ago

I've just discovered that my runs on extra_big_run_all_diseases-2023-11-03T121619Z and healthsystem_under_different_assumptions-2023-11-03T122326Z (under draw 4) were also killed, probably by the same thing.

matt-graham commented 11 months ago

I've had a go at using memray to do some memory profiling of the src/scripts/profiling/scale_run.py script. This defaults to a 50k population and mode 2. Running a simulation of 2 years gives the following overall memory usage pattern:

[Figure: overall memory usage, 2-year run, 50k population, mode 2]

The sharp peaks seem to correspond to the Malaria.malaria_poll2 method called at

https://github.com/UCL/TLOmodel/blob/4bffc1a3c8d5ae7336ea4534486830255a0173e1/src/tlo/methods/malaria.py#L652

which is in the monthly scheduled event MalariaPollingEventDistrict. The ~300MB increase in memory usage during these peaks matches well with the ~300MB allocations recorded for the malaria_poll2 method in tight windows around the peaks (and not shown in windows outside the peaks).

There is also an underlying sawtooth-like trend: a slow rise followed by a sharp drop at roughly a 6-month period, though not exactly. Assuming the spikes do correspond to the monthly malaria poll event, each sawtooth period contains a variable number of spikes, so the drops are not happening at a consistent interval. This is similar to what @tamuri observed, though perhaps less pronounced due to the smaller population size. It looks like it may be related to the Simulation.do_birth method, as there is a drop of about 100MB in the recorded allocations for this function across each falling edge of the sawtooth - potentially this is connected to triggering an increase in the population dataframe size, and maybe that then triggers some sort of consolidation of memory (very much speculating here!).

I'll try running with a larger 150k population size overnight and see what we get.
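For anyone wanting to reproduce this: memray can either wrap the whole script (python -m memray run ...) or be attached around just the simulate call via its Tracker context manager, roughly as below. This is a sketch; the sim and end_date objects stand in for whatever scale_run.py builds.

```python
import memray  # pip install memray

# Record allocations for the part of the run we care about; the resulting
# capture file is then rendered with the memray CLI, e.g.
#   python -m memray flamegraph scale_run.bin
#   python -m memray stats scale_run.bin
with memray.Tracker("scale_run.bin"):
    sim.simulate(end_date=end_date)  # sim / end_date as set up by scale_run.py
```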

tbhallett commented 11 months ago

Thanks very much @matt-graham

I think Asif's original profiling might have been in mode 1, and mode 1 is where I hit problems.

If possible, could you run on Mode 1 (as well as Mode 2), as you say?

tbhallett commented 11 months ago

The growing of the dataframe does seem like it could be linked to those roughly six-monthly changes.
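A cheap way to test that would be to log the dataframe's own footprint alongside RSS at each poll. A sketch, assuming the population dataframe is reachable as sim.population.props (adjust if the attribute differs):

```python
import pandas as pd

def population_memory_mb(sim) -> float:
    """Size of the population dataframe in MB (attribute name is an assumption)."""
    df: pd.DataFrame = sim.population.props
    return float(df.memory_usage(deep=True).sum()) / 1e6

# If the ~6-monthly falling edges of the sawtooth coincide with jumps in this
# number, that would support the dataframe-growth explanation.
```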

matt-graham commented 11 months ago

Memory usage for a 2-year run of src/scripts/profiling/scale_run.py with 150k initial population and mode 2 (the mode 1 run is still ongoing, 20 hours in :grimacing:):

[Figure: overall memory usage, 2-year run, 150k population, mode 2]

This was run from master before #1199 was merged, so it still shows the sharp spikes that PR should have fixed. The overall pattern is very similar to the run with 50k population, other than the value the memory usage stabilizes around after the initial linear transient being roughly double what it was previously.

The flamegraph for the time point corresponding to peak memory usage (only considering functions called from the top-level Simulation.simulate call) is as follows:

[Figure: flamegraph at peak memory usage, mode 2 run]

The top-level function calls under Event.apply with the 5 largest memory allocations are:

1. HealthSystemScheduler.process_events_mode_2 called from HealthSystemScheduler.apply (537.3MiB)
2. Simulation.do_birth called from DirectBirth.apply (479.0MiB)
3. Simulation.do_birth called from BirthAndPostNatalOutcomesEvent.apply (310.7MiB)
4. Diarrhoea.end_episode called from DiarrhoeaNaturalRecoveryEvent.apply (163.0MiB)
5. AlriIncidentCase.apply_characteristics_of_the_case called from AlriIncidentCase.apply (160.3MiB)

While the overall memory usage is high, it still seems to be a bit lower than shown in @tamuri's plots and without the same amplitude in the variation.

Unfortunately the full memray output is very large (~8GiB), and rendering a full 'temporal' flamegraph, which allows selecting the window over which to view the flamegraph, seems to be beyond my laptop.

matt-graham commented 11 months ago

Memory usage for a 2 year run of src/scripts/profiling/scale_run.py with 150k initial population and mode 1

[Figure: overall memory usage, 2-year run, 150k population, mode 1]

Again this was run from master before #1199 was merged, hence the sharp spikes connected to the malaria poll are still visible here.

Broadly similar pattern to the mode 2 plot in the comment above, other than a more significant rising trend and then drop between 12:00 and 15:00 on the time axis, and larger overall memory use.

The flamegraph for the time point corresponding to peak memory usage (only considering functions called from the top-level Simulation.simulate call) is as follows:

[Figure: flamegraph at peak memory usage, mode 1 run]

The top-level function calls under Event.apply with the 5 largest memory allocations are:

1. HealthSystemScheduler.process_events_mode_0_and_1 called from HealthSystemScheduler.apply (1.3GiB)
2. Simulation.do_birth called from DirectBirth.apply (371.0MiB)
3. Simulation.do_birth called from BirthAndPostNatalOutcomesEvent.apply (350.6MiB)
4. AlriIncidentCase.apply_characteristics_of_the_case called from AlriIncidentCase.apply (180.4MiB)
5. SchistoSpecies.update_infectious_status_and_symptoms called from SchistoMatureWorms.apply (174.2MiB)

This overlaps a lot with the list from the mode 2 run, other than HealthSystemScheduler.process_events_mode_2 being swapped for HealthSystemScheduler.process_events_mode_0_and_1, with that function using a lot more memory overall (presumably because of longer HSI event queues?).
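If the queue-length explanation needs checking, deep-sizing the HSI event queue directly would show how much it actually holds. A sketch using the third-party pympler package; the queue attribute name is an assumption, not confirmed against the code:

```python
from pympler import asizeof  # pip install pympler

def hsi_queue_memory_mb(health_system_module) -> float:
    """Deep size, in MB, of the health system's HSI event queue (hypothetical attribute)."""
    queue = health_system_module.HSI_EVENT_QUEUE  # assumption: check the actual attribute name
    return asizeof.asizeof(queue) / 1e6
```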

tamuri commented 11 months ago

> While the overall memory usage is high, it still seems to be a bit lower than shown in @tamuri's plots and without the same amplitude in the variation.

Mine was monitoring the process RSS, which might account for these differences.

The ~400MB memory usage in do_birth sounds right for the size of the population dataframe (i.e. if rows need to be added, which does not happen on every birth, the entire dataframe will be copied).
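To illustrate the point (a generic pandas example, not TLOmodel code): enlarging a dataframe is never done in place, so the peak transiently holds both the old and new copies.

```python
import numpy as np
import pandas as pd

# A float dataframe of roughly the size quoted above (~300MB); the real
# population dataframe has mixed dtypes, so this is only indicative.
df = pd.DataFrame(np.zeros((150_000, 250)))
print(f"{df.memory_usage(deep=True).sum() / 1e6:.0f} MB")

# Appending rows builds an entirely new dataframe, so for a moment both the
# old and new frames are alive (~600MB+ here) before the old one is freed.
extra = pd.DataFrame(np.zeros((10_000, 250)))
df = pd.concat([df, extra], ignore_index=True)

# Growing in large pre-allocated chunks, so that the copy happens rarely,
# is the usual mitigation.
```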

marghe-molaro commented 11 months ago

Many thanks for these @matt-graham.

@tbhallett just sharing some thoughts on what might be going on in mode 1 vs mode 2:

marghe-molaro commented 6 months ago

Hi @tamuri & @matt-graham ,

I'm about to submit some large jobs in mode 2 on Azure again - just wondering whether it would be best to go for the --more-memory option again, or whether we think there's been significant enough progress on this not to warrant the more expensive larger-RAM option. Many thanks in advance!

tamuri commented 6 months ago

No, we haven't tackled memory usage, so you'll want to use the larger RAM option.