Closed bnlawrence closed 3 years ago
Hi Bryan,
let me give you a detailed explanation of how the model works.
In the world creation stage the following things happen: each person is assigned a set of activity subgroups: residence, primary_activity, medical_facility, commute, rail_travel (not currently used), leisure, and box (ignore this one as well). This IntEnum is stored in person.subgroups. A series of distributors assign people to the different groups and subgroups based on different criteria, such as workflow data or geographical proximity. We always use the location of someone's household area as their geographical location. For instance, a teacher, John, living in Durham and commuting to Newcastle would have:
- person.residence pointing to the adults subgroup of his household,
- person.primary_activity pointing to the teachers subgroup of a school in Newcastle,
- person.commute pointing to a commute hub in the city of Newcastle.
Residence and primary_activity are assigned at the world creation stage and never change. On the other hand, the activity person.leisure is dynamic: at each timestep involving leisure, it points to a different place (pub, grocery store, etc.)
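A minimal sketch of what such an IntEnum might look like (the names come from the list above, but the ordering and integer values here are invented, not JUNE's actual definition):

```python
from enum import IntEnum

# Hypothetical version of the activity enum described above; the real one
# lives in JUNE's source and may differ in ordering and values.
class Activities(IntEnum):
    residence = 0
    primary_activity = 1
    medical_facility = 2
    commute = 3
    rail_travel = 4  # not currently used
    leisure = 5
    box = 6  # ignore

# Members can be looked up by name or value, which is handy for storing
# a person's subgroups in a fixed-size, index-addressable structure.
assert Activities["commute"] is Activities.commute
```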
During the simulation, this is how people are moved: at each timestep the world is first cleared, so every person's dynamic activity (e.g. person.leisure) is set to None, and the subgroups themselves are cleared of all the people information belonging to them. Each person is then added to the .people list in the relevant subgroup. If a person needs to be hospitalised, then the Hospitalisation policy assigns the patients hospital subgroup to person.medical_facility. When we distribute people to the active subgroups, medical facilities always take preference over any other activity, so those people get sent to hospital regardless of all the other activities.
At the interaction stage, for each group, each of the group's subgroups interacts with itself and other subgroups according to a contact matrix. The infection probability takes into account how many contacts someone has in a specific place. For instance, in the case of a company, if every day someone makes 5 contacts in their company, then in the infection probability we have a factor 5 / size(subgroup = company workers)
to reflect how contacts are distributed.
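As a toy illustration of that scaling (not JUNE's actual code; the function name is invented), the per-pair factor might look like:

```python
def contact_factor(contacts_per_day, subgroup_size):
    """Per-pair weight so that each member averages `contacts_per_day`
    contacts, however large the subgroup is."""
    if subgroup_size <= 1:
        return 0.0  # nobody else to meet
    return contacts_per_day / subgroup_size

# A company subgroup of 20 workers, 5 contacts each per day:
assert contact_factor(5, 20) == 0.25
```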
Hope this is enough to have a first picture, let me know if you need anything more.
Thanks Arnau, that's really helpful. That looks really parallelisable, insofar as we just need to make sure that we move people between regions if their work is in a different region (and we think carefully about how we handle the commute). My suspicion is that it will be easiest to duplicate a bit of work across region boundaries and make sure we deduplicate when reporting results. But we can look at that. (I'll go quiet for a bit, I'm on leave from lunchtime, but I'll be coming back to this.)
Hi @arnauqb, just to jump in here as Bryan is now on leave and we want to make as much progress as we can to report back to him next week :slightly_smiling_face: Thanks for your summary above, it is really useful to me also.
Regarding geo-locality and otherwise, can I ask a few follow-up questions to help our understanding? I cannot be sure of the answers from your summary alone, without digging deep into the codebase:
Are commutes the only cases where people can be moved to a different NHS region (temporarily or otherwise)? For example, you say:
We always use the location of someone's household area as their geographical location.
but thinking in terms of the reality, I think it would not be uncommon in real life for people to change NHS region if they are being hospitalised? Especially if the local hospitals were reaching capacity? Does the Hospitalisation
policy currently include any logic similar to this that may move people across NHS regions?
You have stated things like:
For a leisure time-step
Similarly, for a commute time step
So, am I correct in deducing from this that each timestep can be categorised according to one of the activities (based on the datetime e.g. naturally overnight most people will be in their residence, etc.)?
You mention some checks, i.e:
For instance, a worker first checks if he needs to go to work, otherwise he stays home. A retired person first checks leisure, then stays home, etc.
From the way you have worded this, it sounds like these checks are conducted on a case-by-case i.e. person-by-person basis, rather than managing this on a group level e.g. we take X % of workers and send them to work in a given scenario. Is that correct? And if so, is there a reason it is done that way? I may be missing something important in my understanding here.
Finally, an off-topic (at least, non-optimisation) question that has come to me:
- Are commutes the only cases where people can be moved to a different NHS region (temporarily or otherwise)? For example, you say:
We always use the location of someone's household area as their geographical location.
No, there are more cases. Don't think of commute as something special; commute is just another group one can go to (a train carriage), so there is nothing intrinsically different from the other subgroups. If you have a look at configs/config_example.yaml
you'll see that there are time steps where there is commute. For each time of day, there is a list of activities that need to happen, and we apply some hierarchy to them for each person. For instance, if someone's person.mode_of_transport
is public, and their household is in an area that has commute (for instance London), then they will be assigned to a train carriage in that timestep, while other people will remain home for the duration of that timestep.
When we assign a working place to someone, we look at the flow data from the census; that means we have quite a lot of people moving a lot (for example, around 20k people live in London and work in the North East, or vice versa). And these will be teleported from their home to their workplace every day. Similarly, kids go to one of their nearest schools, but if they are on a region border then it's likely they go to a school in a different NHS region than their household's.
but thinking in terms of the reality, I think it would not be uncommon in real life for people to change NHS region if they are being hospitalised? Especially if the local hospitals were reaching capacity? Does the
Hospitalisation
policy currently include any logic similar to this that may move people across NHS regions?
We do keep track of the hospital ICU/bed capacity, but we do not use it to make decisions. So each person is sent to the hospital closest to their household (we model NHS trusts rather than individual hospitals). We'll probably improve this in the future, so we would have to keep it general.
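The closest-trust assignment described above could be sketched like this (a toy illustration with invented names, not JUNE's implementation; capacity is deliberately ignored, as in the current behaviour):

```python
import math

# Minimal sketch: each person goes to the NHS trust closest to their
# household location; hospital capacity plays no role in the decision.
def nearest_hospital(household_xy, hospitals):
    """Return the hospital (NHS trust) closest to a household location."""
    return min(hospitals, key=lambda h: math.dist(household_xy, h["xy"]))

hospitals = [{"name": "trust_a", "xy": (0, 0)}, {"name": "trust_b", "xy": (10, 0)}]
assert nearest_hospital((2, 1), hospitals)["name"] == "trust_a"
```

Note that nothing here prevents the nearest trust from being in a different NHS region, which is exactly the cross-region case raised above.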
You have stated things like:
For a leisure time-step
Similarly, for a commute time step
So, am I correct in deducing from this that each timestep can be categorised according to one of the activities (based on the datetime e.g. naturally overnight most people will be in their residence, etc.)?
Referring to the config file I mentioned previously, I call a leisure time-step a timestep that contains leisure as one of its activities. You can also see the rest of the schedule (and durations) there. That config file can be changed at the will of the user in a very flexible way. The activity_hierarchy
can be found at the top of june/activity/activity_manager.py
.
You mention some checks, i.e:
For instance, a worker first checks if he needs to go to work, otherwise he stays home. A retired person first checks leisure, then stays home, etc.
From the way you have worded this, it sounds like these checks are conducted on a case-by-case i.e. person-by-person basis, rather than managing this on a group level e.g. we take X % of workers and send them to work in a given scenario. Is that correct? And if so, is there a reason it is done that way? I may be missing something important in my understanding here.
Yep, that's the case. This is the time step for the activity manager; specifically, the move_people_to_active_subgroups
function in june/activity/activity_manager.py
, which loops over all the people in the world and assigns them a subgroup, taking into account the active policies. We went for the option of doing it individually to give us the greatest amount of flexibility, so that we could have truly individual policies (based on person, age, sex, work sector, etc.).
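The per-person hierarchy walk could be sketched as follows (a toy version; the names, ordering, and function signature are invented, not JUNE's actual activity_hierarchy or move_people_to_active_subgroups):

```python
# Hypothetical hierarchy: medical facilities first, as described above,
# residence last as the universal fallback.
activity_hierarchy = [
    "medical_facility", "commute", "primary_activity", "leisure", "residence",
]

def choose_subgroup(person_subgroups, active_activities):
    """Pick the highest-priority activity this person actually has among
    the activities active in this timestep; home is the fallback."""
    for activity in activity_hierarchy:
        if activity in active_activities and person_subgroups.get(activity):
            return activity
    return "residence"

# A commuter during a commute+leisure timestep goes to their carriage:
worker = {"residence": "household_1", "commute": "carriage_7", "leisure": None}
assert choose_subgroup(worker, ["commute", "leisure", "residence"]) == "commute"
```

Doing this person by person is what allows policies to act on individual attributes (age, sex, work sector) rather than on group-level fractions.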
Finally, an off-topic (at least, non-optimisation) question that has come to me:
- Does the model account for people shopping (for essentials, mainly)? I can't see any explicit mention across activities or "social" venues, but given that this was all we were allowed to do during the March to ~July lockdown I would have thought it was an important category of activity. So I am just wondering how that is included, if at all, in the model.
Yes, we have the geolocations of every single grocery shop in the UK, and people can go shopping during their leisure time. This is one of the activities that is not shut down during lockdown. The frequency at which people go shopping is taken from the UK free time survey, as are the timings for other leisure activities. We do not model panic buying during the week of the 16th of March onwards, but we could certainly do so. We also do not currently model which shops would be closed during lockdown.
I've seen commute mentioned quite often on this thread, but I just wanted to clarify that although individual commute carriages are filled up with random people every time commute is called, we only have commute within major cities. Therefore, this shouldn't be an issue at all if the code is parallelized based on NHS regions. Right now the most common problem will be someone living in a different region from the one they work in (these people do not commute, they teleport). There will also be a few kids going to school in a different region, or a few people close to the region's border whose favorite pub/grocery shop/cinema is in a different region.
Thank you both, they are really useful answers and comments. We are going to continue to understand the model and have a think about the things you have said and what we have learnt so far, but I am sure we will get back to you soon about this idea, at the latest next week when Bryan is back from leave.
Thanks for all the investigations. Can I just explore what happens now if you just do a region (e.g. London, and whatever regions you are doing for calibration).
- What happens to people who would be "teleported" elsewhere, outside the region? (Presumably teleported somewhere in the region?)
- What happens to people who live on the edges? Are their (e.g. leisure) activities constrained to be selected from (e.g. pubs) within the region?
Yes, same as above for schools.
- Presumably you just ignore the missing people who might have been teleported in?
That's correct.
Ok, thanks, to start thinking about mechanisms then:
- we need a place to put people when they shouldn't be "in the region", so the "regional calculation" can ignore them when they're not around.
- we need a place to look to find people who should be in the region for this timestep.
- in both cases we would need to know where they should be going to (in-the-other/in-this region).
This list of people needs to be updated by more than one region over time, but only ever by one region at any one timestep, apart from the point where they get moved.
All regions would have a halo made up of N sub-halos (which are on the boundary or from where folks teleport) and each of these would have a list of people who could be moving.
- At the point where people are infected and symptomatic, they're at home or in hospital; either way, they don't move (across regions).
- Otherwise they might move (they may choose to go to a pub across the boundary or not), or it might or might not be a work day.
Does this sound sensible so far?
(I am assuming the sum of people in the halo would be much smaller than the number of people in the region for any region of any significant size.)
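The halo/sub-halo bookkeeping could be sketched like this (purely hypothetical names and structure, just to make the idea concrete):

```python
from collections import defaultdict

# Hypothetical halo bookkeeping: one sub-halo per linked region (boundary
# neighbours or teleport sources), each holding the ids of people who
# might cross in a given timestep.
class Halo:
    def __init__(self):
        self.sub_halos = defaultdict(set)  # other region -> set of person ids

    def register(self, other_region, person_id):
        self.sub_halos[other_region].add(person_id)

    def movers(self, other_region):
        """People who could move between this region and other_region."""
        return sorted(self.sub_halos[other_region])

london = Halo()
london.register("north_east", 42)  # lives in London, works in the North East
london.register("south_east", 7)   # lives near the border, pub is across it
assert london.movers("north_east") == [42]
```

The assumption above (halo much smaller than region) is what makes this cheap: only the registered ids ever need to be communicated.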
Ok, thanks, to start thinking about mechanisms then:
- we need a place to put people when they shouldn't be "in the region", so the "regional calculation" can ignore them when they're not around.
- we need a place to look to find people who should be in the region for this timestep
- in both cases we would need to know where they should be going-to (in the-other/in-this region).
This list of people needs to be updated by more than one region over time, but only ever one region at any one timestep, apart from the point where they get moved.
Sounds good. We already explored this idea with a special group called "boundary" where we would put the people in the region that work outside the region. However, we dropped it for simplicity and just randomly distributed the workers inside the region. But yes, having these ingoing / outgoing hubs sounds like the way to go.
All regions would have a halo made up of N sub-halos (which are on the boundary or from where folks teleport) and each of these would have a list of people who could be moving.
- At the point where people are infected and symptomatic, they're at home or hospital, either way, they don't move (across regions)
One thing to keep in mind is that the decision whether they stay home or not when they are symptomatic is based on policies so it changes over time and we might make it region dependent as well. But that should not be an issue.
- Otherwise they might move (they may choose to go to a pub across the boundary or not) or it might or might not be a work day.
Does this sound sensible so far?
Sounds really good to me. Is there anything code related you want us to look at? I'm afraid I have zero experience with process communication in Python, but happy to help in other areas.
PS: as an astronomer I really like the halo / sub-halo naming.
Great. The simplest way to implement this would be to do nothing very intellectual. We could run each region as a standalone executable, and communicate via files. This would probably be the easiest and fastest way to test the idea. If each timestep is slow enough (and it looks like it probably is), then loading and unloading a list of tuples which correspond to (for each person actually moving)
(person_id, group_to_join, health status updates)
might be efficient enough. Each executable would have to wait on a file with the right timestep name at the end of each timestep. We could hide this all in a method which we could later use for a more sophisticated strategy. What we wouldn't do is pickle/unpickle the person, because the person would be in each region; we'd simply have a method to update the status of the person based on what we want to pass between regions, which is a very small subset of person attributes.
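A minimal sketch of that file-based exchange (all names here, send_updates, receive_updates, and the directory layout, are invented for illustration):

```python
import os
import pickle
import time

def send_updates(region, timestep, updates, outdir="exchange"):
    """Write this region's (person_id, group_to_join, health_update) tuples;
    write to a temp file and rename so a reader never sees a partial file."""
    os.makedirs(outdir, exist_ok=True)
    tmp = os.path.join(outdir, f"{region}_{timestep}.tmp")
    final = os.path.join(outdir, f"{region}_{timestep}.pkl")
    with open(tmp, "wb") as f:
        pickle.dump(updates, f)
    os.rename(tmp, final)

def receive_updates(region, timestep, outdir="exchange", poll=0.1):
    """Wait on the file with the right timestep name, then load it."""
    path = os.path.join(outdir, f"{region}_{timestep}.pkl")
    while not os.path.exists(path):
        time.sleep(poll)
    with open(path, "rb") as f:
        return pickle.load(f)

send_updates("london", 3, [(42, "carriage_7", {"infected": True})])
assert receive_updates("london", 3) == [(42, "carriage_7", {"infected": True})]
```

Hiding the transport behind a pair of methods like these means the same call sites could later be backed by MPI instead of files.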
I think the next step would be to think about what we need to do to set it up in terms of the world config and the various boundary lists. I have a meeting this afternoon. When that is done, I will return to this :-) - but it might be we should have a quick chat on the zoom tomorrow.
(We'd have all sorts of load balancing problems, but that'd be a nice problem to have; it'd mean we had some real parallelism in play. If we get this right there is no a priori reason why we have to do this on NHS regional boundaries, we could go smaller; it will all depend on the balance of work versus comms costs.)
Sounds great. I think you only need to run the health status updates in the person's native region. We can decouple the hospitalisation from there so that we can do it locally.
We normally save the world into an hdf5 file, so we could create one big hdf5 and have each regional process read the area of interest. Then you only have to communicate where each person goes, and whether they got infected or not; I would then always allocate an infection in the process where the person's household is.
I am free for a Zoom call pretty much anytime this week.
Anyone else tracking this let me know, and I'll email you coordinates for a call at 10.30 this morning.
We have us a first fully working (famous last words :grin: ) prototype of a parallel run on two separate regions, with influx/outflux of workers during AM times! Have a look here https://github.com/valeriupredoi/JUNE-1 (master branch): I gave @arnauqb and @florpi admin rights, @grenville can write to the repo, and the usual suspects @bnlawrence and @sadielbartholomew can snoop around :mag: First off, I ran it on my two-core laptop (with mpirun -v -np 2 -npersocket 2 python run_simulation.py
), and with two processes for two (hardcoded) areas and a total population across both areas of 163k people, it took a very stable 346s (compared to the serial time for the same people of about 520-540s). Memory stays the same as for the serial process: each of the two parallel processes peaks at about 550M (there are a few places in the code where memory can be released, especially in the parallel.py
module, where there are people that are not used per sub-area, the binable
people - but for that to happen a lot of refs have to be removed). This is what @bnlawrence started a few days ago and @grenville and meself have made into a working prototype; there's a lot to be added and generalized, but I thought you'd be happy to see & test this before the weekend, to start it off on a good note :grin: PS @grenville is currently testing it on LOTUS/Jasmin
Amazing! Can't wait to have a look
@arnauqb just a heads up: I just tested it on JASMIN on sci5 (one of the scientific nodes, which is usually pretty clogged up) and it ran on 2 procs at 426s (serial would take about 700s) and on 4 procs (mpirun -v -np 4 -npersocket 2 python run_simulation.py
) at 288s (wowza!). And sorry, I was wrong: the areas are not hardcoded to 2, the domain gets split into as many areas as the number of processes it is run with (I tested it so many times on 2 on my 2-core laptop that I hardcoded that info in my brain, as @grenville pointed out :grin: ). OK, fish and chips time now, have a good weekend!
Hi Valeriu,
thanks! It looks very promising. I will write some functions so that the domain decomposition is done while we load the world from the hdf5; otherwise loading n
worlds and then clearing memory is going to kill us.
PS: you really like fish and chips
oh absolutely (on both accounts :grin: ) Yeah man, there are a few pointers in the comments where I said this is defo not nice and optimal, but I didn't know how else to do it. Bryan mentioned the people binning too. But I think it's in good enough shape to form a prototype that you guys can make into a production Ferrari :car: I am happy there is conservation of the number of people and infections across processes for now :grin:
You may have realised there have been quite a few changes in the infection
modules recently. I have basically cleaned out lots of attributes that were not used, and removed HealthInformation
as it was quite redundant with infection. I have merged your master branch with June's master branch and solved all the conflicts here:
https://github.com/IDAS-Durham/JUNE/tree/parallelisation/mpi
@arnauqb man cheers muchly! I grabbed that branch and all runs well :beer: Moreover, we can run with slightly more initial infections compared to my fork: we have just noticed that upping the number of initial infections makes the thing grind to a halt (actually, not even start the main loop). My fork bellies up at > 60 init infections, whereas parallelisation/mpi
can run with 125-130 init infections; @grenville had a look into it already, and I'll do some investigations too after lunch :+1:
When the model gets stuck each cpu is running at 100%, so I'm guessing the slowdown is pickling related. I've not been patient enough for it to get past the hold-up -- it is a bit odd that the model simply hangs and doesn't just run slower. If, instead of moving infections around, transmissions are moved, the model runs OK (I've not tried with more than 500 initial infections). I guess we need to rethink how to pass around the relevant information (this has already been raised as a performance issue). @arnauqb - I'm not sure exactly what needs to be moved - is the list of attributes easily compiled?
@grenville @arnauqb I found the bottleneck: it's the send/receive comms. Specifically, for me the first batch of sends is not happening: comm.send(tell_them, dest=other_rank, tag=100)
from the PM/wknd bit in parallel.py
- if tell_them
is small (order 5 dict items) it sends and receives it no problemo; if it gets bigger (order 20 items) it refuses to send and just hangs around
Hi sorry, just catching up on this since I've been on leave. This is all looking great! Just a note on commute - with PR #313 you can now call person.commute.group.super_area
if person.commute is not None
. Hope this is helpful.
Something to note here is that the super_area assigned to the commute group is that where the central station of that city resides.
@grenville @arnauqb I found the bottleneck: it's the send/receive comms: specifically for me the first batch of send is not happening
comm.send(tell_them, dest=other_rank, tag=100)
from the PM/wknd bit in parallel.py
- if tell_them
is small (order 5 dict items) it sends and receives it no problemo, if it gets bigger (order 20 items) it refuses to send and just hangs around
OK, and those are Infection
objects - hefty hefty - so it's pickling those until they pickle :pick: So if we can use transmission, as @grenville points out, maybe the Transmission
object is less hefty. Can we not just pass around the PID (person ID) and convert them to infected
via person.infected = True
when they cross the border? I mean, do they have to store all the information in the infection
attribute?
assembling the data to be sent via comms into a dictionary gives us an idea of how much is too much for the pickler:

```python
for person in outside_domain:
    person.busy = True
    if person.infected:  # it happened at work!
        tell_them[person.id] = person

persdict = {}
for pid, obj in tell_them.items():
    persdict[pid] = {}
    persdict[pid]["infection"] = {}
    persdict[pid]["transmission"] = obj.infection.transmission
    persdict[pid]["infection"]["tag"] = obj.infection.symptoms.tag
    persdict[pid]["infection"]["time_exposed"] = obj.infection.symptoms.time_exposed
    persdict[pid]["infection"]["time_of_symptoms_onset"] = obj.infection.symptoms.time_of_symptoms_onset
    persdict[pid]["infection"]["trajectory"] = obj.infection.symptoms.trajectory

comm.send(persdict, dest=other_rank, tag=100)
```
A send with 100 initial infections is OK; more than that is a no-no. Commenting out persdict[pid]["infection"]["trajectory"] = ...
allows sending OK for about 250 initial infections; more than that, nope. So reducing the size of sent objects is paramount to be able to MPI proper :beer:
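The size gap is easy to demonstrate with a toy stand-in for a heavy infection object (FakeInfection and its fields are invented; the point is only the pickled-size comparison, not JUNE's actual attributes):

```python
import pickle

# FakeInfection stands in for a heavy Infection object with a long
# trajectory and history attached.
class FakeInfection:
    def __init__(self):
        self.trajectory = [(float(i), f"stage_{i}") for i in range(50)]
        self.history = list(range(1000))

heavy = {pid: FakeInfection() for pid in range(100)}
# Minimal payload: just (susceptibility, transmission probability) per id.
light = {pid: (0.0, 0.3) for pid in range(100)}

# The minimal payload pickles to a small fraction of the object payload.
assert len(pickle.dumps(light)) < len(pickle.dumps(heavy))
```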
We only really need to send two numbers: person.susceptibility
and person.infection.transmission.probability
(if that person is infected). When the person "comes back" we only need to know if we need to infect them or not.
We only really need to send two numbers:
person.susceptibility
and person.infection.transmission.probability
(if that person is infected). When the person "comes back" we only need to know if we need to infect them or not.
I think we only need a simple method on person to generate and receive these numbers, but as I said in the call, for now, I think we should keep that method sitting inside the parallel code. I'm sorry I failed to make that clear before I went on leave; it seems like it's held us up by a day or two ...
We only really need to send two numbers:
person.susceptibility
and person.infection.transmission.probability
(if that person is infected). When the person "comes back" we only need to know if we need to infect them or not.
@arnauqb cheers for the clarification man! I reckon in that case it should be easy-peasy!
@bnlawrence no worries, I was 100% in serial land last week before I started looking at your implementation, so am sure you said it but I didn't register it :+1:
@arnauqb Do we only need to update those if someone is actually infected while away? Otherwise there is no new information, right?
Yes, exactly. Susceptibility only changes when you are infected (it is set to 0), and the transmission probability is updated at every time step but only for the infected people.
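The delta-only payload described above could be sketched like this (build_payload and the dict layout are invented names for illustration):

```python
# Send only deltas: for people infected while away, susceptibility drops
# to 0 and the current transmission probability is attached; nothing is
# sent for people whose state did not change.
def build_payload(people):
    payload = {}
    for pid, state in people.items():
        if state["infected"]:
            payload[pid] = {
                "susceptibility": 0.0,  # set to 0 on infection
                "transmission_probability": state["transmission_probability"],
            }
    return payload

people = {
    1: {"infected": True, "transmission_probability": 0.4},
    2: {"infected": False, "transmission_probability": 0.0},
}
assert build_payload(people) == {
    1: {"susceptibility": 0.0, "transmission_probability": 0.4}
}
```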
Current status overnight is that the parallelisation/mpi
branch in my fork nearly works, but has a bug associated with the infector_id. I think that might well be an actual bug (insofar as it happens before anyone is exchanged), but it's a fortunate bug in that it shows a major problem with the current parallelisation strategy. The problem occurs here (in simulator.py
):

```python
for infector_id in chain.from_iterable(int_group.infector_ids):
    infector = self.world.local_people[infector_id - first_person_id]  # why this? # V: good question!
    assert infector.id == infector_id
    infector.infection.number_of_infected += (
        n_infected
        * infector.infection.transmission.probability
        / tprob_norm
    )
```
Recall that the strategy is that we only have a local world of people who live in the domain, or who work in the domain but live elsewhere. Everyone else needs to be removed (eventually) so that the memory per domain is a fraction of that for the entire world. People can work in other domains.
Unfortunately that means someone can be infected in "the other domain" by someone who does not live in their domain, then come back to their domain and infect someone else. This breaks the current approach, insofar as we would need access in our domain to the infection status of a person who does not live (or work) here.
There are a number of strategies we could use to fix this, but we'd need to decide them with the core team, as they all involve somewhat more intrusive code changes than those so far.
(Actually, I don't think this is a bug of mine, it is a direct result of the idea that the person.id
can be used to index into the population since the population is an ordered list. That's clearly going to break when we have local populations which are a subset of the world population.)
Relying on world.people
being an ordered list always worried me.
We generate the ids for everything automatically; that means that if, in the same Python session, you create multiple worlds, the ids of the people in the worlds after the first one won't start at 0. That's why I correct with a shift, so that I start at the first person of the population for every particular world.
We could make world.people
a dictionary and the problem would be solved, right? As for the particular issue, infector.infection.number_of_infected
is only used to calculate R0 and nothing else, so I would deactivate that for now; maybe we can figure something out, like writing the infector's id to the logger file directly and adding it up there.
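The difference between the two lookups is easy to see in miniature (toy data, invented names):

```python
# With an ordered, contiguous list you need the offset hack: index by
# (id - first_person_id), which breaks for any subset of the population.
first_person_id = 1000
people_list = [f"person_{i}" for i in range(1000, 1005)]
assert people_list[1002 - first_person_id] == "person_1002"

# With a dict keyed by id, a non-contiguous domain population just works:
people_dict = {1002: "person_1002", 2017: "person_2017"}
assert people_dict[2017] == "person_2017"  # no offset, no ordering assumed
```

The dict costs some extra memory for the hash table, which matches the memory-overhead caveat raised below, but it removes the ordering assumption entirely.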
That's really helpful. Indeed using a dictionary sorts out a number of problems, and we can do that in the new DomainPopulation
class which (at some point) ought to be a subclass of Population
- at some future time we can bring those together. It will come with some memory overhead though ...
I have been thinking about the infected problem, and indeed, it can be solved by doing the infection stats post-fact via the logger. There are other methods too, but for now, we'll hack it.
(At the moment we have a rather nasty hack called local_people
which replaces people
, but as I say, that can be more elegantly handled once it all works as expected.)
I wanted to share this with you to show you how important and timely the work you are doing on parallelization is. So far, we have been running individual NHS regions to calibrate the model. However, we now know that to get a consistent fit across regions we need inter-regional mixing, and therefore we need to run the whole of England, since the corrections are not small. For whole-England runs the current code is, however, too slow. Here are some results after two days of running England (for one example run),
So we definitely need to speed this up! Thanks for the work you are doing
Thanks for the info. Sorry it's taking so long.
It's not taking long at all! I just wanted to share it so that you see the end use of your work :)
Morning. Current status is that we have things happily running for 30-odd timesteps, but breaking due to someone being busy (we think) at the beginning of an activity_manager.move_people_to_active_subgroups
call. It's possible that's our fault, but is there any situation that would mean someone could be busy after a clear_world
?
we've had problems with zombies in the past; can you please check whether the person that is busy is also dead? It'd be good to look at their symptoms and activities too; you can do person.infection.symptoms.tag
and [activity.group.spec for activity in person.activities if activity is not None]
Hi - we're trying to figure out the absolute minimal amount of information to pass around for people living in one domain and working in another. If we can figure this out, we'll be one step ahead. The most difficult case (as I see it) is: if a person is first infected in the work [home] domain, that person in the home [work] domain halo does not yet have an infection attribute, and passing the entire infection class is too expensive in comms. Can we simply pass the person's infection status and the infection start_time, for example, and allow the health status update to fill in the remaining fields (transmission, symptoms ...)?
Hi Grenville,
Let us call the domain where the person has their household the home domain, and the domain where the person works the work domain. I think the home domain should handle all the infection business, so if the person gets infected at work, the work domain tells the home domain to generate an infection for that person.
Once a person is infected, if that person is sent away to another domain (where that person will not have an infection), the home domain sends person.susceptibility
and person.infection.infection_probability
. The Interaction
module only handles people ids, susceptibilities, and transmission probabilities, so we do not need to have an infection created in the person copy of the work domain. When we do interactive_group = InteractiveGroup(group)
in simulator.py
, the relevant infection information is extracted from the people in the group, so I think that would be the place to read the susceptibility / transmission for the infected person that is coming from another domain.
quick update on the parallel run - it runs to the end, with what we believe is the correct setup and inter-domain comms; quick question for the code gurus: the reason why we had an issue with a spurious couple of persons that were extra to the domain population is that in june/groups/group/subgroup.py
, when calling append
, a few non-active persons were appended; I solved that temporarily by adding a conditional in append:

```python
def append(self, person: Person):
    """
    Add a person to this group
    """
    if person.active:
        self.people.append(person)
        person.busy = True
```
this is obviously not the way for stable code, since active
is set only if parallel_setup
is done, but we need to locate where the non-active person is coming from. I did a bit of black-box testing and I located the problem in leisure.py
, where mates are being assigned:
```python
if random() < probability:
    for mate in person.residence.group.residents:
        if mate != person:
            if mate.busy:
                if (
                    mate.leisure is not None
                ):  # this person has already been assigned somewhere
                    mate.leisure.remove(mate)
                    mate.subgroups.leisure = subgroup
                    subgroup.append(mate)
            else:
                mate.subgroups.leisure = (
                    subgroup  # person will be added later in the simulator.
                )
```
indeed, when the mate is not busy, that mate is not active. But I completely lost track of it outside leisure - is it added in the simulator
as the comment says? And if so, where? Cheers guys! :beer:
Hi Grenville,
Let us call the domain where the person has their household the home domain, and the work domain where the person works. I think home domain should handle all the infection business, so if the person gets infected at work, the work domain tells the home domain to generate an infection for that person.
Once a person is infected, if that person is sent away to another domain (where that person will not have an infection), the home domain sends
person.susceptibility
and person.infection.infection_probability
. The Interaction
module only handles people ids, susceptibilities, and transmission probabilities, so we do not need to have an infection created in the person copy of the work domain. When we do interactive_group = InteractiveGroup(group)
in simulator.py
, the relevant infection information is extracted from the people in the group, so I think that would be the place to read the susceptibility / transmission for the infected person that is coming from another domain.
awesome, cheers @arnauqb :beer: Could you spare a couple of seconds to talk about that mate too pls :beers:
Hi Valeriu,
yes, sure, do you want to have a zoom call? I'm available for the rest of today.
@arnauqb man, am done for today and am taking tomorrow off (JUNE-off that is, gonna have to do some other work stuffs), let's have us a meeting on Tuesday when @bnlawrence is back too :beer:
quick update on the parallel run - it runs to final, with what we believe it is the correct setup and inter-domain comms; quick question for the code gurus: the reason why we had an issue with a spurious couple of persons that were extra to the domain population is that in
june/groups/group/subgroup.py
when calling append
a few non-active persons were appended; I solved that temporarily by adding a conditional in append:

```python
def append(self, person: Person):
    """
    Add a person to this group
    """
    if person.active:
        self.people.append(person)
        person.busy = True
```

this is obviously not the way for stable code since
active
is set only if parallel_setup
is done, but we need to locate where the non-active person is coming from. I did a bit of black-box testing and I located the problem in leisure.py
where mates are being assigned:

```python
if random() < probability:
    for mate in person.residence.group.residents:
        if mate != person:
            if mate.busy:
                if (
                    mate.leisure is not None
                ):  # this person has already been assigned somewhere
                    mate.leisure.remove(mate)
                    mate.subgroups.leisure = subgroup
                    subgroup.append(mate)
            else:
                mate.subgroups.leisure = (
                    subgroup  # person will be added later in the simulator.
                )
```

indeed when mate is not busy, that mate is not active. But I completely lost track of it outside leisure - is it added in
simulator
as the comment says? And if so where? Cheers guys! 🍺
So I don't understand. If I simply comment out the else part of this loop, I would have thought it would remove the problem. But it does not ...
We have discussed the possibility of using node-level parallelisation for regions, and core-level parallelisation for loops.
Probably the easiest regional decomposition to start with would be to look at running each of the NHS regions in parallel and moving people between them as required at the beginning or end of each timestep.
To do that, we need to understand a bit more about the geo-locality in the model. Where and how are people moved, and how are their geographical locations used?
With respect to the second question, naively I assume there is some convolution of infection probability and people "met" in the Interaction part of the model. Is that right?
But the first question is the most pressing.