Closed bnlawrence closed 3 years ago
Hi Bryan,
let me give you a detailed explanation of how the model works.
In the world creation stage the following things happen: each person is assigned a set of activity subgroups: residence, primary_activity, medical_facility, commute, rail_travel (not currently used), leisure, and box (ignore this one as well). This IntEnum is stored in person.subgroups. A series of distributors assign people to the different groups and subgroups based on different criteria, such as workflow data or geographical proximity. We always use the location of someone's household area as their geographical location. For instance, a teacher, John, living in Durham and commuting to Newcastle would have:
- person.residence pointing to the adults subgroup of his household,
- person.primary_activity pointing to the teachers subgroup of a school in Newcastle,
- person.commute pointing to a commute hub in the city of Newcastle.
Residence and primary_activity are assigned at the world creation stage and never change. On the other hand, the activity person.leisure is dynamic: at each timestep involving leisure, it points to a different place (pub, grocery store, etc.)
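A minimal sketch of what such an IntEnum might look like (the names come from the list above, but the ordering and integer values here are invented, not JUNE's actual definition):

```python
from enum import IntEnum

# Hypothetical version of the activity enum described above; the real one
# lives in JUNE's source and may differ in ordering and values.
class Activities(IntEnum):
    residence = 0
    primary_activity = 1
    medical_facility = 2
    commute = 3
    rail_travel = 4  # not currently used
    leisure = 5
    box = 6  # ignore

# Members can be looked up by name or value, which is handy for storing
# a person's subgroups in a fixed-size, index-addressable structure.
assert Activities["commute"] is Activities.commute
```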
During the simulation, this is how people are moved: at each timestep the world is first cleared, so every person's dynamic activity (e.g. person.leisure) is set to None, and the subgroups themselves are cleared of all the people information belonging to them. Each person is then added to the .people list in the relevant subgroup. If a person needs to be hospitalised, then the Hospitalisation policy assigns the patients hospital subgroup to person.medical_facility. When we distribute people to the active subgroups, medical facilities always take preference over any other activity, so those people get sent to hospital regardless of all the other activities.
At the interaction stage, for each group, each of the group's subgroups interacts with itself and other subgroups according to a contact matrix. The infection probability takes into account how many contacts someone has in a specific place. For instance, in the case of a company, if every day someone makes 5 contacts in their company, then in the infection probability we have a factor 5 / size(subgroup = company workers)
to reflect how contacts are distributed.
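As a toy illustration of that scaling (not JUNE's actual code; the function name is invented), the per-pair factor might look like:

```python
def contact_factor(contacts_per_day, subgroup_size):
    """Per-pair weight so that each member averages `contacts_per_day`
    contacts, however large the subgroup is."""
    if subgroup_size <= 1:
        return 0.0  # nobody else to meet
    return contacts_per_day / subgroup_size

# A company subgroup of 20 workers, 5 contacts each per day:
assert contact_factor(5, 20) == 0.25
```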
Hope this is enough to have a first picture, let me know if you need anything more.
Thanks Arnau, that's really helpful. That looks really parallelisable, insofar as we just need to make sure that we move people between regions if their work is in a different region (and we think carefully about how we handle the commute). My suspicion is that it will be easiest to duplicate a bit of work across region boundaries and make sure we deduplicate when reporting results. But we can look at that. (I'll go quiet for a bit, I'm on leave from lunchtime, but I'll be coming back to this.)
Hi @arnauqb, just to jump in here as Bryan is now on leave and we want to make as much progress as we can to report back to him next week :slightly_smiling_face: Thanks for your summary above, it is really useful to me also.
Regarding geo-locality and otherwise, can I ask a few follow-up questions to help our understanding? I cannot be sure of the answers from your summary alone, without digging deep into the codebase:
Are commutes the only cases where people can be moved to a different NHS region (temporarily or otherwise)? For example, you say:
We always use the location of someone's household area as their geographical location.
but thinking in terms of the reality, I think it would not be uncommon in real life for people to change NHS region if they are being hospitalised? Especially if the local hospitals were reaching capacity? Does the Hospitalisation
policy currently include any logic similar to this that may move people across NHS regions?
You have stated things like:
For a leisure time-step
Similarly, for a commute time step
So, am I correct in deducing from this that each timestep can be categorised according to one of the activities (based on the datetime e.g. naturally overnight most people will be in their residence, etc.)?
You mention some checks, i.e:
For instance, a worker first checks if he needs to go to work, otherwise he stays home. A retired person first checks leisure, then stays home, etc.
From the way you have worded this, it sounds like these checks are conducted on a case-by-case i.e. person-by-person basis, rather than managing this on a group level e.g. we take X % of workers and send them to work in a given scenario. Is that correct? And if so, is there a reason it is done that way? I may be missing something important in my understanding here.
Finally, an off-topic (at least, non-optimisation) question that has come to me:
- Are commutes the only cases where people can be moved to a different NHS region (temporarily or otherwise)? For example, you say:
We always use the location of someone's household area as their geographical location.
No, there are more cases. Don't think of commute as something special; commute is just another group one can go to (a train carriage), so there is nothing intrinsically different from the other subgroups. If you have a look at configs/config_example.yaml
you'll see that there are time steps where there is commute. For each time of day, there is a list of activities that need to happen, and we apply some hierarchy to them for each person. For instance, if someone's person.mode_of_transport
is public, and their household is in an area that has commute (for instance London), then they will be assigned to a train carriage in that timestep, while other people will remain home for the duration of that timestep.
When we assign a working place to someone, we look at the flow data from the census; that means we have quite a lot of people moving a lot (for example, around 20k people live in London and work in the North East, or vice versa). And these will be teleported from their home to their workplace every day. Similarly, kids go to one of their nearest schools, but if they are on a region border then it's likely they go to a school in a different NHS region than their household's.
but thinking in terms of the reality, I think it would not be uncommon in real life for people to change NHS region if they are being hospitalised? Especially if the local hospitals were reaching capacity? Does the
Hospitalisation
policy currently include any logic similar to this that may move people across NHS regions?
We do keep track of the hospital ICU/bed capacity, but we do not use it to make decisions. So each person is sent to the hospital closest to their household (we model NHS trusts rather than individual hospitals). We'll probably improve this in the future, so we would have to keep it general.
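The closest-trust assignment described above could be sketched like this (a toy illustration with invented names, not JUNE's implementation; capacity is deliberately ignored, as in the current behaviour):

```python
import math

# Minimal sketch: each person goes to the NHS trust closest to their
# household location; hospital capacity plays no role in the decision.
def nearest_hospital(household_xy, hospitals):
    """Return the hospital (NHS trust) closest to a household location."""
    return min(hospitals, key=lambda h: math.dist(household_xy, h["xy"]))

hospitals = [{"name": "trust_a", "xy": (0, 0)}, {"name": "trust_b", "xy": (10, 0)}]
assert nearest_hospital((2, 1), hospitals)["name"] == "trust_a"
```

Note that nothing here prevents the nearest trust from being in a different NHS region, which is exactly the cross-region case raised above.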
You have stated things like:
For a leisure time-step
Similarly, for a commute time step
So, am I correct in deducing from this that each timestep can be categorised according to one of the activities (based on the datetime e.g. naturally overnight most people will be in their residence, etc.)?
Referring to the config file I mentioned previously, I call a leisure time-step a timestep that contains leisure as one of its activities. You can also see the rest of the schedule (and durations) there. That config file can be changed at the will of the user in a very flexible way. The activity_hierarchy
can be found at the top of june/activity/activity_manager.py
.
You mention some checks, i.e:
For instance, a worker first checks if he needs to go to work, otherwise he stays home. A retired person first checks leisure, then stays home, etc.
From the way you have worded this, it sounds like these checks are conducted on a case-by-case i.e. person-by-person basis, rather than managing this on a group level e.g. we take X % of workers and send them to work in a given scenario. Is that correct? And if so, is there a reason it is done that way? I may be missing something important in my understanding here.
Yep, that's the case. This is the time step for the activity manager; specifically, the move_people_to_active_subgroups
function in june/activity/activity_manager.py
, which loops over all the people in the world and assigns them a subgroup, taking into account the active policies. We went for the option of doing it individually to give us the greatest amount of flexibility, so that we could have truly individual policies (based on person, age, sex, work sector, etc.).
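The per-person hierarchy walk could be sketched as follows (a toy version; the names, ordering, and function signature are invented, not JUNE's actual activity_hierarchy or move_people_to_active_subgroups):

```python
# Hypothetical hierarchy: medical facilities first, as described above,
# residence last as the universal fallback.
activity_hierarchy = [
    "medical_facility", "commute", "primary_activity", "leisure", "residence",
]

def choose_subgroup(person_subgroups, active_activities):
    """Pick the highest-priority activity this person actually has among
    the activities active in this timestep; home is the fallback."""
    for activity in activity_hierarchy:
        if activity in active_activities and person_subgroups.get(activity):
            return activity
    return "residence"

# A commuter during a commute+leisure timestep goes to their carriage:
worker = {"residence": "household_1", "commute": "carriage_7", "leisure": None}
assert choose_subgroup(worker, ["commute", "leisure", "residence"]) == "commute"
```

Doing this person by person is what allows policies to act on individual attributes (age, sex, work sector) rather than on group-level fractions.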
Finally, an off-topic (at least, non-optimisation) question that has come to me:
- Does the model account for people shopping (for essentials, mainly)? I can't see any explicit mention across activities or "social" venues, but given that this was all we were allowed to do during the March to ~July lockdown I would have thought it was an important category of activity. So I am just wondering how that is included, if at all, in the model.
Yes, we have the geolocations of every single grocery shop in the UK, and people can go shopping during their leisure time. This is one of the activities that is not shut down during lockdown. The frequency at which people go shopping is taken from the UK free time survey, as are the timings for other leisure activities. We do not model panic buying during the week of the 16th of March onwards, but we could certainly do so. We also do not currently model which shops would be closed during lockdown.
I've seen commute mentioned quite often on this thread, but I just wanted to clarify that although individual commute carriages are filled up with random people every time commute is called, we only have commute within major cities. Therefore, this shouldn't be an issue at all if the code is parallelized based on NHS regions. Right now the most common problem will be someone living in a different region from the one they work in (these people do not commute, they teleport). There will also be a few kids going to school in a different region, or a few people close to the region's border whose favorite pub/grocery shop/cinema is in a different region.
Thank you both, they are really useful answers and comments. We are going to continue to understand the model and have a think about the things you have said and what we have learnt so far, but I am sure we will get back to you soon about this idea, at the latest next week when Bryan is back from leave.
Thanks for all the investigations. Can I just explore what happens now if you just do a region (e.g. London, and whatever regions you are doing for calibration).
- What happens to people who would be "teleported" elsewhere, outside the region? (Presumably teleported somewhere in the region?)
- What happens to people who live on the edges? Are their (e.g. leisure) activities constrained to be selected from (e.g. pubs) within the region?
Yes, same as above for schools.
- Presumably you just ignore the missing people who might have been teleported in?
That's correct.
Ok, thanks, to start thinking about mechanisms then:
- we need a place to put people when they shouldn't be "in the region", so the "regional calculation" can ignore them when they're not around.
- we need a place to look to find people who should be in the region for this timestep.
- in both cases we would need to know where they should be going to (in-the-other/in-this region).
This list of people needs to be updated by more than one region over time, but only ever by one region at any one timestep, apart from the point where they get moved.
All regions would have a halo made up of N sub-halos (which are on the boundary or from where folks teleport) and each of these would have a list of people who could be moving.
- At the point where people are infected and symptomatic, they're at home or in hospital; either way, they don't move (across regions).
- Otherwise they might move (they may choose to go to a pub across the boundary or not), or it might or might not be a work day.
Does this sound sensible so far?
(I am assuming the sum of people in the halo would be much smaller than the number of people in the region for any region of any significant size.)
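The halo/sub-halo bookkeeping could be sketched like this (purely hypothetical names and structure, just to make the idea concrete):

```python
from collections import defaultdict

# Hypothetical halo bookkeeping: one sub-halo per linked region (boundary
# neighbours or teleport sources), each holding the ids of people who
# might cross in a given timestep.
class Halo:
    def __init__(self):
        self.sub_halos = defaultdict(set)  # other region -> set of person ids

    def register(self, other_region, person_id):
        self.sub_halos[other_region].add(person_id)

    def movers(self, other_region):
        """People who could move between this region and other_region."""
        return sorted(self.sub_halos[other_region])

london = Halo()
london.register("north_east", 42)  # lives in London, works in the North East
london.register("south_east", 7)   # lives near the border, pub is across it
assert london.movers("north_east") == [42]
```

The assumption above (halo much smaller than region) is what makes this cheap: only the registered ids ever need to be communicated.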
Ok, thanks, to start thinking about mechanisms then:
- we need a place to put people when they shouldn't be "in the region", so the "regional calculation" can ignore them when they're not around.
- we need a place to look to find people who should be in the region for this timestep
- in both cases we would need to know where they should be going-to (in the-other/in-this region).
This list of people needs to be updated by more than one region over time, but only ever one region at any one timestep, apart from the point where they get moved.
Sounds good. We already explored this idea with a special group called "boundary" where we would put the people in the region that work outside the region. However, we dropped it for simplicity and just randomly distributed the workers inside the region. But yes, having these ingoing / outgoing hubs sounds like the way to go.
All regions would have a halo made up of N sub-halos (which are on the boundary or from where folks teleport) and each of these would have a list of people who could be moving.
- At the point where people are infected and symptomatic, they're at home or hospital, either way, they don't move (across regions)
One thing to keep in mind is that the decision whether they stay home or not when they are symptomatic is based on policies so it changes over time and we might make it region dependent as well. But that should not be an issue.
- Otherwise they might move (they may choose to go to a pub across the boundary or not) or it might or might not be a work day.
Does this sound sensible so far?
Sounds really good to me. Is there anything code related you want us to look at? I'm afraid I have zero experience with process communication in Python, but happy to help in other areas.
PS: as an astronomer I really like the halo / sub-halo naming.
Great. The simplest way to implement this would be to do nothing very intellectual. We could run each region as a standalone executable, and communicate via files. This would probably be the easiest and fastest way to test the idea. If each timestep is slow enough (and it looks like it probably is), then loading and unloading a list of tuples which correspond to (for each person actually moving)
(person_id, group_to_join, health status updates)
might be efficient enough. Each executable would have to wait on a file with the right timestep name at the end of each timestep. We could hide this all in a method which we could later use for a more sophisticated strategy. What we wouldn't do is pickle/unpickle the person, because the person would be in each region; we'd simply have a method to update the status of the person based on what we want to pass between regions, which is a very small subset of person attributes.
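A minimal sketch of that file-based exchange (all names here, send_updates, receive_updates, and the directory layout, are invented for illustration):

```python
import os
import pickle
import time

def send_updates(region, timestep, updates, outdir="exchange"):
    """Write this region's (person_id, group_to_join, health_update) tuples;
    write to a temp file and rename so a reader never sees a partial file."""
    os.makedirs(outdir, exist_ok=True)
    tmp = os.path.join(outdir, f"{region}_{timestep}.tmp")
    final = os.path.join(outdir, f"{region}_{timestep}.pkl")
    with open(tmp, "wb") as f:
        pickle.dump(updates, f)
    os.rename(tmp, final)

def receive_updates(region, timestep, outdir="exchange", poll=0.1):
    """Wait on the file with the right timestep name, then load it."""
    path = os.path.join(outdir, f"{region}_{timestep}.pkl")
    while not os.path.exists(path):
        time.sleep(poll)
    with open(path, "rb") as f:
        return pickle.load(f)

send_updates("london", 3, [(42, "carriage_7", {"infected": True})])
assert receive_updates("london", 3) == [(42, "carriage_7", {"infected": True})]
```

Hiding the transport behind a pair of methods like these means the same call sites could later be backed by MPI instead of files.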
I think the next step would be to think about what we need to do to set it up in terms of the world config and the various boundary lists. I have a meeting this afternoon. When that is done, I will return to this :-) - but it might be we should have a quick chat on the zoom tomorrow.
(We'd have all sorts of load balancing problems, but that'd be a nice problem to have; it'd mean we had some real parallelism in play. If we get this right there is no a priori reason why we have to do this on NHS regional boundaries, we could go smaller; it will all depend on the balance of work versus comms costs.)
Sounds great. I think you only need to run the health status updates in the person's native region. We can decouple the hospitalisation from there so that we can do it locally.
We normally save the world into an hdf5 file, so we could create one big hdf5 and have each regional process read the area of interest. Then you only have to communicate where each person goes, and whether they got infected or not; I would then always allocate an infection in the process where the person's household is.
I am free for a Zoom call pretty much anytime this week.
Anyone else tracking this let me know, and I'll email you coordinates for a call at 10.30 this morning.
We have us a first fully working (famous last words :grin: ) prototype of a parallel run on two separate regions, with influx/outflux of workers during AM times! Have a look here https://github.com/valeriupredoi/JUNE-1 (master branch): I gave @arnauqb and @florpi admin rights, @grenville can write to the repo, and the usual suspects @bnlawrence and @sadielbartholomew can snoop around :mag: First off, I ran it on my two-core laptop (with mpirun -v -np 2 -npersocket 2 python run_simulation.py
), and with two processes for two (hardcoded) areas and a total population across both areas of 163k people, it took a very stable 346s (compared to the serial time for the same people of about 520-540s). Memory stays the same as for the serial process: each of the two parallel processes peaks at about 550M (there are a few places in the code where memory can be released, especially in the parallel.py
module, where there are people that are not used per sub-area, the binable
people - but for that to happen a lot of refs have to be removed). This is what @bnlawrence started a few days ago and @grenville and meself have made into a working prototype; there's a lot to be added and generalized, but I thought you'd be happy to see & test this before the weekend, to start it off on a good note :grin: PS @grenville is currently testing it on LOTUS/Jasmin
Amazing! Can't wait to have a look
@arnauqb just a heads up: I just tested it on JASMIN on sci5 (one of the scientific nodes, which is usually pretty clogged up) and it ran on 2 procs at 426s (serial would take about 700s) and on 4 procs (mpirun -v -np 4 -npersocket 2 python run_simulation.py
) at 288s (wowza!). And sorry, I was wrong: the areas are not hardcoded to 2, the domain gets split into as many areas as the number of processes it is run with (I tested it so many times on 2 on my 2-core laptop that I hardcoded that info in my brain, as @grenville pointed out :grin: ). OK, fish and chips time now, have a good weekend!
Hi Valeriu,
thanks! It looks very promising. I will write some functions so that the domain decomposition is done while we load the world from the hdf5; otherwise loading n
worlds and then clearing memory is going to kill us.
PS: you really like fish and chips
oh absolutely (on both accounts :grin: ) Yeah man, there are a few pointers in the comments where I said this is defo not nice and optimal, but I didn't know how else to do it. Bryan mentioned the people binning too. But I think it's in good enough shape to form a prototype that you guys can make into a production Ferrari :car: I am happy there is conservation of the number of people and infections across processes for now :grin:
You may have realised there have been quite a few changes in the infection
modules recently. I have basically cleaned out lots of attributes that were not used, and removed HealthInformation
as it was quite redundant with infection. I have merged your master branch with June's master branch and solved all the conflicts here:
https://github.com/IDAS-Durham/JUNE/tree/parallelisation/mpi
@arnauqb man cheers muchly! I grabbed that branch and all runs well :beer: Moreover, we can run with slightly more initial infections compared to my fork: we have just noticed that upping the number of initial infections makes the thing grind to a halt (actually, not even start the main loop). My fork bellies up at > 60 init infections, whereas parallelisation/mpi
can run with 125-130 init infections; @grenville had a look into it already, and I'll do some investigations too after lunch :+1:
When the model gets stuck each cpu is running at 100%, so I'm guessing the slowdown is pickling related. I've not been patient enough for it to get past the hold-up -- it is a bit odd that the model simply hangs and doesn't just run slower. If, instead of moving infections around, transmissions are moved, the model runs OK (I've not tried with more than 500 initial infections). I guess we need to rethink how to pass around the relevant information (this has already been raised as a performance issue). @arnauqb - I'm not sure exactly what needs to be moved - is the list of attributes easily compiled?
@grenville @arnauqb I found the bottleneck: it's the send/receive comms. Specifically, for me the first batch of sends is not happening: comm.send(tell_them, dest=other_rank, tag=100)
from the PM/wknd bit in parallel.py
- if tell_them
is small (order 5 dict items) it sends and receives it no problemo; if it gets bigger (order 20 items) it refuses to send and just hangs around
Hi sorry, just catching up on this since I've been on leave. This is all looking great! Just a note on commute - with PR #313 you can now call person.commute.group.super_area
if person.commute is not None
. Hope this is helpful.
Something to note here is that the super_area assigned to the commute group is that where the central station of that city resides.
@grenville @arnauqb I found the bottleneck: it's the send/receive comms: specifically for me the first batch of send is not happening
comm.send(tell_them, dest=other_rank, tag=100)
from the PM/wknd bit in parallel.py
- if tell_them
is small (order 5 dict items) it sends and receives it no problemo, if it gets bigger (order 20 items) it refuses to send and just hangs around
OK, and those are Infection
objects - hefty hefty - so it's pickling those until they pickle :pick: So if we can use transmission, as @grenville points out, maybe the Transmission
object is less hefty. Can we not just pass around the PID (person ID) and convert them to infected
via person.infected = True
when they cross the border? I mean, do they have to store all the information in the infection
attribute?
assembling the data to be sent via comms into a dictionary gives us an idea of how much is too much for the pickler:

```python
for person in outside_domain:
    person.busy = True
    if person.infected:  # it happened at work!
        tell_them[person.id] = person

persdict = {}
for pid, obj in tell_them.items():
    persdict[pid] = {}
    persdict[pid]["infection"] = {}
    persdict[pid]["transmission"] = obj.infection.transmission
    persdict[pid]["infection"]["tag"] = obj.infection.symptoms.tag
    persdict[pid]["infection"]["time_exposed"] = obj.infection.symptoms.time_exposed
    persdict[pid]["infection"]["time_of_symptoms_onset"] = obj.infection.symptoms.time_of_symptoms_onset
    persdict[pid]["infection"]["trajectory"] = obj.infection.symptoms.trajectory

comm.send(persdict, dest=other_rank, tag=100)
```
A send with 100 initial infections is OK; more than that is a no-no. Commenting out persdict[pid]["infection"]["trajectory"] = ...
allows sending OK for about 250 initial infections; more than that, nope. So reducing the size of sent objects is paramount to be able to MPI proper :beer:
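The size gap is easy to demonstrate with a toy stand-in for a heavy infection object (FakeInfection and its fields are invented; the point is only the pickled-size comparison, not JUNE's actual attributes):

```python
import pickle

# FakeInfection stands in for a heavy Infection object with a long
# trajectory and history attached.
class FakeInfection:
    def __init__(self):
        self.trajectory = [(float(i), f"stage_{i}") for i in range(50)]
        self.history = list(range(1000))

heavy = {pid: FakeInfection() for pid in range(100)}
# Minimal payload: just (susceptibility, transmission probability) per id.
light = {pid: (0.0, 0.3) for pid in range(100)}

# The minimal payload pickles to a small fraction of the object payload.
assert len(pickle.dumps(light)) < len(pickle.dumps(heavy))
```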
We only really need to send two numbers: person.susceptibility
and person.infection.transmission.probability
(if that person is infected). When the person "comes back" we only need to know if we need to infect them or not.
We only really need to send two numbers:
person.susceptibility
and person.infection.transmission.probability
(if that person is infected). When the person "comes back" we only need to know if we need to infect them or not.
I think we only need a simple method on person to generate and receive these numbers, but as I said in the call, for now, I think we should keep that method sitting inside the parallel code. I'm sorry I failed to make that clear before I went on leave; it seems like it's held us up by a day or two ...
We only really need to send two numbers:
person.susceptibility
and person.infection.transmission.probability
(if that person is infected). When the person "comes back" we only need to know if we need to infect them or not.
@arnauqb cheers for the clarification man! I reckon in that case it should be easy-peasy!
@bnlawrence no worries, I was 100% in serial land last week before I started looking at your implementation, so am sure you said it but I didn't register it :+1:
@arnauqb Do we only need to update those if someone is actually infected while away? Otherwise there is no new information, right?
Yes, exactly. Susceptibility only changes when you are infected (it is set to 0), and the transmission probability is updated at every time step but only for the infected people.
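The delta-only payload described above could be sketched like this (build_payload and the dict layout are invented names for illustration):

```python
# Send only deltas: for people infected while away, susceptibility drops
# to 0 and the current transmission probability is attached; nothing is
# sent for people whose state did not change.
def build_payload(people):
    payload = {}
    for pid, state in people.items():
        if state["infected"]:
            payload[pid] = {
                "susceptibility": 0.0,  # set to 0 on infection
                "transmission_probability": state["transmission_probability"],
            }
    return payload

people = {
    1: {"infected": True, "transmission_probability": 0.4},
    2: {"infected": False, "transmission_probability": 0.0},
}
assert build_payload(people) == {
    1: {"susceptibility": 0.0, "transmission_probability": 0.4}
}
```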
Current status overnight is that the parallelisation/mpi
branch in my fork nearly works, but has a bug associated with the infector_id. I think that might well be an actual bug (insofar as it happens before anyone is exchanged), but it's a fortunate bug in that it shows a major problem with the current parallelisation strategy. The problem occurs here (in simulator.py
):

```python
for infector_id in chain.from_iterable(int_group.infector_ids):
    infector = self.world.local_people[infector_id - first_person_id]  # why this? # V: good question!
    assert infector.id == infector_id
    infector.infection.number_of_infected += (
        n_infected
        * infector.infection.transmission.probability
        / tprob_norm
    )
```
Recall that the strategy is that we only have a local world of people who live in the domain, or who work in the domain but live elsewhere. Everyone else needs to be removed (eventually) so that the memory per domain is a fraction of that for the entire world. People can work in other domains.
Unfortunately that means someone can be infected in "the other domain" by someone who does not live in their domain, then come back to their domain and infect someone else. This breaks the current approach, insofar as we would need access in our domain to the infection status of a person who does not live (or work) here.
There are a number of strategies we could use to fix this, but we'd need to decide them with the core team, as they all involve somewhat more intrusive code changes than those so far.
(Actually, I don't think this is a bug of mine, it is a direct result of the idea that the person.id
can be used to index into the population since the population is an ordered list. That's clearly going to break when we have local populations which are a subset of the world population.)
Relying on world.people
being an ordered list always worried me.
We generate the ids for everything automatically; that means that if, in the same Python session, you create multiple worlds, the ids of the people in the worlds after the first one won't start at 0. That's why I correct with a shift, so that I start at the first person of the population for every particular world.
We could make world.people
a dictionary and the problem would be solved, right? As for the particular issue, infector.infection.number_of_infected
is only used to calculate R0 and nothing else, so I would deactivate that for now; maybe we can figure something out, like writing the infector's id to the logger file directly and adding it up there.
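The difference between the two lookups is easy to see in miniature (toy data, invented names):

```python
# With an ordered, contiguous list you need the offset hack: index by
# (id - first_person_id), which breaks for any subset of the population.
first_person_id = 1000
people_list = [f"person_{i}" for i in range(1000, 1005)]
assert people_list[1002 - first_person_id] == "person_1002"

# With a dict keyed by id, a non-contiguous domain population just works:
people_dict = {1002: "person_1002", 2017: "person_2017"}
assert people_dict[2017] == "person_2017"  # no offset, no ordering assumed
```

The dict costs some extra memory for the hash table, which matches the memory-overhead caveat raised below, but it removes the ordering assumption entirely.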
That's really helpful. Indeed using a dictionary sorts out a number of problems, and we can do that in the new DomainPopulation
class which (at some point) ought to be a subclass of Population
- at some future time we can bring those together. It will come with some memory overhead though ...
I have been thinking about the infected problem, and indeed, it can be solved by doing the infection stats post-fact via the logger. There are other methods too, but for now, we'll hack it.
(At the moment we have a rather nasty hack called local_people
which replaces people
, but as I say, that can be more elegantly handled once it all works as expected.)
I wanted to share this with you to show you how important and timely the work you are doing on parallelization is. So far, we have been running individual NHS regions to calibrate the model. However, we now know that to get a consistent fit across regions we need inter-regional mixing, and therefore we need to run the whole of England, since the corrections are not small. For whole-England runs the current code is, however, too slow. Here are some results after two days of running England (for one example run),
So we definitely need to speed this up! Thanks for the work you are doing
Thanks for the info. Sorry it's taking so long.
It's not taking long at all! I just wanted to share it so that you see the end use of your work :)
Morning. Current status is that we have things happily running for 30-odd timesteps, but breaking due to someone being busy (we think) at the beginning of an activity_manager.move_people_to_active_subgroups
call. It's possible that's our fault, but is there any situation that would mean someone could be busy after a clear_world
?
we've had problems with zombies in the past; can you please check whether the person that is busy is also dead? It'd be good to look at their symptoms and activities too; you can do person.infection.symptoms.tag
and [activity.group.spec for activity in person.activities if activity is not None]
Hi - we're trying to figure out the absolute minimal amount of information to pass around for people living in one domain and working in another. If we can figure this out, we'll be one step ahead. The most difficult case (as I see it) is: if a person is first infected in the work [home] domain, that person in the home [work] domain halo does not yet have an infection attribute, and passing the entire infection class is too expensive in comms. Can we simply pass the person's infection status and the infection start_time, for example, and allow the health status update to fill in the remaining fields (transmission, symptoms ...)?
Hi Grenville,
Let us call the domain where the person has their household the home domain, and the domain where the person works the work domain. I think the home domain should handle all the infection business, so if the person gets infected at work, the work domain tells the home domain to generate an infection for that person.
Once a person is infected, if that person is sent away to another domain (where that person will not have an infection), the home domain sends person.susceptibility
and person.infection.infection_probability
. The Interaction
module only handles people ids, susceptibilities, and transmission probabilities, so we do not need to have an infection created in the person copy of the work domain. When we do interactive_group = InteractiveGroup(group)
in simulator.py
, the relevant infection information is extracted from the people in the group, so I think that would be the place to read the susceptibility / transmission for the infected person that is coming from another domain.
quick update on the parallel run - it runs to the end, with what we believe is the correct setup and inter-domain comms; quick question for the code gurus: the reason why we had an issue with a spurious couple of persons that were extra to the domain population is that in june/groups/group/subgroup.py
, when calling append
, a few non-active persons were appended; I solved that temporarily by adding a conditional in append:

```python
def append(self, person: Person):
    """
    Add a person to this group
    """
    if person.active:
        self.people.append(person)
        person.busy = True
```
this is obviously not the way for stable code, since active
is set only if parallel_setup
is done, but we need to locate where the non-active person is coming from. I did a bit of black-box testing and I located the problem in leisure.py
, where mates are being assigned:
```python
if random() < probability:
    for mate in person.residence.group.residents:
        if mate != person:
            if mate.busy:
                if (
                    mate.leisure is not None
                ):  # this person has already been assigned somewhere
                    mate.leisure.remove(mate)
                    mate.subgroups.leisure = subgroup
                    subgroup.append(mate)
            else:
                mate.subgroups.leisure = (
                    subgroup  # person will be added later in the simulator.
                )
```
indeed, when the mate is not busy, that mate is not active. But I completely lost track of it outside leisure - is it added in the simulator
as the comment says? And if so, where? Cheers guys! :beer:
Hi Grenville,
Let us call the domain where the person has their household the home domain, and the work domain where the person works. I think home domain should handle all the infection business, so if the person gets infected at work, the work domain tells the home domain to generate an infection for that person.
Once a person is infected, if that person is sent away to another domain (where that person will not have an infection), the home domain sends
person.susceptibility
and person.infection.infection_probability
. The Interaction
module only handles people ids, susceptibilities, and transmission probabilities, so we do not need to have an infection created in the person copy of the work domain. When we do interactive_group = InteractiveGroup(group)
in simulator.py
, the relevant infection information is extracted from the people in the group, so I think that would be the place to read the susceptibility / transmission for the infected person that is coming from another domain.
awesome, cheers @arnauqb :beer: Could you spare a couple of seconds to talk about that mate too pls :beers:
Hi Valeriu,
yes, sure, do you want to have a zoom call? I'm available for the rest of today.
@arnauqb man, am done for today and am taking tomorrow off (JUNE-off that is, gonna have to do some other work stuffs), let's have us a meeting on Tuesday when @bnlawrence is back too :beer:
quick update on the parallel run - it runs to final, with what we believe it is the correct setup and inter-domain comms; quick question for the code gurus: the reason why we had an issue with a spurious couple of persons that were extra to the domain population is that in
june/groups/group/subgroup.py
when calling append
a few non-active persons were appended; I solved that temporarily by adding a conditional in append:

```python
def append(self, person: Person):
    """
    Add a person to this group
    """
    if person.active:
        self.people.append(person)
        person.busy = True
```

this is obviously not the way for stable code since
active
is set only if parallel_setup
is done, but we need to locate where the non-active person is coming from. I did a bit of black-box testing and I located the problem in leisure.py
where mates are being assigned:

```python
if random() < probability:
    for mate in person.residence.group.residents:
        if mate != person:
            if mate.busy:
                if (
                    mate.leisure is not None
                ):  # this person has already been assigned somewhere
                    mate.leisure.remove(mate)
                    mate.subgroups.leisure = subgroup
                    subgroup.append(mate)
            else:
                mate.subgroups.leisure = (
                    subgroup  # person will be added later in the simulator.
                )
```

indeed when mate is not busy, that mate is not active. But I completely lost track of it outside leisure - is it added in
simulator
as the comment says? And if so where? Cheers guys! 🍺
So I don't understand. If I simply comment out the else part of this loop, I would have thought it would remove the problem. But it does not ...
We have discussed the possibility of using node-level parallelisation for regions, and core-level parallelisation for loops.
Probably the easiest regional decomposition to start with would be to look at running each of the NHS regions in parallel and moving people between them as required at the beginning or end of each timestep.
To do that, we need to understand a bit more about the geo-locality in the model. Where and how are people moved, and how are their geographical locations used?
With respect to the second question, naively I assume there is some convolution of infection probability and people "met" in the Interaction part of the model. Is that right?
But the first question is the most pressing.