IDAS-Durham / JUNE

June is a framework for agent-based modelling in an epidemiological and geographical context.
GNU General Public License v3.0

parallelisation by NHS region #282

Closed bnlawrence closed 3 years ago

bnlawrence commented 4 years ago

We have discussed the possibility of using node-level parallelisation for regions, and core-level parallelisation for loops.

Probably the easiest regional decomposition to start with would be to run each of the NHS regions in parallel, moving people between them as required at the beginning or end of each timestep.

To do that, we need to understand a bit more about the geolocality in the model. Where and how are people moved, and how are their geographical locations used?

With respect to the second question, naively I assume there is some convolution of infection probability and people "met" in the Interaction part of the model. Is that right?

But the first question is the most pressing.
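
As a purely illustrative sketch of that node-level idea (the region names are NHS England's regions; everything else below is a placeholder for discussion, not JUNE code): assign regions to MPI ranks and let each rank simulate only its regions, exchanging movers at timestep boundaries.

# Purely illustrative: split the NHS regions across MPI ranks; each rank would
# simulate only the regions assigned to it. Nothing below is actual JUNE code.
from mpi4py import MPI

NHS_REGIONS = [
    "North East and Yorkshire", "North West", "Midlands", "East of England",
    "London", "South East", "South West",
]

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# round-robin assignment of regions to ranks
my_regions = [r for i, r in enumerate(NHS_REGIONS) if i % size == rank]
print(f"rank {rank} simulates: {my_regions}")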

arnauqb commented 4 years ago

There are two mechanisms that "drag" people non-conventionally. One is that when someone decides to go to leisure, they can drag their household with them, so it would try to append these people to the same subgroup. I don't see how this would be a problem, though, because all those people have the same household so they should be in the same domain. This is the one you pointed out.

The other one is the one that drags an adult when a kid is left home unattended, you can find this one in the apply function of individual_policies.py.

bnlawrence commented 4 years ago

Thanks. I think that suggests leisure is not the problem. The problem is elsewhere. There may be something about the children, cross domain ... but I'm not sure of that yet either ... I'm going to create a specific issue for this, as I think it could hurt us later if we don't solve it, but I also think that getting us to a position where we can hand over a functioning parallel update tree is more important. For now.

bnlawrence commented 4 years ago

@arnauqb I've just merged the Durham master into parallisation/mpi with a number of unexpected consequences. In particular, after reading from HDF5, everyone is busy. Noting that normal timesteps start from a clear_world with everyone not busy, would you not then want to clear_world before using these people - or alternatively - why does the initialisation make them busy (given in the HDF file the people themselves are not busy)?

bnlawrence commented 4 years ago

(The side effect of this is that it brings our problems to the first timestep and invalidates V's fix ... which may be no bad thing, but I just want to make sure this is an intended change, and understand why.)

arnauqb commented 4 years ago

The initialisation puts everyone in every possible group. This makes it easier to test where everyone is, but we can perfectly well change it so that people are not added to every possible group when the world is created.

bnlawrence commented 4 years ago

Well, let's not break your tests, but we can put a clear_world in our branch at the beginning, rather than the end, of the timestep ...
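
For concreteness, the ordering being suggested would look roughly like this (the loop structure and the do_timestep name are assumptions, not JUNE's actual simulator code; clear_world and move_people_to_active_subgroups are the methods discussed in this thread):

# Sketch of the proposed ordering only; do_timestep and the loop structure are
# assumptions, not the real simulator.
def run(simulator, timesteps):
    for timestep in timesteps:
        simulator.clear_world()  # everyone marked not busy before the timestep starts
        simulator.activity_manager.move_people_to_active_subgroups(timestep)
        simulator.do_timestep(timestep)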

arnauqb commented 4 years ago

Sounds good

bnlawrence commented 4 years ago

Actually, to be fair, I now find you already do that at the beginning of the simulator.run, so that's not the source of the new problems ... unfortunately ...

bnlawrence commented 4 years ago

I'm going round in circles, probably because of things I really don't understand. E.g. if I put this into activity manager:

def group_count(self, idstring):
    """ For debugging """
    n = 0
    for agroup in self.active_groups:
        ng = 0
        if agroup not in ["household_visits", "care_home_visits"]:
            this = getattr(self.world, agroup)
            if this is not None:
                for group in this.members:
                    ng += group.size
                print(f'{idstring}: group {this.group_type} has {ng} people')
                n += ng
    print(f'{idstring}: {n} in total cf {self.world.local_people.debug_stats}')

then if I call this just after move_people_to_active_subgroups, in single processor mode I get this:

Domain 0: group Hospitals has 0  people
Domain 0: group Pubs has 10410  people
Domain 0: group Cinemas has 1447  people
Domain 0: group Groceries has 11171  people
Domain 0: group Households has 144415  people
Domain 0: group CareHomes has 990  people
Domain 0: 168433 in total cf {'dead': 0, 'busy': 163852, 'hospitalised': 0, 'active': 163852, 'all': 163852, 'in': 0, 'out': 0, 'notw': 0}

somehow we have about 5000 too many people, so some people must be in multiple groups? (and in single processor mode this now crashes with a SimulatorError ...)

arnauqb commented 4 years ago

Hi Bryan,

when did you merge? I accidentally merged a PR that had a small bug and I corrected it shortly afterwards. Sorry if this caused you to waste a lot of time, could you merge the latest master?

bnlawrence commented 4 years ago

Ah. Thanks, indeed that was the problem. Yes, I "wasted" a lot of time, but no drama, I learnt a lot about other bits I hadn't looked at properly. Ok, I'm off for a long weekend. More Tuesday.

arnauqb commented 4 years ago

Hi all,

so, I had a fun weekend having a look at how you were implementing the parallelisation. I thought I had a good grasp on how to go about it, so I had a go at starting from scratch. I have a working (in the sense that it runs in parallel, but I haven't carefully checked if the results are correct) parallel implementation here: https://github.com/IDAS-Durham/JUNE/tree/parallelisation/subgroups_mpi

  1. The first thing I did was to create a general way to split the world into domains. This was done in this PR: https://github.com/IDAS-Durham/JUNE/pull/321, and the idea is that, given a list of super area ids, we initialise in a domain:

    • The people living in the list of super areas (so no foreign workers: each domain only contains the people that live in it, so there is no duplication of people).
    • The buildings (schools, hospitals, etc.) that live in the list of super areas.
    • For each person, person.subgroups points to each of the subgroups where they do activities, e.g. person.subgroups.residence points to their subgroup in a household / care home. If a particular subgroup is outside the simulated domain, I create an ExternalSubgroup object (defined in june/groups/group/subgroup.py) which mimics the group / subgroup structure but contains the domain id of where that subgroup actually lives. (A rough sketch of this idea is given below the figure.)

This means that we are very memory efficient, since the world is pretty much a direct union of the domains:

image
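
A rough sketch of that decomposition idea (class and attribute names here are simplified stand-ins, not the real june implementation):

# Illustrative stand-ins only: a domain keeps its residents and local buildings,
# and any subgroup living outside the domain is replaced by a lightweight stub
# recording where it really is.
class ExternalSubgroupStub:
    def __init__(self, domain_id, group_spec, group_id, subgroup_type):
        self.domain_id = domain_id        # which domain actually holds the group
        self.group_spec = group_spec      # e.g. "school", "hospital"
        self.group_id = group_id
        self.subgroup_type = subgroup_type

def build_domain_people(world, my_super_areas, super_area_to_domain):
    """Keep only residents of my_super_areas; externalise their foreign subgroups."""
    residents = [p for p in world.people if p.super_area in my_super_areas]
    for person in residents:
        for activity, subgroup in person.subgroups.items():
            if subgroup.super_area not in my_super_areas:
                person.subgroups[activity] = ExternalSubgroupStub(
                    super_area_to_domain[subgroup.super_area],
                    subgroup.group.spec, subgroup.group.id, subgroup.subgroup_type,
                )
    return residents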

  2. The original idea was to have two (or more) copies of the same person, one in the home domain, and one in the work domain. However, I think this was complicating things a bit too much, and it was probably not very efficient. So, what I do is:

    • In the ActivityManager class (in https://github.com/IDAS-Durham/JUNE/blob/4999947859b1175ca231a7a9428e3e3284073a07/june/activity/activity_manager.py#L283 ) I loop over all the people and assign them to the subgroups relevant for the current time step. If someone has a subgroup of the type ExternalSubgroup, I write down the domain id, group spec, group id, and subgroup type of that subgroup, to send it later to the corresponding domain. At the end, I have a dictionary with all the information on all the people that go abroad, and the subgroup information of where they have to be added. Right afterwards, in send_and_receive_people_from_abroad, I exchange the information between domains using MPI calls (a hedged sketch of this exchange is given below). This is the first time I have used an MPI library, so I would appreciate it if someone had a careful look :). Basically, for each person that goes abroad we send the subgroup data of where they are going, plus the person's susceptibility and infection probability.

    • Once each domain has the locally active people appended to the relevant subgroups, and has a dictionary of the people that are coming from abroad, I go to the Simulator and pass both groups of people to the InteractiveGroup class. This is a class that extracts all the relevant information of a group for the interaction. Since all we care about are the people's ids, susceptibilities, and transmission probabilities, I pass the foreigners' data directly here, and they interact with the locals. Check here: https://github.com/IDAS-Durham/JUNE/blob/parallelisation/subgroups_mpi/june/interaction/interactive_group.py . It's still a very rough implementation and can be simplified, as it looks a bit ugly atm. (I trust @valeriupredoi will find a way to make it 2x faster / cleaner :D).

    • The Interaction instance takes the InteractiveGroup instance and runs the interaction on it. Afterwards, it returns a list of infected people. If any of those ids belong to people who do not live in the simulated domain, I have to let their home domain know. This exchange of information happens here:

https://github.com/IDAS-Durham/JUNE/blob/4999947859b1175ca231a7a9428e3e3284073a07/june/simulator.py#L375
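
For concreteness, a hedged sketch of what that exchange could look like with mpi4py (the payload layout and function name are assumptions, not the actual send_and_receive_people_from_abroad implementation):

# Hedged sketch of the cross-domain exchange; the record layout is an assumption.
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

def exchange_people_abroad(people_to_send):
    """
    people_to_send[dest_rank] is a list of records like
    {"group_spec": ..., "group_id": ..., "subgroup_type": ...,
     "person_id": ..., "susceptibility": ..., "infection_probability": ...}.
    Returns a flat list of the records other domains sent to this rank.
    """
    outgoing = [people_to_send.get(dest, []) for dest in range(size)]
    incoming = comm.alltoall(outgoing)  # one (pickled) list per rank, both directions
    return [record for batch in incoming for record in batch]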

If you feel like trying this implementation, you can do it by running

mpirun -np [number_of_cores] scripts/run_parallel_simulation.py

Note that you will have to run the create_world.py script again, without any leisure venues and with commute off, and you can use the config_basic.yaml that has no commute / leisure.

I will post some benchmarking results tomorrow, and check if it agrees with the serial version.

PS: I'll be away 2 weeks from Tuesday for some time off.

arnauqb commented 4 years ago

Update: I have leisure working in parallel as well. I will run some tests in big worlds to see how it performs.

PR incoming: https://github.com/IDAS-Durham/JUNE/pull/325

Edit: It is not possible yet to read the loggers generated by the parallel version. @florpi is working on a PR for this.

bnlawrence commented 4 years ago

Great. I'm glad you've had a go at it; at some point that was always going to be the way to go, you need to own how it is done, and you're allowed to make changes we wouldn't (as we were trying to be minimally intrusive into the rest of the code).

We'll see if we can move over to this branch; however, at this point I can't run it as we are missing config_nocommute.yaml in the repo ... well, we can, by creating a new config file with no commute in it but which is otherwise the same as the standard config.

bnlawrence commented 4 years ago

What we're doing now:

  • comparing and contrasting what you've done and what we've done, and comparing performance.

Things on our mind:

  • forcing the domain decomposition a priori makes sense, but what if you want to re-run the same world on different numbers of processors? it might be better to take the serial hit at the beginning of each run, if the memory can be cleaned up (which we think it can).

Anyway, we'll spend a couple more days reviewing, and then maybe suggest a meeting with @florpi ...

florpi commented 4 years ago

I'd be happy to chat! Just added the config_nocommute.yaml file in case you were missing it.

arnauqb commented 4 years ago

What we're doing now:

  • comparing and contrasting what you've done and what we've done, and comparing performance.

Things on our mind:

  • forcing the domain decomposition a priori makes sense, but what if you want to re-run the same world on different numbers of processors? it might be better to take the serial hit at the beginning of each run, if the memory can be cleaned up (which we think it can).

The world as a whole is saved in an HDF5 file as usual. Then you create the domains from that; if you have a different number of processors it will split into a different number of domains, so there wouldn't be a problem in rerunning the same world for a different number of CPUs, or in serial.

bnlawrence commented 4 years ago

Great. I've now seen a bit more of that in the code. You've gone where I was heading when I said I'd want to make my DomainPopulation a World.people instance, by instead making a Domain which is a World ... (which sorted out all the loose ends I had with all the other groups). That's completely consistent with what I thought. The only substantive differences in approach seem to be 1) you've pulled all the parallel gubbins into the main code, which means you've exposed the parallelism to the code (something I wouldn't/couldn't do), and 2) you've not fully duplicated the people and, in doing so, simplified the information transfer (I don't think in the great scheme of things that would have been that expensive in memory, but doing that forces simplification in thinking about how to pass infection, which is a "Good Thing").

(We were stalled with two bugs, both of which I think just evaporate with your changes - the new groups fix the people-conservation issues, and the infection creation/removal is solved by only doing it properly in one domain.)

Good stuff. I have some outstanding things I want to chase down (for my own understanding), then I'll get onto @florpi as promised. The real question now is what, if anything more, we can contribute to the coding (I have some science things to discuss, but they are secondary to making sure you do what you want - and there the next step could be working out how to get the UK itself running performantly, which will be about understanding the scaling better).

bnlawrence commented 4 years ago

(When I say "wouldn't/couldn't", I simply meant "given we're not in the core team, I didn't think we should make those sorts of changes", we left them to you, and you've done them. Good.)

bnlawrence commented 4 years ago

I wonder whether Domain should be a subclass of World though ... it's very pythonic to have it behaving like one without being one, but you might want to formally force it to be one, for maintenance reasons ...

florpi commented 4 years ago

What would be the pros of Domain being a subclass of World?

florpi commented 4 years ago

Regarding first results, this is the first comparison I managed to do on a small world running with two cores,

image

Both versions took about the same time to run on my laptop, but it was pretty busy, so I hope to test it tomorrow on our supercomputer. Hopefully I'll be able to make some plots of running time as a function of the number of cores. At least it looks like we get the same results (I'll also check that, running different realisations of the same simulation, we are within one-sigma errors).

bnlawrence commented 4 years ago

These results may be useful; on my laptop: 4 cores ~550 s, 2 cores ~800 s, 1 core ~1400 s. The scaling may be better with a larger domain ("weak scaling"). We might be able to improve the strong scaling too, although it might not be worth it.

The use of a subclass will force you to worry more about the interfaces of both classes at the same time; otherwise you risk changes to World not being replicated in Domain and vice-versa. You'll easily trap changes to the interface signature, but making sure the functionality follows is harder, and may or may not be captured by testing.
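
To illustrate the maintenance argument only (these class and method names are schematic, not the actual definitions):

# Schematic only: if Domain subclasses World, an interface change on World
# (say, a new required argument on a shared method) is visible to Domain
# immediately, whereas two duck-typed classes can silently drift apart.
class World:
    def activate_groups(self, timestep, policies):  # hypothetical method
        ...

class Domain(World):
    # inherits activate_groups; an override with a mismatched signature is
    # easy to catch in review or tests, unlike in an unrelated duck-typed class
    pass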

bnlawrence commented 4 years ago

I think @arnauqb suggested there was a way of forcing the random numbers to be repeatable, if so, I'd hope you can demonstrate that the parallel version gives the same results as a serial version.

bnlawrence commented 4 years ago

It would be good to fully understand how we get to " If someone has a subgroup which is of the type ExternalSubgroup". If I've understood correctly this is achieved by use of subgroup.external (returned in get_personal_subgroup in activity_manager), which is presumably because the group they want to go to isn't in the domain. But I haven't found where the groups are flagged as external ...

arnauqb commented 4 years ago

It would be good to fully understand how we get to " If someone has a subgroup which is of the type ExternalSubgroup". If I've understood correctly this is achieved by use of subgroup.external (returned in get_personal_subgroup in activity_manager), which is presumably because the group they want to go to isn't in the domain. But I haven't found where the groups are flagged as external ...

They are two different classes, which have a class attribute .external. I don't check with isinstance because that was one of the things we found to be slow. The subgroups and external subgroups are restored to people in hdf5_savers/population_saver.py, at the end of the file, in the restore population function. The social venues a person can go to are stored in their household and area, and these are also initialised as Subgroup or ExternalSubgroup in hdf5_savers/household_saver.py and hdf5_savers/geography_saver.py, both at the end in restore... These functions are called in generate_domain_from_hdf5, defined in hdf5_savers/world_saver.py.
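
For anyone skimming, the pattern is roughly this (names simplified, not the actual class bodies):

# Both classes carry a class-level `external` flag, so the hot loop checks an
# attribute rather than calling isinstance(), which was found to be slow.
class Subgroup:
    external = False

class ExternalSubgroup:
    external = True

def route(subgroup, person, local_people, people_abroad):
    # attribute lookup instead of isinstance(subgroup, ExternalSubgroup)
    if subgroup.external:
        people_abroad.append(person)
    else:
        local_people.append(person)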

florpi commented 4 years ago

I think @arnauqb suggested there was a way of forcing the random numbers to be repeatable, if so, I'd hope you can demonstrate that the parallel version gives the same results as a serial version.

Yes, the random seed was fixed to be the same for those two runs. However, even if I run the serial code twice with the same seed I don't get the same results. This is how we fix the random seed, but we might be missing something @bnlawrence ?

import random

import numba as nb
import numpy as np

def set_random_seed(seed=999):
    """
    Sets global seeds for testing in numpy, random, and numbaized numpy.
    """

    @nb.njit(cache=True)
    def set_seed_numba(seed):
        return np.random.seed(seed)

    np.random.seed(seed)
    set_seed_numba(seed)
    random.seed(seed)
    return

set_random_seed()

These are two example runs of the serial code with the same random seed,

image

arnauqb commented 4 years ago

[quotes @florpi's previous comment, including the set_random_seed snippet and plot above]

I think that's the outdated one; you also need to do random.seed(seed) inside set_seed_numba.

See top of https://www.github.com/IDAS-Durham/JUNE/tree/master/test_june%2Funit%2Fconftest.py
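
For reference, a minimal sketch of the corrected seeding along those lines (the body of set_seed_numba now also seeds the stdlib generator; otherwise this mirrors the snippet above):

import random

import numba as nb
import numpy as np

def set_random_seed(seed=999):
    """Seed the stdlib, numpy, and numba generators (numba keeps its own internal state)."""

    @nb.njit(cache=True)
    def set_seed_numba(seed):
        random.seed(seed)     # seeds numba's internal stdlib-random state
        np.random.seed(seed)  # seeds numba's internal numpy-random state

    np.random.seed(seed)
    set_seed_numba(seed)
    random.seed(seed)

set_random_seed()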

florpi commented 4 years ago

So with @arnauqb 's correction fixing the seed works,

image

However, the comparison between serial and parallel code is still not fair,

image

since we are applying the seed for the infection twice lol

I'll change the InfectionSeed to only return IDs to infect, and then use the simulator infection functions. Hopefully then we'll get the same results

florpi commented 4 years ago

I changed the seed so that it is the same for both runs, and this is what I get,

image

So we will need to do some more digging on what is going on. If anyone wants to play around with it, you can run scripts/run_simulation.py for the serial version and scripts/run_parallel_simulation.py for the parallel one on my branch parallelisation/combine_loggers. To get the plots you can do,

import matplotlib.pyplot as plt

from june.logger import ReadLogger

read_parallel = ReadLogger(output_path='../scripts/results/')
read_serial = ReadLogger(output_path='../scripts/results_nompi/', n_processes=1)
world_parallel = read_parallel.world_summary()
world_serial = read_serial.world_summary()

world_serial['daily_infections'].resample('D').sum().plot(label='Serial')
world_parallel['daily_infections'].resample('D').sum().plot(label='Parallel', linestyle='dashed')
plt.legend()
plt.ylabel('# daily new infections')
plt.xlabel('Date')

bnlawrence commented 4 years ago

Unfortunately I have to do my day job at the moment, so I can't have a go (although I will try and get you some scaling plots overnight, as that requires no brainpower) ... but you can at least save yourself some time by only running for a few days, maybe even for a timestep or two. Once the runs diverge, they've diverged, and the question is why? The rest of the run is just emphasising the point ...

bnlawrence commented 4 years ago

I note that the weekend seems to involve more infections and more divergence ... but given the previous closer correspondence, this looks like the random number generators being out of sync when they should be in sync.

tristanc commented 4 years ago

You can't simply set the same seed across parallel runs because the seed determines the sequence of numbers that come out of the RNG. So even if you set the random seed to the same thing on each of the processes, each different region/process will use its own version of the sequence.

single proc you would generate random numbers like: A B C D E F G H I J K L M ....

on multiple procs using the same seed it would be proc1: A B C D E ... proc2: A B C D E ... etc.

So this is why you can't match the behaviour exactly between single and multiple processes. You'd need a global RNG that is shared across all the processes in order to exactly replicate.

The runs should be the same/repeatable if you run twice on the same number of processes, though.

Overall, I don't think this is a problem in the short term. Nice to fix in the long term, but not a priority short term.
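
One common way to get that per-process repeatability (an illustration, not what JUNE currently does) is to derive each rank's seed from a base seed plus its rank, so the streams differ across ranks but a rerun on the same number of processes reproduces exactly:

import random

import numpy as np
from mpi4py import MPI

def set_rank_seed(base_seed=999):
    # every rank gets a deterministic, rank-dependent seed
    seed = base_seed + MPI.COMM_WORLD.Get_rank()
    np.random.seed(seed)
    random.seed(seed)
    return seed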

bnlawrence commented 4 years ago

Ah yes. Good point. We don't use much random number generation.

I think it's not so much "fix" as "be sure that the single and parallel versions are doing the same thing". Once we know that, it's fine for the various different versions to be different. It might be that in the short term, as @florpi suggested, we can live with showing that N realisations of np=4 have the same stats as N realisations of np=1, but in the long term, at some point, we should show that we can make them the same (even if we don't want to do that in production).

(I'd want np>2 to be comfortable in this ... but we wouldn't need to do it for all np )

florpi commented 4 years ago

Yes, my worry is the same as @bnlawrence's. We should be comfortable that we haven't introduced anything spurious due to the parallelization. I'll run N realizations and compare their stats for now. But that is a good point @tristanc, I was wondering how it would work across processes, and the answer is it doesn't haha

bnlawrence commented 4 years ago

I've had an annoying afternoon trying to shepherd some batch jobs along, but they're failing for reasons which have nothing to do with this code. So, on the side, I ran the code on a JASMIN science VM with 2, 4 and 8 processors, starting with 500 infections to make things happen a bit faster ... anyway, it went from ~456 s, to ~300 s, to ~200 s with each doubling. I'm thinking that you'd probably get larger domains running in very similar times. Can you set me up with a much larger region to try (for when I do have the batch stuff running properly)?

florpi commented 4 years ago

Sure! If you feel adventurous we could run on your favorite region, or in the North East, since it is my favorite and the smallest one we have. I'm still fixing some issues with the infection seed for a realistic run, but you could start creating the world since that takes a few hours, and I'll let you know when we can run a realistic test.

To create the North East, go to scripts/create_world.py and change the geography line to:

geography = Geography.from_file({"region" : ["North East"]})

bnlawrence commented 4 years ago

Do I need to use your branch to create the geography?

florpi commented 4 years ago

Not really, but do make sure commute is turned off in https://github.com/IDAS-Durham/JUNE/blob/master/scripts/create_world.py#L39

bnlawrence commented 4 years ago

The first JASMIN scaling curve (from parallelisation/subgroups_mpi) isn't that impressive:

image

... although I am sure a factor of 4 speed up will be worth having ...

... but there are lots of reasons to think we can do a lot better than this. Firstly, if one asks "why is it not better?", the answer will come down to the balance between communication costs (pickling, and moving data) and the amount of work. There probably simply isn't enough work per process (and too many people moving between them) with 20 processors and the London50 case.

I am sure we can improve this. The easy improvement is obvious: just run a bigger domain. I'll do that this afternoon and we'll have a look at that. The second is a bit harder: we should look hard at whether this approach (which sends all information about all infected people) might be better done a bit more along the lines we were investigating (but hadn't committed when @arnauqb refactored), which was to transfer information only when people change state: uninfected -> infected -> uninfected (or dead). That might indeed cost a bit more in memory, but run faster. It's important to remember that the memory problems are much less severe in an MPI world ... we can just use more nodes.

florpi commented 4 years ago

Thanks for the curves! I agree, we need a larger domain to really test it.

Regarding the second, that sounds like an interesting idea. Would you be up for a chat at the beginning of next week?

Something worth exploring too is how we split the domains. Something we could do to minimise communication between processes is to split the domains according to the flow data (which determines how many people from a given super area go to work in another super area). In the current implementation, workflow is the main cross-domain movement. I have no idea how to do this split in an optimal way, but you might have some ideas.
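
To make the idea concrete, here is one hedged sketch of a flow-aware split: a greedy heuristic over made-up flow data that keeps merging the pair of blocks with the heaviest commuting flow until the target number of domains is reached (purely illustrative; a real split would also need to balance population):

from collections import defaultdict

def split_by_flow(super_areas, flows, n_domains):
    """
    super_areas: list of super area ids
    flows: dict {(a, b): number_of_commuters}
    Returns a list of sets of super area ids, one set per domain.
    """
    blocks = {sa: {sa} for sa in super_areas}  # each super area starts alone
    owner = {sa: sa for sa in super_areas}     # super area -> block representative

    while len(blocks) > n_domains:
        # aggregate flows between the current blocks
        agg = defaultdict(int)
        for (a, b), n in flows.items():
            ba, bb = owner[a], owner[b]
            if ba != bb:
                agg[tuple(sorted((ba, bb)))] += n
        if not agg:
            break
        # merge the pair of blocks with the largest cross-flow
        (ba, bb), _ = max(agg.items(), key=lambda kv: kv[1])
        blocks[ba] |= blocks.pop(bb)
        for sa in blocks[ba]:
            owner[sa] = ba
    return list(blocks.values())

# e.g. split_by_flow(["A", "B", "C", "D"],
#                    {("A", "B"): 500, ("C", "D"): 300, ("B", "C"): 10}, 2)
# -> [{"A", "B"}, {"C", "D"}]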

tristanc commented 4 years ago

If the pickling/unpickling itself is causing a lot of overhead it might be worth looking at other ways of sending the data between the domains. There are a lot of methods that are more efficient:

https://www.benfrederickson.com/dont-pickle-your-data/
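
For example (a sketch under the assumption that the per-person payload can be packed into a flat numpy array), mpi4py's buffer-based Send/Recv avoids pickling entirely:

# Sketch: pack the per-person payload into a numpy array and use buffer-based
# MPI calls, which bypass pickle. The field layout is an assumption.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # columns: person_id, susceptibility, infection_probability
    payload = np.array([[101, 1.0, 0.0], [102, 0.7, 0.3]], dtype="float64")
    comm.Send([payload, MPI.DOUBLE], dest=1, tag=7)
elif rank == 1:
    buf = np.empty((2, 3), dtype="float64")  # shape agreed in advance (or probed)
    comm.Recv([buf, MPI.DOUBLE], source=0, tag=7)
    print(f"rank 1 received:\n{buf}")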

bnlawrence commented 4 years ago

Actually, that's rather embarrassing, I think I ran the wrong code ... hold your thoughts on the scaling! (I did run the right thing even though it was in the wrong place.)

(But yes @tristanc, there are lots of other options. mpi4py pickles by default unless you use numpy arrays, which is the obvious route; and although I think one could probably force something else into the serialisation, we wouldn't normally try that.)

bnlawrence commented 4 years ago

image

Northeast runs about 7 times faster with 20 cores. Note the two worlds were run for differing periods (20 days and 10 days).

FrankKrauss commented 4 years ago

That is pretty awesome!

I would expect all of England to have similar scaling behaviour, which would mean an England run covering 2 simulated months drops from 4 days to overnight ....


bnlawrence commented 4 years ago

I suspect the scaling should get better for all of England, since there ought to be less communication between domains the further apart they are. That said, we may end up with lots of regions waiting on London before they can continue, so it could be that London controls the overall speed ... (insofar as the communication costs between domains might be largest there). Lots to tweak (e.g. how we distribute workers to domains). I'd be disappointed if we can't get another factor of two for all of England without too much work.

tristanc commented 4 years ago

That's so awesome. If it turns out that London is holding things up, I guess we could try splitting London in half?

bnlawrence commented 4 years ago

I expect we'll run with O(100) processes for all of England, so London should already be in bits.

sadielbartholomew commented 4 years ago

I'm largely lurking on this Issue, reading the comments without posting :wink:, but (& sorry to interrupt the flow of conversation) I thought I should jump in to say we should be wary of having a long conversation thread like this on a GitHub Issue, or indeed a PR. We're now at ~95 comments, and at ~100-200 comments I have seen other Issues become inaccessible, most of the time leading to an error page saying the page takes too long to load (e.g. as described here). This was a few years back, but I have heard it still occurs (as my link suggests).

Strangely, hundreds of commits are managed without issue, but hundreds of comments slow Issues & PRs down & can make them inaccessible :thinking:

Therefore I suggest we create a new, follow-on Issue, closing this one, at some point soon if we want discussions to continue in this way.

valeriupredoi commented 4 years ago

I fully agree with @sadielbartholomew on this one and would go further and suggest creating a Project for this - it warrants its own milestones, testing and multiple issues. Very cool stuff though :beer: