OMS-NetZero / FAIR

Finite-amplitude Impulse Response simple climate model
https://docs.fairmodel.net
Apache License 2.0

FaIR2.0 array optimisation #98

Closed znicholls closed 1 year ago

znicholls commented 3 years ago

In the GIR repository, it is possible to run FaIR with an S x p x F x t array internally (S is scenarios, p is parameter sets, F is the number of forcing species and t is time). This provides a massive speed-up when running larger perturbed parameter ensembles over multiple scenarios. We should add the ability to do this with the FaIR2.0 implementation here.
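To illustrate where the speed-up comes from, here is a minimal sketch using a toy one-box decay model. It is not the FaIR or GIR equations; the function names, the `tau` parameter and the array shapes are assumptions chosen for illustration. The vectorised version loops only over time and lets numpy broadcasting handle the scenario and parameter-set axes, while the reference version loops over every scenario and parameter set in Python.

```python
import numpy as np


def run_vectorised(emissions, tau):
    """Toy one-box decay model (NOT the FaIR equations).

    emissions: shape (S, F, t); tau: shape (p, F). Returns shape (S, p, F, t).
    """
    n_scen, n_species, n_time = emissions.shape
    n_param = tau.shape[0]
    decay = np.exp(-1.0 / tau)[np.newaxis, :, :]  # (1, p, F)
    emms = emissions[:, np.newaxis, :, :]  # (S, 1, F, t)
    out = np.zeros((n_scen, n_param, n_species, n_time))
    state = np.zeros((n_scen, n_param, n_species))
    for i in range(n_time):
        # Broadcasting covers every scenario / parameter-set pair at once.
        state = state * decay + emms[..., i]
        out[..., i] = state
    return out


def run_looped(emissions, tau):
    """Same toy model with explicit Python loops, kept as a reference."""
    n_scen, n_species, n_time = emissions.shape
    n_param = tau.shape[0]
    out = np.zeros((n_scen, n_param, n_species, n_time))
    for s in range(n_scen):
        for p in range(n_param):
            state = np.zeros(n_species)
            decay = np.exp(-1.0 / tau[p])
            for i in range(n_time):
                state = state * decay + emissions[s, :, i]
                out[s, p, :, i] = state
    return out


rng = np.random.default_rng(0)
emissions = rng.random((3, 2, 50))  # S=3 scenarios, F=2 species, t=50 steps
tau = rng.uniform(5.0, 100.0, size=(4, 2))  # p=4 parameter sets, F=2 species
result = run_vectorised(emissions, tau)
```

Both functions give the same numbers; the vectorised one simply replaces the S x p Python loop iterations with array operations.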

My suggestion would be to add a new interface, run_perturbed_ensemble (or a similar name), similar to run, which takes a multi-scenario inp_df and somehow takes multiple configs in its cfg argument. Then I think, with sufficiently careful conversion, it would be possible to just use the existing _run_numpy functions with minor modification. This will hopefully save us from completely rewriting the internals (although maybe a rewrite is unavoidable).

@njleach this is probably your territory?

njleach commented 3 years ago

I'll take a look and give it a go in my spare time. I'll certainly have a go at first seeing if we can achieve this with identical numpy functions. A few options / choices that I've included in the GIR code but that we may want to re-think (for simplicity / usability) here:

I'll probably come up with more questions once I start going through the code... I think when writing the GIR code one of the harder aspects was deciding whether / trying to make it general (for example, do we want to make it such that the model still runs if a user doesn't input a "multi-parameter/scenario" set to the run_perturbed_ensemble interface)?

znicholls commented 3 years ago

I'd suggest starting with a specific use case and then generalising, rather than trying to generalise first, i.e. it might be simplest to just write one interface first (just copy the one you have in GIR) and see how that looks. From there, it might be pretty easy to see the pattern and generalise. We can always have more than one interface (e.g. one for F x p x S x t, one for F x S x t etc.).

"do we want to make it such that the model still runs if a user doesn't input a "multi-parameter/scenario" set to the run_perturbed_ensemble interface"

No, if the interfaces are sufficiently thin, maintaining a few of them is much simpler (and clearer, in my opinion) than writing something super general which is impossible to understand. I guess my experience is also that it's easier to write a few interfaces to start, then you start to see the patterns and can contract again thereafter/streamline the implementation so maintenance is much easier.
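As a rough sketch of what "a few thin interfaces" could look like, the example below puts two small wrappers around one shared numpy core. The core is a toy decay model rather than FaIR's actual _run_numpy code, and `_run_core`, `run_scenarios`, `tau` and the array shapes are all hypothetical; only the name run_perturbed_ensemble comes from this thread.

```python
import numpy as np


def _run_core(emissions, tau):
    """Hypothetical shared core (toy decay model, NOT FaIR's equations).

    emissions: shape (S, 1, F, t); tau: shape (1, p, F). Returns (S, p, F, t).
    """
    decay = np.exp(-1.0 / tau)
    state = np.zeros((emissions.shape[0], tau.shape[1], emissions.shape[2]))
    out = []
    for i in range(emissions.shape[-1]):
        state = state * decay + emissions[..., i]
        out.append(state)
    return np.stack(out, axis=-1)


def run_scenarios(emissions, tau):
    """Thin interface for a single parameter set.

    emissions: shape (S, F, t); tau: shape (F,). Returns shape (S, F, t).
    """
    if tau.ndim != 1:
        raise ValueError("expected tau with shape (F,)")
    out = _run_core(emissions[:, np.newaxis], tau[np.newaxis, np.newaxis])
    return out[:, 0]


def run_perturbed_ensemble(emissions, tau):
    """Thin interface for many parameter sets.

    emissions: shape (S, F, t); tau: shape (p, F). Returns shape (S, p, F, t).
    """
    if tau.ndim != 2:
        # Deliberately strict: single parameter sets go through run_scenarios.
        raise ValueError("expected tau with shape (p, F)")
    return _run_core(emissions[:, np.newaxis], tau[np.newaxis])


emissions = np.ones((2, 3, 5))  # S=2, F=3, t=5
tau = np.full((4, 3), 10.0)  # p=4, F=3
ensemble = run_perturbed_ensemble(emissions, tau)
single = run_scenarios(emissions, tau[0])
```

Each wrapper only reshapes its inputs and validates them, so the shared core stays the single place where the model equations live.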

njleach commented 3 years ago

Very useful advice. In the GIR repo I did try to make it (more) general, and it ended up pretty messy, with some quirks that I expect new users would find confusing. I'll aim to end up with multiple much thinner / simpler blocks rather than a complicated one-size-fits-all chunk.

One package I already use a lot, and which I think might help make the code easier to follow (especially the multi-dimensional parts), is xarray. Another (minor) benefit could be fewer memory errors when people try to run massive ensembles, thanks to xarray's built-in dask support. If we want to avoid extra dependencies it's certainly not necessary, as numpy is definitely sufficient, but I'm wondering if it might make the internals a bit clearer. What are your thoughts?

znicholls commented 3 years ago

I'm a big fan of xarray so definitely plus one from me (would also mean we could use labelled arrays internally which helps debugging)

chrisroadmap commented 1 year ago

I presume this was fixed by #112. xarray is what the user sees (labelling), numpy is what FaIR sees (fast).
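A minimal sketch of that split, assuming a toy decay model rather than FaIR's actual internals: the user passes labelled xarray objects, the function drops to plain numpy for the time loop, and the result is wrapped back into a labelled array. The function name `run_labelled`, the dimension names and the `tau` parameter are hypothetical and not taken from #112.

```python
import numpy as np
import xarray as xr


def run_labelled(emissions, tau):
    """Labelled in, labelled out; plain numpy in between (toy model only)."""
    # Put the axes in a known order before dropping the labels.
    emms = emissions.transpose("scenario", "specie", "time").values[:, np.newaxis]
    decay = np.exp(-1.0 / tau.transpose("config", "specie").values)[np.newaxis]
    state = np.zeros((emms.shape[0], decay.shape[1], emms.shape[2]))
    out = np.zeros(state.shape + (emms.shape[-1],))
    for i in range(emms.shape[-1]):
        state = state * decay + emms[..., i]
        out[..., i] = state
    # Re-attach the labels so the user can select by name.
    return xr.DataArray(
        out,
        dims=("scenario", "config", "specie", "time"),
        coords={
            "scenario": emissions["scenario"].values,
            "config": tau["config"].values,
            "specie": emissions["specie"].values,
            "time": emissions["time"].values,
        },
    )


species = ["CO2", "CH4", "N2O"]
emissions = xr.DataArray(
    np.ones((2, 3, 5)),
    dims=("scenario", "specie", "time"),
    coords={"scenario": ["ssp126", "ssp585"], "specie": species, "time": np.arange(5)},
)
tau = xr.DataArray(
    np.full((4, 3), 10.0),
    dims=("config", "specie"),
    coords={"config": np.arange(4), "specie": species},
)
result = run_labelled(emissions, tau)
co2_high = result.sel(scenario="ssp585", specie="CO2")  # dims: (config, time)
```

Selecting by label, as in the last line, is the debugging benefit mentioned earlier in the thread; the inner loop never touches xarray.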