Deltares / imod-python

🐍🧰 Make massive MODFLOW models
https://deltares.github.io/imod-python/
MIT License

Water balance calculation is slow...? #873

Closed JaccoHoogewoud closed 8 months ago

JaccoHoogewoud commented 8 months ago

Hi, I calculate the river budgets per day for a given zone and store the values in a list. It works, but it is unexpectedly slow (8 hours for 11 years of daily data on a grid of approximately 320 x 320); running the groundwater model itself hardly takes longer (16 hours).

    import imod
    import pandas as pd

    bdg = imod.idf.open(f'{bdg_naam}_*.idf')
    zone = imod.idf.open('zone.idf')
    # select the needed part of bdg
    bdg = bdg.sel(time=slice('2012-01-01', '2021-01-01'))
    bdg = bdg.where(zone > 0)
    bdg = bdg.sum(dim='layer')  # I have 4 layers
    bdg = bdg.where(bdg > 0)    # only keep the infiltration flux
    result = []
    for datum in bdg.time:
        dagwaarde = bdg.sel(time=datum.values).sum().values.item()
        result.append([pd.Timestamp(datum.values), dagwaarde])

How can I speed this up?

thx, Jacco.

PS: I know this is not really a bug, but perhaps a lack of skill on my part. On the other hand, it might be nice if there were a "how do I" guide on doing efficient water balance calculations.

JoerivanEngelen commented 8 months ago

Hi Jacco,

This is because of lazy evaluation; we wrote about this in the documentation. Lazy evaluation is especially useful for out-of-memory computations: iMOD Python always loads its IDF data as dask arrays, chunked per layer and per time. Loading data into memory at the right moment can speed things up a lot. You can load data into memory manually by calling the .compute() method. It is usually best to do this right after selecting data, provided the selected piece of data fits into memory.
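
To illustrate (a minimal sketch, reusing the bdg_naam glob pattern from your snippet), you can check that the opened data is dask-backed and inspect its chunking before deciding where to call .compute():

    import imod

    # imod.idf.open returns a lazily evaluated, dask-backed DataArray
    bdg = imod.idf.open(f'{bdg_naam}_*.idf')
    print(type(bdg.data))  # a dask array, not a numpy array
    print(bdg.chunks)      # one chunk per layer, per time

    # .compute() replaces the dask array by an in-memory numpy array
    bdg_in_memory = bdg.compute()
    print(type(bdg_in_memory.data))  # numpy.ndarray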

If your selection fits in memory, you can try:

    import imod
    import pandas as pd

    bdg = imod.idf.open(f'{bdg_naam}_*.idf')
    zone = imod.idf.open('zone.idf').compute()
    # select the needed part of bdg and load it into memory
    bdg = bdg.sel(time=slice('2012-01-01', '2021-01-01')).compute()
    bdg = bdg.where(zone > 0)
    bdg = bdg.sum(dim='layer')  # I have 4 layers
    bdg = bdg.where(bdg > 0)    # only keep the infiltration flux
    result = []
    for datum in bdg.time:
        dagwaarde = bdg.sel(time=datum.values).sum().values.item()
        result.append([pd.Timestamp(datum.values), dagwaarde])

I think the code can be optimized further, by combining the two where() calls into one and by making better use of the sum() method:

    import imod
    import pandas as pd

    bdg = imod.idf.open(f'{bdg_naam}_*.idf')
    zone = imod.idf.open('zone.idf').compute()
    # select the needed part of bdg
    bdg = bdg.sel(time=slice('2012-01-01', '2021-01-01'))
    # load into memory after selection
    bdg = bdg.compute()
    bdg = bdg.sum(dim='layer')  # I have 4 layers
    bdg = bdg.where((bdg > 0) & (zone > 0))  # only keep the infiltration flux inside the zone
    dagwaardes = bdg.sum(dim=['y', 'x'])
    timestamps = [pd.Timestamp(t) for t in dagwaardes.coords['time'].values]
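
If you want the same [timestamp, value] pairs your loop produced, you can finish with something like this (a sketch, assuming the snippet above has run; the "infiltration" column name is just illustrative):

    # Combine timestamps and daily sums; dagwaardes is already in
    # memory at this point, so this is cheap.
    df = pd.DataFrame({'time': timestamps, 'infiltration': dagwaardes.values})
    result = df.values.tolist()  # the same [timestamp, value] pairs as the loop
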
JoerivanEngelen commented 8 months ago

I'm converting this issue to a discussion on the discussion board, which is the best place to ask user questions! If you are not sure whether you are running into a bug or doing something wrong, feel free to post an issue.

Huite commented 8 months ago

(I see I'm cross-posting with Joeri...)

Hi @JaccoHoogewoud,

That is indeed ridiculously slow. To be fair, I don't see anything in your example that immediately arouses suspicion, but there are some things that could lead to inefficiencies, and the loop isn't necessary either.

Here's how I would write it, making use of the fact that you can sum over multiple dimensions at once with xarray:

budget = imod.idf.open(f'{bdg_naam}_*.idf').sel(time=slice('2012-01-01', '2021-01-01'))
zone = imod.idf.open("zone.idf")
zone_budget = budget.where(zone > 0)
zone_infiltration = zone_budget.where(zone_budget > 0)
timeseries_infiltration = zone_infiltration.sum(["layer", "y", "x"]).compute()

Everything in the snippet above should be fast except the last line, since that is where the "lazy evaluation" actually happens: nothing is read or computed until .compute() is called.
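
To see for yourself where the time goes, you could time the lazy steps against the final compute (a minimal sketch, assuming the variables from the snippet above):

import time

t0 = time.perf_counter()
lazy = zone_infiltration.sum(["layer", "y", "x"])  # only builds the task graph
t1 = time.perf_counter()
eager = lazy.compute()  # reads the IDFs and does the actual work
t2 = time.perf_counter()

print(f"graph construction: {t1 - t0:.3f} s")  # should be near-instant
print(f"compute:            {t2 - t1:.3f} s")  # where the real cost sits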

If you want a DataFrame timeseries:

timeseries_infiltration = zone_infiltration.sum(["layer", "y", "x"]).to_dataframe()

This will also force a compute, since pandas DataFrames are always eager (they don't support lazy evaluation).
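
As a small usage sketch (the output file name is just an example), the resulting DataFrame is a plain in-memory pandas object, so you can post-process or export it like any other:

df = timeseries_infiltration  # the DataFrame returned by .to_dataframe() above
print(df.head())
df.to_csv("zone_infiltration_timeseries.csv")  # e.g. export for reporting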

The reason your implementation is slow is probably that you manually sel each timestep, force a sum (still all delayed evaluation), and then force the computation with .values.item(). Most of the time, when something like this is very slow, it's because the IDF files are read repeatedly during execution. That behaviour is useful if you have large data, because it lets you work on data bigger than your RAM.

In your case, if all timesteps fit in memory, I expect you can also do this to fix the performance:

    ...
    bdg = bdg.where(zone > 0)
    bdg = bdg.sum(dim='layer').compute()  # load into memory here
    ...

But I'd try the snippet I posted above first.

I wrote this part of the documentation to give a little idea of the delayed evaluation: https://deltares.github.io/imod-python/user-guide/06-lazy-evaluation.html