ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Using Dask distributed may slow down preprocessor steps running before concatenate #2073

Open sloosvel opened 1 year ago

sloosvel commented 1 year ago

With the great addition of support for configuring Dask distributed in #2049, I tried to run a test recipe using our model's high-res configuration. The model files are split into chunks of one year.

High-res data with the same resolution is also available at DKRZ, for example here: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ/HighResMIP/EC-Earth-Consortium/EC-Earth3P-HR/control-1950/r1i1p2f1/Omon/tos/gn/v20181119

So the preprocessor steps load, fix_metadata and concatenate (fix_file can be ignored) have to be applied to multiple iris cubes before all the data is gathered into a single cube by the concatenation. High-res data on irregular grids has the extra challenge that the coordinate and bounds arrays are quite heavy, and if they get realised multiple times they can make your run hang due to memory issues (see https://github.com/SciTools/iris/issues/5115).
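For a rough sense of scale, here is a back-of-the-envelope estimate; the grid size is an approximation on my part (an ORCA025-type grid of roughly 1442 × 1021 points), not an exact figure:

```python
# Rough memory estimate of the 2-D auxiliary coordinate payload for an
# ORCA025-type grid (~1442 x 1021 points, an approximate figure).
ny, nx = 1021, 1442
points = ny * nx * 8          # float64 latitude or longitude points
bounds = ny * nx * 4 * 8      # four cell corners per point
per_coord = points + bounds
total = 2 * per_coord         # latitude + longitude
print(f"~{total / 1e6:.0f} MB per cube")  # ~118 MB, realised again for every yearly file
```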

Running a recipe with two variables (2 tasks, without a dask.yml file) and 92 years of data on a single default node of our infrastructure, these steps typically take:

2023-05-30 11:09:34,169 UTC [434129] DEBUG   Running preprocessor step load
... Less than a minute later
2023-05-30 11:10:17,494 UTC [434129] DEBUG   Running preprocessor step fix_metadata
.... Almost a minute later
2023-05-30 11:11:09,875 UTC [434129] DEBUG   Running preprocessor step concatenate
.... 5 minutes later
2023-05-30 11:16:03,287 UTC [434129] DEBUG   Running preprocessor function 'cmor_check_metadata' on the data

Whereas on a SLURMCluster using two regular nodes with this configuration (not sure whether it's optimal):

  cluster:
    type: dask_jobqueue.SLURMCluster
    account: bsc32
    cores: 8
    memory: 32GB 
    local_directory: "$TMPDIR"
    n_workers: 8
    processes: 4
    job_extra_directives: ["--exclusive"]
    job_directives_skip: ['--mem', '-p']
    walltime: '08:00:00'
    interface: "eno1" # for some reason I cannot use ib0 nor ib1
2023-05-30 15:57:53,666 UTC [74260] DEBUG   esmvalcore.preprocessor:369 Running preprocessor step load
... Less than a minute later
2023-05-30 15:58:39,421 UTC [74260] DEBUG   esmvalcore.preprocessor:369 Running preprocessor step fix_metadata
... More than a minute later
2023-05-30 16:00:15,874 UTC [74260] DEBUG   Running preprocessor step concatenate
... 7 minutes later
2023-05-30 16:07:10,059 UTC [74259] DEBUG   Running preprocessor step cmor_check_data

And using even more resources, with 4 nodes but keeping the other parameters the same (again, not sure if optimal):

  cluster:
    type: dask_jobqueue.SLURMCluster
    account: bsc32
    cores: 8
    memory: 32GB
    local_directory: "$TMPDIR"
    n_workers: 16
    processes: 4
    job_extra_directives: ["--exclusive"]
    job_directives_skip: ['--mem', '-p']
    walltime: '08:00:00'
    interface: "eno1"
2023-05-30 14:10:56,757 UTC [5985] DEBUG   esmvalcore.preprocessor:369 Running preprocessor step load
...Less than a minute later
2023-05-30 14:11:39,935 UTC [5985] DEBUG   esmvalcore.preprocessor:369 Running preprocessor step fix_metadata
... Two minutes later
2023-05-30 14:13:46,064 UTC [5985] DEBUG   esmvalcore.preprocessor:369 Running preprocessor step concatenate
... Eight minutes later
2023-05-30 14:21:06,361 UTC [5985] DEBUG   esmvalcore.preprocessor:369 Running preprocessor step cmor_check_metadata

Our VHR data (which is not available on ESGF) behaves even worse because the files are split into chunks of one month for monthly variables, so you can get stuck concatenating files for 30 minutes. My guess is that the loop over the cubes maybe does not scale well? All examples were run on nodes requested exclusively for the jobs. But I also don't know whether the cluster configuration is just plain bad. I tried many other configurations (less memory, more cores, more processes, more nodes, a combination of more of everything) and none seemed to perform any better.

I also had to run this with the changes in https://github.com/SciTools/iris/pull/5142, otherwise it would not have been possible to use our default nodes. Requesting higher-memory nodes is not always a straightforward solution because it may leave your jobs in the queue for several days.

bouweandela commented 1 year ago
`interface: "eno1" # for some reason I cannot use ib0 nor ib1`

Maybe your compute cluster doesn't have Infiniband? Could you have a look at the network interfaces available on the compute nodes? You can list them with `srun ip link show` (assuming you're using Slurm). Network interfaces starting with `en` are Ethernet interfaces, so they may not be the fastest.

sloosvel commented 1 year ago

It should have infiniband according to the technical specs, and it shows in the network interfaces:

5: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 256
    link/infiniband a0:00:02:20:fe:80:00:00:00:00:00:00:5c:f3:fc:00:00:05:47:5c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
6: ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc mq state DOWN mode DEFAULT group default qlen 256
    link/infiniband a0:00:03:00:fe:80:00:00:00:00:00:00:5c:f3:fc:00:00:05:47:5d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff

But setting either interface (though ib1 seems to be down) leads to this error:

RuntimeError: Cluster failed to start: interface 'ib0' doesn't have an IPv4 address

I guess it would be best to ask the sysadmins.

In any case, whether or not the connection between nodes is faster with Infiniband, I think the two-by-two concatenation in concatenate may not be very optimal when dealing with multiple files.

bouweandela commented 1 year ago

Ethernet networks are a lot slower than Infiniband, so communicating the data will be slower, which could explain the performance issue reported here. My recommendation would be to try and get Infiniband configured. Are you running ESMValCore on the head node or on a compute node? If you're running it on the head node, it may have a different network setup from the compute nodes. In that case, you can configure the network interface separately for the head node, which runs the scheduler, and for the workers, as in this example: https://github.com/dask/dask-jobqueue/issues/382#issuecomment-594654803.
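For reference, a minimal dask_jobqueue sketch of that pattern (the interface names below are placeholders; use whatever `ip link show` reports on your machines):

```python
from dask_jobqueue import SLURMCluster

# Workers run on compute nodes that have Infiniband, while the scheduler runs
# on the head node, which may only have an Ethernet interface.
cluster = SLURMCluster(
    cores=8,
    memory="32GB",
    interface="ib0",                          # network interface used by the workers
    scheduler_options={"interface": "eno1"},  # interface used by the scheduler
)
cluster.scale(jobs=2)
```

If I remember correctly, the entries under `cluster:` in the dask.yml are passed as keyword arguments to the class given by `type`, so a nested `scheduler_options` entry should work there as well.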

valeriupredoi commented 1 year ago

not to be laughed at please - but isn't the network card more important than the actual cable/connection type? :grin: Also, we should not assume any Infini or Efini or Fini types of connection in our suggested settings (Efini is a Mazda, not a connection OK :grin: )

sloosvel commented 1 year ago

Well, I managed to use Infiniband by running the recipe on a compute node instead of the login node, but there was not much improvement either. In any case, I think there are many issues in the concatenation. The graph in the dashboard looks quite bad when the concatenation starts.

For instance, on our side, we are realising the data if the cubes overlap:

https://github.com/ESMValGroup/ESMValCore/blob/ac98c69bcf1eba9afd6e6183304183337c5767c4/esmvalcore/preprocessor/_io.py#L533

I think it would be worth including the concatenation in the list of performance issues that are pending to be solved.
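To illustrate the cost, a minimal sketch (the file name is hypothetical): indexing a cube keeps its data lazy, while touching `cube.data` loads everything into memory, which is what happens on that line when cubes overlap:

```python
import iris

# Hypothetical yearly file; any NetCDF file that iris can load will do.
cube = iris.load_cube("tos_Omon_EC-Earth3P-HR_control-1950_r1i1p2f1_gn_195001-195012.nc")
print(cube.has_lazy_data())     # True: the data is a dask array, nothing read yet

# Slicing off an overlapping period by indexing keeps the data lazy...
trimmed = cube[12:]
print(trimmed.has_lazy_data())  # still True

# ...whereas accessing cube.data realises the full array in memory,
# which for HR data can be several GB per cube.
_ = cube.data
print(cube.has_lazy_data())     # now False
```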

valeriupredoi commented 1 year ago

that's an overlap-gated cube @sloosvel - not the biggest problem you have on your hands, it's usually a few years-long at most :grin:

valeriupredoi commented 1 year ago

I think that it would be worth it to include the concatenation in the list of performance issues that are pending to be solved

Indeed!

sloosvel commented 1 year ago

that's an overlap-gated cube @sloosvel - not the biggest problem you have on your hands, it's usually a few years-long at most

It's happening to us that we have a cube with 18000 timesteps getting realised...

valeriupredoi commented 1 year ago

that's an overlap-gated cube @sloosvel - not the biggest problem you have on your hands, it's usually a few years-long at most

It's happening to us that we have a cube with 18000 timesteps getting realised...

oh boy, that's one chunky overlap; you guys using millisecond means data? :rofl:

bouweandela commented 1 year ago

we are realising the data if the cubes overlap

That is something that would need to be fixed first; distributed really doesn't work well if the computation is not fully lazy. I made a pull request to fix it: #2109.

sloosvel commented 11 months ago

As discussed in past meetings, the main issue with the concatenation of many cubes, especially if they contain HR data, comes from the fact that the auxiliary coordinate array values need to be compared to ensure they are equal. This comparison happens sequentially in iris and requires computing the values. We agreed that we could discuss with the iris team whether this comparison could be performed once across all arrays instead of sequentially. In the meantime, since iris allows skipping the checks on auxiliary arrays, among other additional data in a cube, we agreed on tying the strictness of the concatenation checks to the check_level flag.

I will open a PR with the changes.
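As a sketch of the relaxed mode (hypothetical file names; `check_aux_coords` is the existing iris switch, while the mapping to check_level is what the PR will add):

```python
import iris

# Hypothetical set of yearly files for one variable.
cubes = iris.load_raw("tos_Omon_EC-Earth3P-HR_control-1950_*.nc")

# Strict (default): neighbouring cubes' auxiliary coordinate arrays are
# compared pair by pair, which computes the heavy 2-D lat/lon arrays.
strict = cubes.concatenate_cube()

# Relaxed: skip the auxiliary coordinate comparison; this is roughly what a
# lower check_level would map to.
relaxed = cubes.concatenate_cube(check_aux_coords=False)
```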

bouweandela commented 11 months ago

Did you also open the iris issue? I couldn't find it yet.

bouweandela commented 1 month ago

Iris issue opened: https://github.com/SciTools/iris/issues/5750

bouweandela commented 5 days ago

@sloosvel I'm working on parallel coordinate comparison in Iris in https://github.com/SciTools/iris/pull/5926. Would you have time to try it out and provide me with some feedback?