Open sloosvel opened 1 year ago
`interface: "eno1" # for some reason I cannot use ib0 nor ib1`
Maybe your compute cluster doesn't have Infiniband? Could you have a look at the network interfaces you have available on the compute nodes? You can list them with `srun ip link show` (assuming you're using slurm). Network interfaces starting with `en` are ethernet interfaces, so may not be the fastest.
It should have infiniband according to the technical specs, and it shows in the network interfaces:
5: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 256
link/infiniband a0:00:02:20:fe:80:00:00:00:00:00:00:5c:f3:fc:00:00:05:47:5c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
6: ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc mq state DOWN mode DEFAULT group default qlen 256
link/infiniband a0:00:03:00:fe:80:00:00:00:00:00:00:5c:f3:fc:00:00:05:47:5d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
But setting either interface in the configuration (even though ib1 seems to be down) leads to this error:
RuntimeError: Cluster failed to start: interface 'ib0' doesn't have an IPv4 address
I guess it would be best to ask the sysadmins.
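The error above is about a missing IPv4 address, which `ip link show` does not display. A minimal sketch for checking this, assuming the output of `ip -o -4 addr show` (one line per IPv4 address) has been captured on a compute node, e.g. with `srun ip -o -4 addr show`. The helper name and the sample text are made up for illustration; they are not output from this cluster.

```python
def interfaces_with_ipv4(ip_output):
    """Map interface name -> first IPv4 address found in `ip -o -4 addr show` output."""
    result = {}
    for line in ip_output.splitlines():
        fields = line.split()
        # One-line format: "<index>: <name> inet <address>/<prefix> ..."
        if len(fields) >= 4 and fields[2] == "inet":
            result.setdefault(fields[1], fields[3].split("/")[0])
    return result


# Made-up sample: ib0 is absent because it has no IPv4 address assigned.
SAMPLE = """\
1: lo    inet 127.0.0.1/8 scope host lo
2: eno1    inet 10.0.0.5/24 brd 10.0.0.255 scope global eno1
"""

if __name__ == "__main__":
    print(interfaces_with_ipv4(SAMPLE))
```

An interface that is UP in `ip link show` but missing from this mapping would likely trigger the "doesn't have an IPv4 address" error when used as `interface`.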
In any case, whether or not the connection between nodes is faster using infiniband, I think the two-by-two concatenation in `concatenate` may not be optimal when dealing with multiple files.
Ethernet networks are a lot slower than Infiniband, so communicating the data will be slower and could explain the performance issue reported here. My recommendation would be to try and get Infiniband configured. Are you running ESMValCore on the head node or on a compute node? If you're running it on the head node, it is possible that it has a different network setup from the compute nodes. In that case, you can configure the network interface separately for the head node which is running the scheduler and for the workers as in this example: https://github.com/dask/dask-jobqueue/issues/382#issuecomment-594654803.
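A rough sketch of that split configuration, assuming `dask_jobqueue.SLURMCluster` accepts `interface` for the workers and `scheduler_options={"interface": ...}` for the scheduler, as described in the linked comment. The helper name is hypothetical, the interface names are the ones mentioned in this thread, and the `cores`/`memory` values are placeholders.

```python
def slurm_cluster_kwargs(worker_interface, scheduler_interface, **extra):
    """Build keyword arguments with separate interfaces for workers and scheduler."""
    kwargs = dict(extra)
    # Interface used by the workers on the compute nodes.
    kwargs["interface"] = worker_interface
    # Interface used by the scheduler, e.g. on a head node without Infiniband.
    scheduler_options = dict(kwargs.get("scheduler_options", {}))
    scheduler_options["interface"] = scheduler_interface
    kwargs["scheduler_options"] = scheduler_options
    return kwargs


if __name__ == "__main__":
    kwargs = slurm_cluster_kwargs("ib0", "eno1", cores=8, memory="32GiB")
    print(kwargs)
    # With dask-jobqueue installed, this could then be passed on as:
    # from dask_jobqueue import SLURMCluster
    # cluster = SLURMCluster(**kwargs)
```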
not to be laughed at please - but isn't the network card more important than the actual cable/connection type? :grin: Also, we should not assume any Infini or Efini or Fini types of connection in our suggested settings (Efini is a Mazda, not a connection OK :grin: )
Well I managed to use infiniband by running the recipe in a compute node instead of in the login node, not much improvement either. In any case, I think there are many issues in the concatenation. The graph in the dashboard looks quite bad when the concatenation starts.
For instance from our side, we are realising the data if the cubes overlap:
I think that it would be worth it to include the concatenation in the list of performance issues that are pending to be solved
that's an overlap-gated cube @sloosvel - not the biggest problem you have on your hands, it's usually a few years-long at most :grin:
I think that it would be worth it to include the concatenation in the list of performance issues that are pending to be solved
Indeed!
that's an overlap-gated cube @sloosvel - not the biggest problem you have on your hands, it's usually a few years-long at most
It's happening to us that we have a cube with 18000 timesteps getting realised...
oh boy, that's one chunky overlap; you guys using millisecond means data? :rofl:
we are realising the data if the cubes overlap
That is something that would need to be fixed first: `distributed` really doesn't work well if the computation is not fully lazy. I made a pull request to fix it: #2109.
As discussed in past meetings, the main issue with the concatenation of many cubes, especially if they contain HR data, comes from the fact that auxiliary coordinate array values need to be compared to ensure they are equal. This comparison happens sequentially in iris and requires a `compute` of the values. We agreed that we could ask the iris team to consider performing this comparison only once across all arrays, instead of sequentially. In the meantime, since iris allows ignoring the checks on auxiliary arrays, among other additional data in a cube, we agreed on tying the strictness of the concatenation checks to the `check_level` flag.
I will open a PR with the changes.
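A minimal sketch of what tying the checks to `check_level` could look like; this is not necessarily what the PR does. The `CheckLevels` enum below is a local stand-in with assumed values, and the choice of `DEFAULT` as the threshold is an assumption. The keyword names `check_aux_coords`, `check_cell_measures` and `check_ancils` are arguments of `iris.cube.CubeList.concatenate`.

```python
from enum import IntEnum


class CheckLevels(IntEnum):
    """Local stand-in for the check levels (names and values are assumptions)."""

    DEBUG = 0
    STRICT = 1
    DEFAULT = 2
    RELAXED = 3
    IGNORE = 4


def concatenate_kwargs(check_level):
    """Return concatenate keyword arguments for the given check level."""
    # Assumption: keep the expensive array comparisons up to DEFAULT,
    # and skip them for more relaxed levels.
    strict = check_level <= CheckLevels.DEFAULT
    return {
        "check_aux_coords": strict,
        "check_cell_measures": strict,
        "check_ancils": strict,
    }


if __name__ == "__main__":
    print(concatenate_kwargs(CheckLevels.RELAXED))
    # With iris installed, usage could look like:
    # result = iris.cube.CubeList(cubes).concatenate(**concatenate_kwargs(level))
```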
Did you also open the iris issue? I couldn't find it yet.
Iris issue opened https://github.com/SciTools/iris/issues/5750
@sloosvel I'm working on parallel coordinate comparison in Iris in https://github.com/SciTools/iris/pull/5926, would you have time to try it out and provide me with some feedback?
With the great introduction of the possibility of configuring Dask distributed in #2049, I tried to run a test recipe using our model's high res configuration. The model files are split in chunks of one year.
High res data with the same resolution is also available at DKRZ, for example here:
/work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ/HighResMIP/EC-Earth-Consortium/EC-Earth3P-HR/control-1950/r1i1p2f1/Omon/tos/gn/v20181119
So preprocessor steps (`fix_file` can be ignored) `load`, `check_metadata` and `concatenate` have to be applied to multiple iris cubes before all the data is gathered in a single cube after concatenation. High res data in irregular grids also has the extra challenge that the coordinate and bound arrays are quite heavy and if they get realised multiple times, it can make your run hang due to memory issues (see https://github.com/SciTools/iris/issues/5115).
Running a recipe with two variables (2 tasks, without a dask.yml file), 92 years, in a single default node in our infrastructure tends to take in these steps:
Whereas in a SLURMCluster with this configuration (not sure if it's an optimal configuration) using two regular nodes:
And using even more resources, with 4 nodes but keeping the other parameters the same (again, not sure if optimal):
Our VHR data (which is not available on ESGF) behaves even worse because the files are split in chunks of one month for monthly variables. So you can get stuck concatenating files for 30 min. My guess is that the loop over the cubes maybe does not scale well? All examples are run in nodes that got requested exclusively for the jobs. But I also don't know if the cluster configuration is just plain bad. I tried many other configurations (less memory, more cores, more processes, more nodes, a combination of more everything) and none seemed to get better though.
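To illustrate why a loop that merges cubes two at a time might scale poorly, here is a toy model with plain numbers instead of iris cubes. It assumes each pairwise step has a cost proportional to the length of its merged result (for example because coordinates of the accumulated cube get handled again), which is an assumption and not a measurement of what iris actually does. The function names are made up for this sketch.

```python
from itertools import accumulate


def pairwise_cost(chunk_lengths):
    """Total time steps handled when folding chunks together two at a time."""
    running = list(accumulate(chunk_lengths))
    # Step k merges the accumulated result with chunk k and yields
    # running[k] time steps, so the cost grows with every step.
    return sum(running[1:])


def single_pass_cost(chunk_lengths):
    """Total time steps handled when all chunks are joined in one pass."""
    return sum(chunk_lengths)


if __name__ == "__main__":
    yearly = [12] * 92          # 92 yearly files of monthly data
    monthly = [1] * (12 * 92)   # the same period split in monthly files
    for name, chunks in (("yearly", yearly), ("monthly", monthly)):
        print(name, pairwise_cost(chunks), single_pass_cost(chunks))
```

In this model the single-pass cost stays the same for both file layouts, while the pairwise cost grows roughly with the number of files times the total length, so monthly files come out much worse than yearly ones.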
I also had to run this with the changes in https://github.com/SciTools/iris/pull/5142, otherwise it would not have been possible to use our default nodes. Requesting higher memory nodes is not always a straightforward solution because it may leave your jobs in the queue for several days.