esm-tools / esm_tools

Simple Infrastructure for Earth System Simulations
https://esm-tools.github.io/
GNU General Public License v2.0

Increase speed of tidying up after the simulation has finished #811

Open tsemmler05 opened 1 year ago

tsemmler05 commented 1 year ago

On aleph I ran one year of simulation with large amounts of output:

/scratch/awiiccp5/1950e/

To compare, I also ran one year of simulation with limited output:

/scratch/awiiccp5/1950c_limitedoutput/

In the case of the limited output the result of

stat /scratch/awiiccp5/1950c_limitedoutput/outdata/fesom/vice.fesom.1900.nc

is:

stat /scratch/awiiccp5/1950c_limitedoutput/outdata/fesom/vice.fesom.1900.nc
  File: '/scratch/awiiccp5/1950c_limitedoutput/outdata/fesom/vice.fesom.1900.nc'
  Size: 4614185552   Blocks: 9012096   IO Block: 4194304   regular file
Device: cdb43cdah/3451141338d   Inode: 720577190936304297   Links: 1
Access: (0644/-rw-r--r--)   Uid: (20907/awiiccp5)   Gid: (14907/ iccp2)
Access: 2022-08-26 19:15:49.000000000 +0900
Modify: 2022-08-26 19:11:36.000000000 +0900
Change: 2022-08-26 19:24:06.000000000 +0900
 Birth: -

In the case of the large output I get:

stat /scratch/awiiccp5/1950e/outdata/fesom/vice.fesom.1900.nc
  File: '/scratch/awiiccp5/1950e/outdata/fesom/vice.fesom.1900.nc'
  Size: 4614185552   Blocks: 9012096   IO Block: 4194304   regular file
Device: cdb43cdah/3451141338d   Inode: 720577194308531896   Links: 1
Access: (0644/-rw-r--r--)   Uid: (20907/awiiccp5)   Gid: (14907/ iccp2)
Access: 2022-08-30 08:34:45.000000000 +0900
Modify: 2022-08-30 08:27:57.000000000 +0900
Change: 2022-08-30 12:04:29.000000000 +0900
 Birth: -

or

stat /scratch/awiiccp5/1950e/outdata/fesom/salt.fesom.1900.nc
  File: '/scratch/awiiccp5/1950e/outdata/fesom/salt.fesom.1900.nc'
  Size: 193795468497   Blocks: 378506920   IO Block: 4194304   regular file
Device: cdb43cdah/3451141338d   Inode: 720577194308533609   Links: 1
Access: (0644/-rw-r--r--)   Uid: (20907/awiiccp5)   Gid: (14907/ iccp2)
Access: 2022-08-30 10:09:21.000000000 +0900
Modify: 2022-08-30 08:27:57.000000000 +0900
Change: 2022-08-30 12:04:30.000000000 +0900
 Birth: -

more /scratch/awiiccp5/1950e/log/1950e_awicm3.log gives:

Tue Aug 30 08:28:34 2022 : # Beginning of Experiment 1950e
Tue Aug 30 08:28:34 2022 : tidy 1 1900-01-01T00:00:00 652375.sdb - start
Tue Aug 30 08:28:34 2022 : tidy 1 1900-01-01T00:00:00 652375.sdb - start
Tue Aug 30 12:04:50 2022 : prepcompute 2 1901-01-01T00:00:00 652375.sdb - start
Tue Aug 30 12:07:31 2022 : prepcompute 2 1901-01-01T00:00:00 652375.sdb - done
Tue Aug 30 12:07:31 2022 : tidy 2 1901-01-01T00:00:00 652375.sdb - done
Tue Aug 30 12:07:31 2022 : observe_compute 2 1900-01-01T00:00:00 652375.sdb - done
Tue Aug 30 12:07:37 2022 : compute 1 1900-01-01T00:00:00 82548 - done
Tue Aug 30 12:45:14 2022 : compute 2 1901-01-01T00:00:00 652375.sdb - start
Tue Aug 30 12:45:57 2022 : observe_compute 2 1901-01-01T00:00:00 652838.sdb - start

Between 08:27:57 and 12:04:29 esm_runscripts is only tidying up and accessing some of the FESOM output data, while 350 nodes stay blocked for that whole period. For comparison: the computation of one year takes 05:20 hours, while the tidying up takes 03:36 hours. In the case of the limited output the situation is not as bad (13 minutes of tidying up), but it could still be improved. The question is: for what purpose are the FESOM output data accessed? It looks as if the FESOM output files are not only moved from one directory to another, but that something is also being done to the data themselves. Is there a possibility to optimize this? It would also help if ESM-Tools wrote out time stamps, so one could see in which ESM-Tools process the time is lost.
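For illustration, a minimal sketch of the kind of timestamped per-phase logging meant here (not ESM-Tools code; timed_phase and the wrapped move_fesom_output call are hypothetical names):

```python
import time
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def timed_phase(name, logfile="tidy_timing.log"):
    """Write timestamped start/done lines and the elapsed wall time for one phase."""
    start = time.time()
    with open(logfile, "a") as log:
        log.write(f"{datetime.now():%a %b %d %H:%M:%S %Y} : {name} - start\n")
    try:
        yield
    finally:
        elapsed = time.time() - start
        with open(logfile, "a") as log:
            log.write(
                f"{datetime.now():%a %b %d %H:%M:%S %Y} : {name} - done "
                f"({elapsed:.1f} s)\n"
            )


# Usage around the step that handles FESOM output (hypothetical function name):
# with timed_phase("tidy: move fesom outdata"):
#     move_fesom_output()
```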

JanStreffing commented 1 year ago

Perhaps we can make use of the fact that we are on multi-processor machines. In shell I used to look for loops with roughly as many elements as there are threads, e.g. a loop over the output variables, and put a & at the end of the mv command so the moves run in the background.

The equivalent in Python might be to wrap the move command (e.g. os.system(f"mv {file}")) in a dask delayed section. You can find an example here: https://github.com/JanStreffing/2020_AWICM3_GMD_PAPER/blob/main/python/hovm_difference-cdo.ipynb, block 6.
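As a rough illustration of that idea, here is a minimal sketch (not ESM-Tools code; the file list, target directory, and worker count are made-up examples): each move becomes a dask.delayed task and the tasks are computed together on a thread pool, which is usually enough since moving files is I/O-bound.

```python
import shutil
from pathlib import Path

import dask


@dask.delayed
def move_file(src, dst_dir):
    """Move one file into dst_dir and return the new path."""
    dst = Path(dst_dir) / Path(src).name
    shutil.move(src, str(dst))
    return str(dst)


def move_files_in_parallel(files, dst_dir, n_workers=8):
    # One lazy task per file; a threaded scheduler is enough because the
    # work is I/O-bound, not CPU-bound.
    tasks = [move_file(f, dst_dir) for f in files]
    return dask.compute(*tasks, scheduler="threads", num_workers=n_workers)
```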

As this would be something in the backend, I'd rather not do the integration myself though. Any takers? :) I guess we could save maybe a factor of 10 here.

mandresm commented 1 year ago

The parallel feature will be added during the refactoring of the file dictionaries. I have added it to the project so that we don't forget about it.

joakimkjellsson commented 1 year ago

Just to comment: we had this problem with FOCI-OpenIFS using a 1/12° ocean grid. Our "solution" was to write output directly to the outdata dir rather than the work dir. In the XIOS xml file this means doing:

<file_definition type="one_file" name="../../outdata/nemo/@expname@_@freq@_@startdate@_@enddate@" sync_freq="1d" min_digits="4">

mandresm commented 1 year ago

Thanks @joakimkjellsson, this goes a little against the safety of the work directory, but I agree, sometimes we need work-arounds. The good news is that many of these problems will be solved once we release the new filedicts syntax/module and offer the user two file structures: the old ultra-safe structure (run_DATE/[work, outdata, restarts, ...]) or the less safe but faster one where run_DATE itself is the work folder (no file duplication inside run_DATE).

mandresm commented 1 year ago

@tsemmler05, was this specific problem already solved by telling ESM-Tools to move the files instead of copying them?

Even if that's the case, I'll keep the issue open until we incorporate the parallelization suggested by @JanStreffing into ESM-Tools.

JanStreffing commented 1 year ago

Though it was not needed here, I think it is still nice to have. There will always be some files to copy around, because they come from the pool and are altered at runtime.