clusterinthecloud / terraform

Terraform config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

job code and data storage location #20

Closed jtsaismith closed 5 years ago

jtsaismith commented 5 years ago

The file system on /mnt/shared is quite slow. Is it possible to place the code and data on each node's local storage, to improve data / file retrieval by each node? If yes, what are the steps to do so?

christopheredsall commented 5 years ago

We will shortly be able to give a procedure for using Gluster. In the meantime, yes, the NFS filesystem is not optimised for performance. As a regular (not opc) user you can write to /tmp on the compute nodes, which is a disk-backed filesystem. Also, by default Red Hat-derived operating systems create a temporary memory-based filesystem (tmpfs) under /run/user/ named for the user ID.

E.g. this is on a BM.Standard.E2.64 shape

[ce16990@bm-standard-e2-64-ad3-0009 ~]$ df -h /tmp /run/user/$(id -u)
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        39G  2.6G   36G   7% /
tmpfs            51G     0   51G   0% /run/user/10002
[ce16990@bm-standard-e2-64-ad3-0009 ~]$ touch /tmp/test /run/user/$(id -u)/test
[ce16990@bm-standard-e2-64-ad3-0009 ~]$ ls -l /tmp/test /run/user/$(id -u)/test
-rw-r--r--. 1 ce16990 users 0 Apr 29 21:09 /run/user/10002/test
-rw-r--r--. 1 ce16990 users 0 Apr 29 21:09 /tmp/test

At the end of the Slurm job script you can copy the output back to somewhere under /mnt/shared.
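
A job script along these lines would stage the inputs onto node-local disk and copy the results back at the end (a rough sketch only; the project path, train.py, the output/ directory and the job settings are placeholders to adapt to your own layout):

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48

# Node-local scratch space on the disk-backed /tmp filesystem
SCRATCH=/tmp/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH"

# Stage code and data from the shared NFS filesystem onto local disk
cp -r /mnt/shared/$USER/myproject "$SCRATCH/"
cd "$SCRATCH/myproject"

# Run the workload against node-local storage (train.py is a placeholder)
srun python train.py

# Copy the results back to the shared filesystem before the job ends
RESULTS=/mnt/shared/$USER/results/$SLURM_JOB_ID
mkdir -p "$RESULTS"
cp -r output/ "$RESULTS/"

# Tidy up the node-local scratch space
rm -rf "$SCRATCH"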

chaoyanghe commented 5 years ago

I compared computation performance between your compute node and our physical server, and the gap is 32 s (Oracle Cloud) vs 10 ms (our lab's). I don't know why; the CPU configurations seem similar. Is this caused by the file sharing issue? (I put my project code and data in the /mnt/shared directory and then submit my job with "sbatch **.slm".)

Oracle Cloud compute node (48 Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz): the wall clock time for a single training round is around 32 s.

6 - 2019-04-29 21:07:10,855:fl_client_manager.py:307:INFO: ###START TRAINING. Round: 1/100 ###
6 - 2019-04-29 21:07:42,168:fl_client_manager.py:377:INFO: ###END TRAINING. Round: 1/100 ###

Our lab's physical server (40 Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz): the wall clock time for a single training round is around 10 ms.

4 - 2019-04-29 14:08:12,294:fl_client_manager.py:307:INFO: ###START TRAINING. Round: 15/100 ###
4 - 2019-04-29 14:08:12,303:fl_client_manager.py:377:INFO: ###END TRAINING. Round: 15/100 ###
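
To check whether shared-filesystem I/O is the bottleneck, one option is to time an identical write on /mnt/shared and on node-local /tmp and compare the throughput dd reports (a rough check only; the test file paths and sizes are placeholders):

# Write 1 GiB to the shared NFS filesystem, flushing to disk before dd reports the rate
dd if=/dev/zero of=/mnt/shared/$USER/ddtest bs=1M count=1024 conv=fdatasync
rm /mnt/shared/$USER/ddtest

# Repeat the same write on the node-local /tmp filesystem for comparison
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync
rm /tmp/ddtest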

jtsaismith commented 5 years ago

Thanks for the info on using /tmp and /run/user, Christopher! Chaoyan modified his function to avoid writing logs to a local file; he now routes the logs to the MPI console using Python's logging.info(). This boosted the performance of each training round tremendously, and he's seeing MUCH better performance on the Oracle Cloud cluster than he did on the physical server in his lab.