N8-CIR-Bede / documentation

Documentation for the N8CIR Bede Tier 2 HPC facility
https://bede-documentation.readthedocs.io/en/latest/

Revised WMLCE + Open-CE documentation. #102

Closed ptheywood closed 2 years ago

ptheywood commented 2 years ago

Improves the state of WMLCE documentation, adds Open-CE documentation and updates Tensorflow/PyTorch to reflect this change.

Closes #63 Closes #72

ptheywood commented 2 years ago

This may also have to include updates to general miniconda installation instructions.

The current instructions for installing into /nobackup/projects/<project> will actually install into /users/.

sh Miniconda3-latest-Linux-ppc64le.sh -b -p $(pwd)/miniconda

The above installs silently into a miniconda directory within the current working directory and does not update the user's .bashrc, which may or may not be desirable.
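A minimal sketch of pointing the install at the project directory explicitly instead; the <project> name is a placeholder as above, and the installer URL and activation steps are assumptions rather than the documented procedure:

# download the ppc64le installer and install it under the project directory
cd /nobackup/projects/<project>
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh
sh Miniconda3-latest-Linux-ppc64le.sh -b -p /nobackup/projects/<project>/miniconda

# activate conda for the current shell without editing ~/.bashrc
source /nobackup/projects/<project>/miniconda/etc/profile.d/conda.sh
conda activate base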

ptheywood commented 2 years ago

Additionally, the current wmlce instructions add the wmlce channel to the user's global conda channel configuration, not per environment.

This breaks subsequent use of Open-CE. There is likely a need to add instructions on how to deal with this (i.e. if an UnsatisfiableError is raised).

It would also be better to adjust the wmlce instructions to only set the channel within the environment. The same applies to Open-CE (via the --env flag on conda config?), as sketched below.
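For reference, a minimal sketch of scoping channel configuration to a single environment; the environment name and <channel_url> are placeholders, not the exact channels from the current instructions:

# create and activate the environment first
conda create -y --name wmlce-env
conda activate wmlce-env

# --env writes to the active environment's .condarc rather than ~/.condarc,
# so the channel does not leak into other environments
conda config --env --prepend channels <channel_url>
conda config --env --set channel_priority strict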

ptheywood commented 2 years ago

The WMLCE tensorflow-benchmarks/resnet50 benchmark script is not included in Open-CE. It is Apache 2 licensed by Nvidia, but the IBM modifications carry the following restrictive / unclear licence, so I am not going to make this available for use outside of WMLCE, and can't use it for a comparative benchmark.

Licensed Materials - Property of IBM (C) Copyright IBM Corp. 2020. All Rights Reserved. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

The existing WMLCE benchmark on RHEL 7 with ddlrun and 4 GPUs in a single node behaves as described, although it suspiciously took 15 seconds less than the 5 hours I requested (compared to the 4 hours previously suggested).
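A sketch of the kind of SLURM + bede-ddlrun submission used; the account, environment name, benchmark path and data/results directories are placeholders / assumptions, not the exact script:

#!/bin/bash
#SBATCH --account=<project>     # Bede project account
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:4            # all 4 V100s in the node
#SBATCH --time=05:00:00

# activate the WMLCE conda environment
source /nobackup/projects/<project>/miniconda/etc/profile.d/conda.sh
conda activate wmlce-env

# launch the benchmark across the node's GPUs via IBM DDL
bede-ddlrun python tensorflow-benchmarks/resnet50/main.py \
    --mode=train_and_evaluate \
    --data_dir=<imagenet_tfrecords> \
    --results_dir=<results_dir>

The tail of the resulting job output: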

:::NVLOGv0.2.3 resnet 1646176253.965135813 (training_hooks.py:101) imgs_per_sec: 3737.334556206633
:::NVLOGv0.2.3 resnet 1646176253.967950821 (training_hooks.py:102) cross_entropy: 1.9711014032363892
:::NVLOGv0.2.3 resnet 1646176253.970749617 (training_hooks.py:103) l2_loss: 0.3724885880947113
:::NVLOGv0.2.3 resnet 1646176253.973543644 (training_hooks.py:104) total_loss: 2.343590021133423
:::NVLOGv0.2.3 resnet 1646176253.976323843 (training_hooks.py:105) learning_rate: 6.103515914901436e-08
:::NVLOGv0.2.3 resnet 1646176256.492153883 (training_hooks.py:112) epoch: 49
:::NVLOGv0.2.3 resnet 1646176256.495479345 (training_hooks.py:113) final_cross_entropy: 1.8775297403335571
:::NVLOGv0.2.3 resnet 1646176256.498734951 (training_hooks.py:114) final_l2_loss: 0.3724885582923889
:::NVLOGv0.2.3 resnet 1646176256.502001047 (training_hooks.py:115) final_total_loss: 2.250018358230591
:::NVLOGv0.2.3 resnet 1646176256.505250216 (training_hooks.py:116) final_learning_rate: 0.0
:::NVLOGv0.2.3 resnet 1646176265.872462511 (runner.py:488) Ending Model Training ...
:::NVLOGv0.2.3 resnet 1646176265.874462605 (runner.py:221) XLA is activated - Experimental Feature
:::NVLOGv0.2.3 resnet 1646176266.303135633 (runner.py:555) Starting Model Evaluation...
:::NVLOGv0.2.3 resnet 1646176266.304392338 (runner.py:556) Evaluation Epochs: 1.0
:::NVLOGv0.2.3 resnet 1646176266.305591822 (runner.py:557) Evaluation Steps: 195.0
:::NVLOGv0.2.3 resnet 1646176266.306783438 (runner.py:558) Decay Steps: 195.0
:::NVLOGv0.2.3 resnet 1646176266.307974577 (runner.py:559) Global Batch Size: 256
2022-03-01 23:11:09.188563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties: 
pciBusID: 0004:04:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.50GiB deviceMemoryBandwidth: 836.37GiB/s
2022-03-01 23:11:09.252778: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2022-03-01 23:11:09.357202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-03-01 23:11:09.391700: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2022-03-01 23:11:09.391726: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2022-03-01 23:11:09.391747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2022-03-01 23:11:09.391764: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2022-03-01 23:11:09.410918: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-03-01 23:11:09.413685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2022-03-01 23:11:09.413734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1099] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-01 23:11:09.413743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105]      0 
2022-03-01 23:11:09.413750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 0:   N 
2022-03-01 23:11:09.419813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30294 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2022-03-01 23:11:11.753567: I tensorflow/core/grappler/optimizers/generic_layout_optimizer.cc:345] Cancel Transpose nodes around Pad: transpose_before=resnet50_v1.5/input_reshape/transpose pad=resnet50_v1.5/conv2d/Pad transpose_after=resnet50_v1.5/conv2d/conv2d/Conv2D-0-TransposeNCHWToNHWC-LayoutOptimizer
:::NVLOGv0.2.3 resnet 1646176309.121007442 (runner.py:610) Top-1 Accuracy: 75.797
:::NVLOGv0.2.3 resnet 1646176309.122350454 (runner.py:611) Top-5 Accuracy: 92.817
:::NVLOGv0.2.3 resnet 1646176309.123546600 (runner.py:630) Ending Model Evaluation ...

---------------
Job output ends
=========================================================
SLURM job: finished date = Tue 1 Mar 23:11:50 GMT 2022
Total run time : 4 Hours 59 Minutes 45 Seconds
=========================================================

Attempting to run this via bede-ddlrun on the RHEL 8 image errored with the following:

No active IB device ports detected
[gpu013.bede.dur.ac.uk:98383] Error: common_pami.c:1087 - ompi_common_pami_init() 0: Unable to create 1 PAMI communication context(s) rc=1
No active IB device ports detected
[gpu013.bede.dur.ac.uk:98384] Error: common_pami.c:1087 - ompi_common_pami_init() 1: Unable to create 1 PAMI communication context(s) rc=1
No active IB device ports detected
[gpu013.bede.dur.ac.uk:98386] Error: common_pami.c:1087 - ompi_common_pami_init() 3: Unable to create 1 PAMI communication context(s) rc=1
No active IB device ports detected
[gpu013.bede.dur.ac.uk:98385] Error: common_pami.c:1087 - ompi_common_pami_init() 2: Unable to create 1 PAMI communication context(s) rc=1
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      gpu013
  Framework: pml
--------------------------------------------------------------------------
[gpu013.bede.dur.ac.uk:98383] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98357] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[gpu013.bede.dur.ac.uk:98357] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[gpu013.bede.dur.ac.uk:98357] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[gpu013.bede.dur.ac.uk:98386] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98384] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98385] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98357] 3 more processes have sent help message help-mca-base.txt / find-available:none found
[gpu013.bede.dur.ac.uk:98357] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

---------------
Job output ends
=========================================================
SLURM job: finished date = Tue 1 Mar 18:22:12 GMT 2022
Total run time : 0 Hours 1 Minutes 29 Seconds
=========================================================

This appears to confirm that ddlrun/bede-ddlrun does not work on the RHEL 8 nodes, though I'll double-check this via Slack later.

Attempting to run the benchmark on RHEL 8 without ddlrun, using a single GPU in a single node resulted in the job being killed for OOM.

---------------
Job output ends
=========================================================
SLURM job: finished date = Tue 1 Mar 22:13:38 GMT 2022
Total run time : 0 Hours 52 Minutes 24 Seconds
=========================================================
slurmstepd: error: Detected 1 oom-kill event(s) in step 356980.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

I.e. it requires more than 1/4 of the node's memory, but otherwise WMLCE TensorFlow does appear to work on RHEL 8, though given that it is well past end of life most users should migrate to Open-CE or upstream TensorFlow (and lose LMS).

The single-GPU run on RHEL 7 also died due to OOM, but several hours further into the run.

To benchmark this correctly, requesting a full node but only making one device visible via CUDA_VISIBLE_DEVICES (or using an inner srun) might be required, as sketched below; but if this is not going to be directly comparable to a WMLCE or RHEL 8 benchmark of the same model it's probably not worth reproducing / benchmarking.
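A minimal sketch of the CUDA_VISIBLE_DEVICES approach; the account, time limit and benchmark invocation are placeholders / assumptions:

#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:4            # request the whole node's GPUs, and with them its memory
#SBATCH --time=10:00:00

source /nobackup/projects/<project>/miniconda/etc/profile.d/conda.sh
conda activate wmlce-env

# expose only one device so the benchmark runs single-GPU while the job
# keeps the full node's memory allocation
export CUDA_VISIBLE_DEVICES=0

python tensorflow-benchmarks/resnet50/main.py \
    --mode=train_and_evaluate \
    --data_dir=<imagenet_tfrecords> \
    --results_dir=<results_dir>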

Finding a more recent / more open benchmark to run might be a better plan.

=========================================================
SLURM job: finished date = Wed 2 Mar 00:49:40 GMT 2022
Total run time : 3 Hours 15 Minutes 36 Seconds
=========================================================
slurmstepd: error: Detected 1 oom-kill event(s) in step 356990.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
loveshack commented 2 years ago

I don't now remember the context, but I guess if it's documented to use open-ce rather than wmlce, that's OK. For what it's worth, there's something about it in the Summit docs (specifically about RHEL8, I think).