N8-CIR-Bede / documentation

Documentation for the N8CIR Bede Tier 2 HPC facility
https://bede-documentation.readthedocs.io/en/latest/

confusion with wmlce #63

Closed loveshack closed 2 years ago

loveshack commented 3 years ago

I had a user who was confused by "Powerai and wmlce" saying "Possibly Out of Date" and going to the IBM site. I guess it should say that's superseded by opence, which dropped the large model support. There could be a pointer to the LM patches, and the discussions about (not) merging them, in case someone is motivated to update them.

markdturner commented 3 years ago

We're waiting for the upgrade to RHEL 8 before making these changes.

ptheywood commented 2 years ago

I'd started working on this in #67, but to avoid blocking that being merged I'll defer figuring out the exact state of opence and wmlce until later; for now I'm just moving the existing (potentially out of date) wmlce docs to their new location.


My current understanding is that WMLCE (or PowerAI, its other name) 1.7 was the final release, from 2020-02-21. It only officially supports RHEL 7.6/7.7 with CUDA driver 440 on Power9 hosts. I do not know whether it works with RHEL 8.

It included / supported TensorFlow 2.1, PyTorch 1.3.1, and Horovod 0.19 amongst others (i.e. more recent versions do not support any IBM-specific features, unless upstreamed).

TensorFlow LMS could be enabled by tf.config.experimental.set_lms_enabled(True) in that version, but as far as I could tell when I last looked, the LMS patches were never upstreamed.
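
As a rough illustration of that call, here is a minimal sketch that only enables LMS when the installed TensorFlow build actually exposes the function (which, as far as I can tell, means a WMLCE build). The helper name `enable_lms_if_available` is my own and not part of any library; it takes the imported module as an argument so it can be tried without TensorFlow installed.

```python
from types import SimpleNamespace


def enable_lms_if_available(tf_module):
    """Enable TensorFlow Large Model Support if this build provides it.

    Returns True if set_lms_enabled(True) was called, False otherwise.
    `tf_module` is the imported tensorflow module (or a stand-in object).
    """
    config = getattr(tf_module, "config", None)
    experimental = getattr(config, "experimental", None)
    set_lms = getattr(experimental, "set_lms_enabled", None)
    if not callable(set_lms):
        # No LMS hook in this build (presumably any build without the LMS patches).
        return False
    set_lms(True)
    return True


# Stand-in for a build without the LMS patches (hypothetical, for illustration).
plain_tf = SimpleNamespace(config=SimpleNamespace(experimental=SimpleNamespace()))
print(enable_lms_if_available(plain_tf))
```

With a real install this would be used as `import tensorflow as tf` followed by `enable_lms_if_available(tf)`, so the same script can run on both WMLCE and non-WMLCE environments.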

Open-CE (An open cognitive environment) is a non-IBM set of conda packages designed to work together, and be easily distributed by a single conda channel.

https://github.com/open-ce https://github.com/open-ce/open-ce

It supports multiple CPU architectures, including x86 and Power. OSU provide a hosted x86/power conda channel, while MIT host a power channel.

https://ftp.osuosl.org/pub/open-ce/current/ https://opence.mit.edu/

Open-CE requires conda >= 3.8.3, and supports Python 3.7 to 3.9 and CUDA 10.2, 11.0 and 11.2 (as of when I originally looked into this; it may have changed since).
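
To make the channel and version constraints concrete, here is a small sketch that builds (but does not run) a `conda create` command against the OSU channel. The environment name `opence-test` and the package name `tensorflow` are assumptions for illustration, and the supported Python range is just the one quoted above, so it may be out of date.

```python
import shlex

# Channel URL quoted above; environment and package names below are illustrative.
OSU_CHANNEL = "https://ftp.osuosl.org/pub/open-ce/current/"
SUPPORTED_PYTHON = [(3, 7), (3, 8), (3, 9)]  # range quoted above, may be outdated


def conda_create_command(env_name, python_version, packages, channel=OSU_CHANNEL):
    """Build (but do not run) a `conda create` command for an Open-CE environment."""
    if python_version not in SUPPORTED_PYTHON:
        raise ValueError(
            f"Python {python_version} is outside the quoted 3.7 to 3.9 range"
        )
    major, minor = python_version
    return [
        "conda", "create", "--yes",
        "--name", env_name,
        "--channel", channel,
        f"python={major}.{minor}",
        *packages,
    ]


cmd = conda_create_command("opence-test", (3, 8), ["tensorflow"])
print(shlex.join(cmd))
```

Returning the argument list rather than calling conda directly keeps the sketch safe to run anywhere; passing the list to `subprocess.run` would actually create the environment.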

Each Open-CE release supports specific versions of TensorFlow, PyTorch, Horovod, etc.
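
For example, the two releases I have version numbers for (listed in my notes further down) can be written as a lookup table. This is only a sketch of how the docs could present the mapping; the function name `releases_providing` is hypothetical and other Open-CE releases are not covered.

```python
# Versions copied from the notes in this comment; other releases are not listed.
OPENCE_RELEASES = {
    "1.2.2": {"tensorflow": "2.4.2", "pytorch": "1.7.1", "horovod": "0.21.0"},
    "1.0.0": {"tensorflow": "2.3.1", "pytorch": "1.6.0", "horovod": "0.19.5"},
}


def releases_providing(package, version):
    """Return the Open-CE releases in the table that ship package==version."""
    return sorted(
        release
        for release, versions in OPENCE_RELEASES.items()
        if versions.get(package) == version
    )


print(releases_providing("pytorch", "1.7.1"))
```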

In general, LMS doesn't look like it is supported outside of wmlce. ddlrun (and therefore bede-ddlrun) doesn't appear to be supported either. It might be nice to run some benchmarks with and without ddlrun before the migration away from RHEL 7 progresses, to see how much of an impact losing ddlrun might have.
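
A minimal sketch of how such a comparison could be timed is below. The two commands are placeholders so that the sketch runs anywhere; the real training launches with and without ddlrun are not shown because I haven't checked their exact invocations. The helper names are my own.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run `cmd` `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        best = min(best, time.perf_counter() - start)
    return best


def relative_slowdown(baseline_seconds, candidate_seconds):
    """Fractional slowdown of candidate vs baseline (0.25 means 25% slower)."""
    return candidate_seconds / baseline_seconds - 1.0


# Placeholder commands (assumption): replace with the real benchmark launches.
with_ddlrun = [sys.executable, "-c", "pass"]
without_ddlrun = [sys.executable, "-c", "pass"]

t_with = time_command(with_ddlrun, repeats=1)
t_without = time_command(without_ddlrun, repeats=1)
print(f"slowdown without ddlrun: {relative_slowdown(t_with, t_without):+.1%}")
```

Taking the best of several repeats reduces noise from other jobs on the node, although for a multi-node benchmark the throughput reported by the training script itself is probably a better metric than wall-clock time.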


For changes to the docs post #67, I'd lean towards:


Summit's documentation suggests using jsrun in place of the deprecated ddlrun. jsrun is the IBM scheduler command, so this would map to srun/sbatch on Bede (with appropriate flags?).

https://docs.olcf.ornl.gov/software/analytics/ibm-wml-ce.html#running-distributed-deep-learning-jobs
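
As a sketch of what the srun/sbatch equivalent might look like, the helper below renders a Slurm batch script that starts one task per GPU. The `#SBATCH` options used are standard Slurm ones, but the node count, GPUs per node, time limit and `python train.py` command are all assumptions for illustration, and none of this has been checked against Bede's actual Slurm configuration.

```python
def sbatch_script(nodes, gpus_per_node, command, time_limit="01:00:00"):
    """Render a Slurm batch script that launches one task per GPU via srun."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        # srun inherits the allocation above and starts nodes * gpus_per_node tasks.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


script = sbatch_script(nodes=2, gpus_per_node=4, command="python train.py")
print(script)
```

Site-specific options (account, partition and so on) would need adding before the generated script could be submitted with sbatch.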


For my reference in the future, my WIP comments about this were as follows:


.. WMLCE /PowerAI 1.7 is the final release, from 2020-02-21. Archived on 2020-11-10. 
.. https://www.ibm.com/support/pages/get-started-ibm-wml-ce
.. Only supported RHEL 7.6 and 7.7, with driver 440.
.. TF 2.1, PyTorch 1.3.1, Horovod 0.19, TFLMS (via tf.config.experimental.set_lms_enabled(True))

.. Open-CE (Open Cognitive Environment) replaces wmlce. 
.. https://github.com/open-ce
.. https://github.com/open-ce/open-ce
.. Supports Power/x86. Python 3.7 to 3.9. CUDA 10.2, 11.0, 11.2.
.. Requires conda >= 3.8.3
.. Oregon State hosts pre-built packages for power and x86 https://ftp.osuosl.org/pub/open-ce/current/
.. MIT hosts pre-built OpenCE https://opence.mit.edu/
.. OpenCE 1.2.2 TF 2.4.2, pytorch 1.7.1, horovod 0.21.0, 
.. OpenCE 1.0.0 has TF 2.3.1 , pytorch 1.6.0, horovod 0.19.5

.. Docs plan:
.. Main section will be OpenCE. Blurb stating formerly WMLCE, but no longer supported, and will no longer be available after the RHEL 8 upgrade.
.. List the missing features? 
.. * LMS doesn't appear to have been upstreamed for tf or pytorch.
.. * ddlrun/bede-ddlrun - These are probably not supported either.  
.. Update the tf/torch docs to include this?
.. It may be worth benchmarking resnet50 again with and without ddlrun?

.. Satori docs may provide additional context https://mit-satori.github.io/satori-ai-frameworks.html

ptheywood commented 2 years ago

PR #102 is now ready for review, which documents Open-CE and adds a number of updates to the WMLCE section to clearly show it is deprecated / not supported and will not (fully) work on RHEL 8.

@loveshack I've requested your review to see if you feel it has addressed the concerns you raised, but no pressure to provide a review.