cloudmesh / cloudmesh-cc

Cloudmesh compute coordinator to execute compute-intensive workflows on remote resources
https://cloudmesh.github.io/cloudmesh-cc/

develop workflow that runs all mnist programs #28

Closed laszewsk closed 1 year ago

j-miskill commented 2 years ago

This has been partially finished in the code directory, here:

https://github.com/cloudmesh/cloudmesh-cc/tree/main/tests/mnist

jpfleischer commented 2 years ago

An MNIST test has now been implemented: https://github.com/cloudmesh/cloudmesh-cc/blob/main/tests/test_070_run_mnist_workflow_exec.py

This test runs this Python file: https://github.com/cybertraining-dsc/reu2022/blob/main/code/deeplearning/mnist/run_all_rivanna.py

...which executes the mlp_mnist and the mnist_autoencoder Jupyter notebooks:

https://github.com/cybertraining-dsc/reu2022/blob/main/code/deeplearning/mnist/mlp_mnist.ipynb
https://github.com/cybertraining-dsc/reu2022/blob/main/code/deeplearning/mnist/mnist_autoencoder.ipynb

This issue needs a review to see if this has successfully completed the objective, or if there is more to be done.

jpfleischer commented 1 year ago
+---------------------------------+----------+-----------+-----------+---------------------+--------------------------------------+-------+---------+--------+-------+-------------------------------------+
| Name                            | Status   |      Time |       Sum | Start               | tag                                  | msg   | Node    | User   | OS    | Version                             |
|---------------------------------+----------+-----------+-----------+---------------------+--------------------------------------+-------+---------+--------+-------+-------------------------------------|
| v100-total                      | ok       | 16451.4   | 16451.4   | 2022-09-23 18:56:43 | rivanna-IntelXeonE5-2630-v100        |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mlp_mnist                       | ok       |     0.014 |     0.014 | 2022-09-23 23:30:54 | rivanna-IntelXeonE5-2630-v100        |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_autoencoder               | ok       |     0.007 |     0.007 | 2022-09-23 23:30:54 | rivanna-IntelXeonE5-2630-v100        |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_cnn                       | ok       |     0.006 |     0.006 | 2022-09-23 23:30:54 | rivanna-IntelXeonE5-2630-v100        |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_lstm                      | ok       |     0.006 |     0.006 | 2022-09-23 23:30:54 | rivanna-IntelXeonE5-2630-v100        |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_mlp_with_lstm             | ok       |     0.006 |     0.006 | 2022-09-23 23:30:54 | rivanna-IntelXeonE5-2630-v100        |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_rnn                       | ok       |     0.006 |     0.006 | 2022-09-23 23:30:54 | rivanna-IntelXeonE5-2630-v100        |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_with_distributed_training | ok       |     0.006 |     0.006 | 2022-09-23 23:30:54 | rivanna-IntelXeonE5-2630-v100        |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_with_pytorch              | ok       |     0.006 |     0.006 | 2022-09-23 23:30:54 | rivanna-IntelXeonE5-2630-v100        |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
+---------------------------------+----------+-----------+-----------+---------------------+--------------------------------------+-------+---------+--------+-------+-------------------------------------+

It takes a long time, but it works. The command is:

pytest -v -x --capture=no tests/test_070_run_mnist_workflow_exec.py

jpfleischer commented 1 year ago

Updated times: it takes nearly 4 hours. Perhaps we should not include this in the usual pytest suite, or we could reduce the number of epochs?

+---------------------------------+----------+-----------+-----------+---------------------+--------------------------------------+-------+---------+--------+-------+-------------------------------------+
| Name                            | Status   |      Time |       Sum | Start               | tag                                  | msg   | Node    | User   | OS    | Version                             |
|---------------------------------+----------+-----------+-----------+---------------------+--------------------------------------+-------+---------+--------+-------+-------------------------------------|
| v100-total                      | ok       | 13137.3   | 13137.3   | 2022-09-24 00:41:05 | rivanna-XXXXXX-IntelXeonE5-2630-v100 |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mlp_mnist                       | ok       |   178.13  |   178.13  | 2022-09-24 00:41:05 | rivanna-XXXXXX-IntelXeonE5-2630-v100 |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_autoencoder               | ok       |   354.45  |   354.45  | 2022-09-24 00:44:03 | rivanna-XXXXXX-IntelXeonE5-2630-v100 |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_cnn                       | ok       |  1291.1   |  1291.1   | 2022-09-24 00:49:58 | rivanna-XXXXXX-IntelXeonE5-2630-v100 |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_lstm                      | ok       |   892.249 |   892.249 | 2022-09-24 01:11:29 | rivanna-XXXXXX-IntelXeonE5-2630-v100 |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_mlp_with_lstm             | ok       |  4232.95  |  4232.95  | 2022-09-24 01:26:21 | rivanna-XXXXXX-IntelXeonE5-2630-v100 |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_rnn                       | ok       |   297.577 |   297.577 | 2022-09-24 02:36:54 | rivanna-XXXXXX-IntelXeonE5-2630-v100 |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_with_distributed_training | ok       |  4491.24  |  4491.24  | 2022-09-24 02:41:52 | rivanna-XXXXXX-IntelXeonE5-2630-v100 |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
| mnist_with_pytorch              | ok       |  1399.59  |  1399.59  | 2022-09-24 03:56:43 | rivanna-XXXXXX-IntelXeonE5-2630-v100 |       | rivanna | XXXXXX | Linux | #1 SMP Wed Feb 23 16:47:03 UTC 2022 |
+---------------------------------+----------+-----------+-----------+---------------------+--------------------------------------+-------+---------+--------+-------+-------------------------------------+
# csv,timer,status,time,sum,start,tag,msg,uname.node,user,uname.system,platform.version
# csv,v100-total,ok,13137.33,13137.33,2022-09-24 00:41:05,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mlp_mnist,ok,178.13,178.13,2022-09-24 00:41:05,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_autoencoder,ok,354.45,354.45,2022-09-24 00:44:03,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_cnn,ok,1291.096,1291.096,2022-09-24 00:49:58,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_lstm,ok,892.249,892.249,2022-09-24 01:11:29,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_mlp_with_lstm,ok,4232.946,4232.946,2022-09-24 01:26:21,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_rnn,ok,297.577,297.577,2022-09-24 02:36:54,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_with_distributed_training,ok,4491.236,4491.236,2022-09-24 02:41:52,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_with_pytorch,ok,1399.593,1399.593,2022-09-24 03:56:43,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
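
The "# csv," lines above are convenient for checking where the time goes. The following is a minimal sketch, not part of cloudmesh-cc, that parses those lines with Python's standard csv module and compares the per-script timers against v100-total. The function name parse_timers and the LOG variable are made up for illustration; the data is copied from the output above.

```python
import csv
import io

# The "# csv," lines copied verbatim from the benchmark output above.
LOG = """\
# csv,timer,status,time,sum,start,tag,msg,uname.node,user,uname.system,platform.version
# csv,v100-total,ok,13137.33,13137.33,2022-09-24 00:41:05,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mlp_mnist,ok,178.13,178.13,2022-09-24 00:41:05,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_autoencoder,ok,354.45,354.45,2022-09-24 00:44:03,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_cnn,ok,1291.096,1291.096,2022-09-24 00:49:58,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_lstm,ok,892.249,892.249,2022-09-24 01:11:29,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_mlp_with_lstm,ok,4232.946,4232.946,2022-09-24 01:26:21,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_rnn,ok,297.577,297.577,2022-09-24 02:36:54,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_with_distributed_training,ok,4491.236,4491.236,2022-09-24 02:41:52,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
# csv,mnist_with_pytorch,ok,1399.593,1399.593,2022-09-24 03:56:43,rivanna-XXXXXX-IntelXeonE5-2630-v100,None,rivanna,XXXXXX,Linux,#1 SMP Wed Feb 23 16:47:03 UTC 2022
"""

PREFIX = "# csv,"


def parse_timers(text):
    """Return {timer name: seconds} from the '# csv,' lines of a log (hypothetical helper)."""
    rows = [line[len(PREFIX):] for line in text.splitlines() if line.startswith(PREFIX)]
    # The first "# csv," line is the header, so DictReader can use it for field names.
    reader = csv.DictReader(io.StringIO("\n".join(rows)))
    return {row["timer"]: float(row["time"]) for row in reader}


timers = parse_timers(LOG)
total = timers.pop("v100-total")          # wall time of the whole v100 run
parts = sum(timers.values())              # sum of the individual script timers
slowest = max(timers, key=timers.get)     # script with the largest timer

print(f"v100-total: {total / 3600:.2f} h, sum of scripts: {parts / 3600:.2f} h")
print(f"slowest script: {slowest} ({timers[slowest]:.0f} s)")
```

A summary like this shows which scripts dominate the runtime, which helps when deciding where reducing the number of epochs would matter most.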
laszewsk commented 1 year ago

Yes, this should not be in the pytest suite but in an examples directory. The issue is that if you run pytest without parameters, it looks through all directories for test*.py files and executes them. I hope we can name it example*.py so that it can be started explicitly by its example_ name.
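
To illustrate the naming point: pytest's documented default for the python_files option is test_*.py and *_test.py. The sketch below only approximates that filename check with fnmatch (pytest's real collection logic does more); the helper name collected_by_default is made up for illustration.

```python
from fnmatch import fnmatch

# pytest's documented default value of the `python_files` option.
DEFAULT_PATTERNS = ("test_*.py", "*_test.py")


def collected_by_default(filename):
    """Approximate check: would a bare `pytest` run pick up this file name? (hypothetical helper)"""
    return any(fnmatch(filename, pattern) for pattern in DEFAULT_PATTERNS)


for name in (
    "test_070_run_mnist_workflow_exec.py",
    "example_run_mnist_workflow_exec.py",
):
    print(f"{name}: {collected_by_default(name)}")
```

A file renamed to example_*.py is skipped by a bare pytest run, but it should still run when its path is passed to pytest explicitly on the command line.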

jpfleischer commented 1 year ago

pytest -v -x --capture=no examples/example_run_mnist_workflow_exec.py

This test now iterates through the GPUs available on Rivanna:

gpu = ['v100', 'a100', 'k80', 'p100']

It runs all MNIST Python scripts successfully by submitting sbatch jobs with the --gres=gpu parameter.
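
As a rough illustration of that loop (not the actual code in the example file), the sketch below builds one sbatch command line per GPU type using Slurm's --gres=gpu:<type>:<count> form. The script name mnist.slurm and the helper sbatch_command are assumptions for illustration, as is the idea that these GPU names are valid gres types on the cluster. The commands are only printed, not submitted.

```python
GPUS = ["v100", "a100", "k80", "p100"]  # GPU list from the comment above


def sbatch_command(gpu, script, count=1):
    """Build an sbatch argument list requesting `count` GPUs of type `gpu` (hypothetical helper)."""
    return ["sbatch", f"--gres=gpu:{gpu}:{count}", script]


# "mnist.slurm" is a placeholder batch script name, not a file from the repository.
commands = [sbatch_command(gpu, "mnist.slurm") for gpu in GPUS]

for cmd in commands:
    # Print instead of submitting; subprocess.run(cmd, check=True) would submit on a Slurm host.
    print(" ".join(cmd))
```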

laszewsk commented 1 year ago

Please document this and add a section on how to run it and what the output will be.

jpfleischer commented 1 year ago

I believe this is done: https://github.com/cloudmesh/cloudmesh-cc/blob/main/api/source/mnist.md

jpfleischer commented 1 year ago

https://cloudmesh.github.io/cloudmesh-cc/mnist.html