ROCm / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis
https://rocm.docs.amd.com/projects/omnitrace/en/latest/
MIT License

Omnitrace hangs and prints errors while running STEMDL/stfc with more than 1 GPU #284

Closed: daviteix closed this issue 1 year ago

daviteix commented 1 year ago

Here are the steps to reproduce:

  1. git clone https://github.com/mlcommons/science.git
  2. download data: aws s3 --no-sign-request --endpoint-url https://s3.echo.stfc.ac.uk/ sync s3://sciml-datasets/ms/stemdl_ds1a ./
  3. conda activate stemdl
  4. pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
  5. pip3 install pytorch-lightning scikit-learn
  6. git clone https://github.com/mlperf/logging.git mlperf-logging
  7. pip3 install -e mlperf-logging
  8. cd STEMDL/science/benchmarks/stemdl/stfc
  9. change gpu: 1 to gpu: 4 in stemdlConfig.yaml
  10. omnitrace-python-3.8 -- ./stemdl_classification.py --config ./stemdlConfig.yaml

It will print the following and then hang:

##### omnitrace :: executing 'python3.8 -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml'... #####

[omnitrace]> profiling: ['/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py', '--config', './stemdlConfig.yaml']
[omnitrace][569913][omnitrace_init_tooling] Instrumentation mode: Trace

      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    omnitrace v1.10.0 (rev: 9de3a6b0b4243bf8ec10164babdd99f64dbc65f2, tag: v1.10.0, compiler: GNU v8.5.0, rocm: v5.4.x)
[omnitrace][569913][2047] No signals to block...
[omnitrace][569913][2046] No signals to block...
[omnitrace][569913][2045] No signals to block...
[omnitrace][569913][2044] No signals to block...
[966.269]       perfetto.cc:58656 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
[omnitrace][569913] fork() called on PID 569913 (rank: 0), TID 0
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/connector.py:555: UserWarning: 16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
  rank_zero_warn(
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.8 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/st ...
  rank_zero_warn(
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
:::MLLOG {"namespace": "", "time_ms": 1686768962518, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "STEMDL", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1686768966794, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "STFC", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 146}}
:::MLLOG {"namespace": "", "time_ms": 1686768966876, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "SciML", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 147}}
:::MLLOG {"namespace": "", "time_ms": 1686768966956, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "research", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 148}}
:::MLLOG {"namespace": "", "time_ms": 1686768967037, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "AMD MI250", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 149}}
:::MLLOG {"namespace": "", "time_ms": 1686768967119, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 150}}
:::MLLOG {"namespace": "", "time_ms": 1686768967199, "event_type": "POINT_IN_TIME", "key": "number_of_ranks", "value": 4, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 153}}
:::MLLOG {"namespace": "", "time_ms": 1686768967280, "event_type": "POINT_IN_TIME", "key": "number_of_nodes", "value": 1, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 154}}
:::MLLOG {"namespace": "", "time_ms": 1686768967361, "event_type": "POINT_IN_TIME", "key": "accelerators_per_node", "value": 8, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 155}}
:::MLLOG {"namespace": "", "time_ms": 1686768967441, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 156}}
:::MLLOG {"namespace": "", "time_ms": 1686768967521, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start:Loading datasets", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 157}}
:::MLLOG {"namespace": "", "time_ms": 1686768991051, "event_type": "POINT_IN_TIME", "key": "eval_stop", "value": "Stop: Loading datasets", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 175}}
:::MLLOG {"namespace": "", "time_ms": 1686768991135, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start: Loading model", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 176}}
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
:::MLLOG {"namespace": "", "time_ms": 1686768991708, "event_type": "POINT_IN_TIME", "key": "eval_stop", "value": "Stop: Loading model", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 181}}
:::MLLOG {"namespace": "", "time_ms": 1686768991791, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start: Training", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 189}}
[omnitrace][569913] fork() called on PID 569913 (rank: 0), TID 0
[omnitrace][569913] fork() called on PID 569913 (rank: 0), TID 0
Traceback (most recent call last):
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 397, in <module>
    main()
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 290, in main
    raise RuntimeError(
RuntimeError: Could not determine input script. Use '--' before the script and its arguments to ensure correct parsing.
E.g. python -m omnitrace -- ./script.py
[omnitrace][569913] fork() called on PID 569913 (rank: 0), TID 0
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.8 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/st ...
  rank_zero_warn(
Traceback (most recent call last):
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 397, in <module>
    main()
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 290, in main
    raise RuntimeError(
RuntimeError: Could not determine input script. Use '--' before the script and its arguments to ensure correct parsing.
E.g. python -m omnitrace -- ./script.py
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Traceback (most recent call last):
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 397, in <module>
    main()
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 290, in main
    raise RuntimeError(
RuntimeError: Could not determine input script. Use '--' before the script and its arguments to ensure correct parsing.
E.g. python -m omnitrace -- ./script.py

With gpu: 1, it works fine.

jrmadsen commented 1 year ago

conda activate stemdl

where does this conda env come from?

Do you know how the PyTorch execution model changes when multiple GPUs are used? Does it fork() for each additional GPU? Because I'm seeing 3 fork() calls, which suggests that might be the root cause of the issue.
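For reference, here is a minimal sketch of the fork-per-GPU execution model being asked about. It is an illustration under that assumption, not Lightning's actual launcher code:

```python
# A sketch of a fork-based launcher, assuming (as discussed above) that
# PyTorch forks one worker per additional GPU; this is an illustration,
# not Lightning's actual launcher code.
import torch.multiprocessing as mp

def train_worker(idx, ngpu):
    # A real DDP worker would bind to its device and call
    # torch.distributed.init_process_group() here; this stub only reports.
    print(f"forked worker {idx} of {ngpu - 1} (parent keeps one GPU)")

if __name__ == "__main__":
    ngpu = 4  # gpu: 4 in stemdlConfig.yaml
    # One forked child per GPU beyond the first, which would match the
    # three "fork() called" lines in the log above.
    mp.start_processes(train_worker, args=(ngpu,), nprocs=ngpu - 1,
                       start_method="fork")
```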

daviteix commented 1 year ago

My mistake, it should have been: conda create stemdl. Yes, it uses fork. Is there a workaround?

jrmadsen commented 1 year ago

fork() has caused a number of problems in the past, mostly related to perfetto because of a background thread. You might want to try perfetto with the system backend. You will probably want to increase the flush and write periods to match the session duration in the perfetto config file (see the sample here) because of quirks in how perfetto writes that file and how omnitrace hands it data: once perfetto flushes/writes data, you can't add any time-stamped data that happened before that point, and a fair amount of the data gathered through sampling isn't passed to perfetto until finalization, because we have to map instruction pointers to line info and doing so while sampling adds too much overhead at runtime.
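To illustrate the advice above, here is a sketch that writes such a config; the duration value and the `track_event` data source are assumptions for illustration, not values taken from this thread:

```python
# A sketch (assumed values; not a config taken from this thread) of the
# advice above: align perfetto's flush and file-write periods with the
# session duration so nothing is written out before omnitrace hands over
# its sampling data at finalization.
duration_ms = 600_000  # hypothetical 10-minute session

cfg = f"""
buffers {{
  size_kb: 1024000
  fill_policy: DISCARD
}}
data_sources {{
  config {{
    name: "track_event"
  }}
}}
duration_ms: {duration_ms}
# flush/write only once, at the end, so earlier-timestamped sampling
# data is not cut off:
flush_period_ms: {duration_ms}
file_write_period_ms: {duration_ms}
"""

# written to the file the system-backend perfetto consumer reads via -c
with open("omni-perfetto.cfg", "w") as f:
    f.write(cfg)
```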

daviteix commented 1 year ago

Is there a command example for using omnitrace-python? I have tried the following without success:

    export OMNITRACE_PERFETTO_BACKEND=system
    omnitrace-perfetto-traced --background
    omnitrace-perfetto --out ./omnitrace-perfetto.proto --txt -c ${OMNITRACE_ROOT}/rocm-5.4/share/omnitrace/omnitrace.cfg --background
    omnitrace-python-3.8 -- ./stemdl_classification.py --config ./stemdlConfig.yaml

The option --perfetto-backend=system is not valid for omnitrace-python.

jrmadsen commented 1 year ago

Update: I’ve tracked down the issue. It’s not related to perfetto, but rather the sys.argv passed to omnitrace’s __main__.py upon re-entry after PyTorch forks. I should have a PR merged with the fix by tomorrow afternoon.
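For context, a minimal sketch of the parsing behavior being described; this is an illustration, not omnitrace's actual `__main__.py`:

```python
# A minimal sketch (an illustration, not omnitrace's actual __main__.py)
# of the parsing described above: everything after '--' is taken as the
# target script plus its arguments. A forked re-entry whose argv lacks
# the separator cannot identify a script, producing the RuntimeError in
# the log.
import sys

def split_args(argv):
    if "--" not in argv:
        raise RuntimeError(
            "Could not determine input script. Use '--' before the script "
            "and its arguments to ensure correct parsing."
        )
    idx = argv.index("--")
    return argv[:idx], argv[idx + 1:]  # (tool options, script + its args)

# parent process: the separator is present, parsing succeeds
tool_opts, script_cmd = split_args(
    ["--", "./stemdl_classification.py", "--config", "./stemdlConfig.yaml"])

# forked re-entry: the separator is gone, so the error fires
try:
    split_args(["--config", "./stemdlConfig.yaml"])
except RuntimeError as exc:
    print(exc)
```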

daviteix commented 1 year ago

I still get the error with the new code. The only difference is that I am not using SLURM.

(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$ cd -
/home/dteixeir/OMNITRACE/omnitrace/source/python/omnitrace
(stemdl) [dteixeir@electra019 ~/OMNITRACE/omnitrace/source/python/omnitrace]$ git log -1
commit a85f141afebe2007dde3ebc82c3ff11e50b08bf7 (HEAD -> main, origin/main, origin/HEAD)
Author: Jonathan R. Madsen <jrmadsen@users.noreply.github.com>
Date:   Wed Jun 21 22:30:47 2023 -0500

PyTorch Python fork fix (#291)

* PyTorch Python fork fix

- fixes issue where forking process in PyTorch causes omnitrace/__main__.py to fail due to missing script argument

* Update source/python/omnitrace/__main__.py

Remove debugging "print" LOC

(stemdl) [dteixeir@electra019 ~/OMNITRACE/omnitrace/source/python/omnitrace]$ cd -
/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$ which omnitrace
~/OMNITRACE/omnitrace_install/bin/omnitrace
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$ ps |grep perfetto
1 S dteixeir 2109553 1 0 80 0 - 1126 - 23:23 ? 00:00:00 perfetto --out stemdl.proto --txt -c ./omni-perfetto.cfg --background
0 S dteixeir 2110245 1967519 0 80 0 - 3037 - 23:27 pts/0 00:00:00 grep --color=auto perfetto
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$ export OMNITRACE_PERFETTO_BACKEND=system
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$ ps |grep traced
1 S dteixeir 2104500 1 0 80 0 - 2834 ia32_s 22:45 ? 00:00:10 traced --background
0 S dteixeir 2110356 1967519 0 80 0 - 3037 - 23:28 pts/0 00:00:00 grep --color=auto traced
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$ python -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml
[omnitrace]> profiling: ['/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py', '--config', './stemdlConfig.yaml']
[omnitrace][2110366][omnitrace_init_tooling] Instrumentation mode: Trace

  ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
 /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
|  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
|  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
|  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
 \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

omnitrace v1.10.1 (compiler: GNU v8.5.0, rocm: v5.4.x)

[omnitrace][2110366][510] No signals to block...
[omnitrace][2110366][509] No signals to block...
[omnitrace][2110366][508] No signals to block...
[omnitrace][2110366][507] No signals to block...
[omnitrace][2110366] fork() called on PID 2110366 (rank: 0), TID 0
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/connector.py:555: UserWarning: 16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
  rank_zero_warn(
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemd ...
  rank_zero_warn(
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
:::MLLOG {"namespace": "", "time_ms": 1687469339160, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "STEMDL", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1687469343278, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "STFC", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 146}}
:::MLLOG {"namespace": "", "time_ms": 1687469343364, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "SciML", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 147}}
:::MLLOG {"namespace": "", "time_ms": 1687469343444, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "research", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 148}}
:::MLLOG {"namespace": "", "time_ms": 1687469343739, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "AMD MI250", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 149}}
:::MLLOG {"namespace": "", "time_ms": 1687469343817, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 150}}
:::MLLOG {"namespace": "", "time_ms": 1687469343894, "event_type": "POINT_IN_TIME", "key": "number_of_ranks", "value": 2, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 153}}
:::MLLOG {"namespace": "", "time_ms": 1687469343975, "event_type": "POINT_IN_TIME", "key": "number_of_nodes", "value": 1, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 154}}
:::MLLOG {"namespace": "", "time_ms": 1687469344055, "event_type": "POINT_IN_TIME", "key": "accelerators_per_node", "value": 8, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 155}}
:::MLLOG {"namespace": "", "time_ms": 1687469344132, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 156}}
:::MLLOG {"namespace": "", "time_ms": 1687469344211, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start:Loading datasets", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 157}}
:::MLLOG {"namespace": "", "time_ms": 1687469368432, "event_type": "POINT_IN_TIME", "key": "eval_stop", "value": "Stop: Loading datasets", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 175}}
:::MLLOG {"namespace": "", "time_ms": 1687469368520, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start: Loading model", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 176}}
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
:::MLLOG {"namespace": "", "time_ms": 1687469369069, "event_type": "POINT_IN_TIME", "key": "eval_stop", "value": "Stop: Loading model", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 181}}
:::MLLOG {"namespace": "", "time_ms": 1687469369152, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start: Training", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 189}}
[omnitrace][2110366] fork() called on PID 2110366 (rank: 0), TID 0
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemd ...
  rank_zero_warn(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dteixeir/OMNITRACE/omnitrace_install/lib/python3.8/site-packages/omnitrace/__main__.py", line 404, in <module>
    main(args)
  File "/home/dteixeir/OMNITRACE/omnitrace_install/lib/python3.8/site-packages/omnitrace/__main__.py", line 290, in main
    raise RuntimeError(
RuntimeError: Could not determine input script in '--config ./stemdlConfig.yaml'. Use '--' before the script and its arguments to ensure correct parsing.
E.g. python -m omnitrace -- ./script.py

jrmadsen commented 1 year ago

The only difference is that I am not using SLURM

Ah yeah, I'm running this on Lockhart, and without using SLURM I end up with only 1 CPU available to me (e.g. nproc returns 1) whereas srun nproc returns 128. Given all the threads that are created, I figured using srun was desirable and its absence was maybe just an omission in the instructions. As it turns out, I assumed, incorrectly, that the execution model would be the same either way.

It appears PyTorch will make even more forks when nproc < ngpu, and these forks appear not to retain the variable I stored in #291 to re-patch sys.argv. Storing it in an environment variable in #292 appears to do the trick.
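To illustrate why an environment variable survives where a module-level Python variable did not, here is a sketch; the variable name below is hypothetical, not necessarily what #292 uses:

```python
# A sketch of the env-var approach described above; the variable name is
# hypothetical, not necessarily what #292 uses. os.environ is inherited
# by forked children, so arguments stashed on first entry can be
# restored on re-entry even when '--' is missing from sys.argv.
import os
import sys

_KEY = "OMNITRACE_PYTHON_SCRIPT_ARGS"  # hypothetical name, for illustration

def remember_or_restore_argv():
    if "--" in sys.argv:
        # first entry: stash the script and its arguments for any children
        os.environ[_KEY] = "\n".join(sys.argv[sys.argv.index("--") + 1:])
    elif _KEY in os.environ:
        # forked re-entry: rebuild argv from what the parent stored
        sys.argv[1:] = os.environ[_KEY].split("\n")
    else:
        raise RuntimeError("Could not determine input script...")
```

Joining on newlines rather than spaces keeps arguments that themselves contain spaces intact.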

jrmadsen commented 1 year ago

By the way, if you are also running on Lockhart, I'd highly recommend using srun. PyTorch may try to compensate by forking instead of creating threads, but from viewing top while that code was running, all 4 of the forked processes were sharing the same CPU (i.e. their CPU% was each roughly ~25% instead of ~100%, which is what you would see if they were running on separate CPUs).

daviteix commented 1 year ago

Thanks, #292 fixed the issue.