Inference.ipynb - Error initialising torch.distributed using env://

jenniw27 commented 4 months ago

I have set up a tile2net project as detailed in the installation instructions. After running inference.ipynb, the command raster.inference() generates the following error. I am running it in a local environment with Python 3.11.8, CUDA 11.7 and PyTorch 2.0.1.

INFO       Running ['python', '-m', 'tile2net', 'inference', '--city_info', 'example_dir\\boston common\\tiles\\boston common_256_info.json', '--interactive', '--dump_percent', '0']
ERROR      Command ['python', '-m', 'tile2net', 'inference', '--city_info', 'example_dir\\boston common\\tiles\\boston common_256_info.json', '--interactive', '--dump_percent', '0'] returned non-zero exit status 1.
Stdout: 
Stderr: INFO       Inferencing. Segmentation results will not be saved.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\Jennifer\tile2net\src\tile2net\__main__.py", line 6, in <module>
    argh.dispatch_commands([
  File "d:\Jennifer\anaconda\envs\localenv\Lib\site-packages\argh\dispatching.py", line 358, in dispatch_commands
    dispatch(parser, *args, **kwargs)
  File "d:\Jennifer\anaconda\envs\localenv\Lib\site-packages\argh\dispatching.py", line 183, in dispatch
    for line in lines:
  File "d:\Jennifer\anaconda\envs\localenv\Lib\site-packages\argh\dispatching.py", line 294, in _execute_command
    for line in result:
  File "d:\Jennifer\anaconda\envs\localenv\Lib\site-packages\argh\dispatching.py", line 247, in _call
    result = function(namespace_obj)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Jennifer\tile2net\src\tile2net\namespace.py", line 660, in wrapper
    return func(namespace, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Jennifer\tile2net\src\tile2net\tileseg\inference\__init__.py", line 767, in inference
    inference = Inference(args)
                ^^^^^^^^^^^^^^^
  File "D:\Jennifer\tile2net\src\tile2net\tileseg\inference\__init__.py", line 381, in __init__
    dist.init_process_group(backend='nccl', init_method='env://')
  File "d:\Jennifer\anaconda\envs\localenv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Jennifer\anaconda\envs\localenv\Lib\site-packages\torch\distributed\rendezvous.py", line 235, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Jennifer\anaconda\envs\localenv\Lib\site-packages\torch\distributed\rendezvous.py", line 220, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

It would be much appreciated if there is any support to resolve this issue.

Mary-h86 commented 4 months ago

Hi Jennifer, thank you for your interest in Tile2Net. How many GPUs are you using? Could you run the following for me and report the output?

import torch
print(torch.cuda.device_count())

jenniw27 commented 4 months ago

Thanks for reaching out! Here is the output below.

import torch
print(torch.cuda.device_count())
2

Mary-h86 commented 4 months ago

Thank you for the info and for posting the issue. As I suspected, the issue was caused by having more than one GPU. In our default setting this starts a distributed session, but the environment variables it needs, such as RANK, were not set because the process was not launched with torch.distributed.launch.

This is now fixed in #56. Please do a git pull in your terminal to update Tile2Net with the latest fix. After updating, please re-run and let me know if you have any other issues.
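
For readers hitting the same traceback: the env:// rendezvous reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment, and a launcher normally sets them. A minimal sketch of a guard that skips distributed setup when they are absent is shown below. The helper names `missing_rendezvous_vars` and `maybe_init_distributed` are hypothetical, and this is an illustration of the idea only, not the change made in #56.

```python
import os

# Environment variables read by the env:// rendezvous.
# Launchers such as torchrun set these for each worker process.
REQUIRED_ENV_VARS = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_rendezvous_vars(env=None):
    """Return the required variables that are not present in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if name not in env]


def maybe_init_distributed(backend="gloo"):
    """Hypothetical helper: start a process group only when a launcher
    has provided the rendezvous variables; otherwise stay single-process.

    Assumption: "gloo" is used here because NCCL is generally not
    available in Windows builds of PyTorch.
    """
    if missing_rendezvous_vars():
        return False  # plain `python -m ...` run: skip distributed setup
    import torch.distributed as dist  # imported lazily, only when needed
    dist.init_process_group(backend=backend, init_method="env://")
    return True
```

With a guard like this, a plain `python -m tile2net inference` run would fall back to single-process execution even when torch.cuda.device_count() reports more than one GPU.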

jenniw27 commented 4 months ago

Thanks for the update. After a git pull, unfortunately, I am still facing the same RANK error with the inference. Here is the new error message again. Let me know if there is any additional information I can provide to support the resolution of this issue.

INFO       Running ['python', '-m', 'tile2net', 'inference', '--city_info', 'example_dir\\boston common\\tiles\\boston common_256_info.json', '--interactive', '--dump_percent', '0']
ERROR      Command ['python', '-m', 'tile2net', 'inference', '--city_info', 'example_dir\\boston common\\tiles\\boston common_256_info.json', '--interactive', '--dump_percent', '0'] returned non-zero exit status 1.
Stdout: 
Stderr: INFO       Inferencing. Segmentation results will not be saved.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\Jennifer\tile2net\src\tile2net\__main__.py", line 6, in <module>
    argh.dispatch_commands([
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 358, in dispatch_commands
    dispatch(parser, *args, **kwargs)
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 183, in dispatch
    for line in lines:
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 294, in _execute_command
    for line in result:
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 247, in _call
    result = function(namespace_obj)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Jennifer\tile2net\src\tile2net\namespace.py", line 660, in wrapper
    return func(namespace, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Jennifer\tile2net\src\tile2net\tileseg\inference\__init__.py", line 739, in inference
    inference = Inference(args)
                ^^^^^^^^^^^^^^^
  File "D:\Jennifer\tile2net\src\tile2net\tileseg\inference\__init__.py", line 353, in __init__
    dist.init_process_group(backend='nccl', init_method='env://')
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\rendezvous.py", line 235, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\rendezvous.py", line 220, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

jenniw27 commented 4 months ago

To add to my previous comment, I also tried resolving this with the torch.distributed.launch command, and this is the output I received. I hope it gives a better indication of what the issue may be.

(testenv) PS D:\Jennifer\tile2net\src> python -m tile2net generate -l "42.35555189953313, -71.07168915322092, 42.35364837213307, -71.06437423368418" -o "D:\Jennifer\results\Example" -n "example" | python -m torch.distributed.launch tile2net inference
INFO       Geocoding [42.3536483721, -71.0716891532, 42.3555518995, -71.0643742337], this may take awhile...
INFO       Using Massachusetts as the source at location=[42.3536483721, -71.0716891532, 42.3555518995, -71.0643742337]
INFO       Using base_tilesize=256 from source
INFO       Stitching 12 tiles...
           Downloading 0 files...                 :   0%|                                                                                                                          | 0/96 [00:00<?, ?it/s]
INFO       All 96 tiles are on disk.
INFO       All tiles already stitched.
INFO       Dumping to D:\Jennifer\results\Example\example\tiles\example_256_info.json
NOTE: Redirects are currently not supported in Windows or MacOs.
d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
usage: tile2net [-h] {generate,inference} ...
tile2net: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 23600) of binary: d:\Jennifer\anaconda\envs\testenv\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launch.py", line 196, in <module>
    main()
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launch.py", line 192, in main
    launch(args)
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launch.py", line 177, in launch
    run(args)
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__    
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tile2net FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-19_22:21:26
  host      : DESKTOP-IJNR0H2.ad.unsw.edu.au
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 23600)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
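
The `unrecognized arguments: --local-rank=0` line in the log above comes from torch.distributed.launch appending a `--local-rank` argument to the launched script, which the tile2net command-line parser does not define. The FutureWarning in the same log suggests reading `LOCAL_RANK` from the environment instead. A minimal sketch of a script-side helper that tolerates both conventions follows; `resolve_local_rank` is a hypothetical name and is not part of tile2net.

```python
import argparse
import os


def resolve_local_rank(argv, env=None):
    """Return (local_rank, remaining_args).

    Accepts the --local-rank / --local_rank flag that
    torch.distributed.launch appends, and otherwise falls back to the
    LOCAL_RANK environment variable (defaulting to 0).
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(
        "--local-rank", "--local_rank",
        dest="local_rank", type=int, default=None,
    )
    # parse_known_args leaves every argument it does not recognise in
    # `remaining`, so the real CLI parser can process them afterwards.
    known, remaining = parser.parse_known_args(argv)
    if known.local_rank is not None:
        return known.local_rank, remaining
    return int(env.get("LOCAL_RANK", 0)), remaining
```

Stripping the flag before the main parser runs keeps the rest of the command line unchanged, so the same entry point works with or without a launcher.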

Mary-h86 commented 4 months ago

@jenniw27 Thank you for reporting on the issue! Please do another git pull and re-run. Let me know if the problem persists.

jenniw27 commented 4 months ago

Thanks @Mary-h86. The RANK issue seems to be resolved; however, I am now facing a new issue. Many thanks for your help so far.

INFO       Running ['python', '-m', 'tile2net', 'inference', '--city_info', 'example_dir\\boston common\\tiles\\boston common_256_info.json', '--interactive', '--dump_percent', '0']
ERROR      Command ['python', '-m', 'tile2net', 'inference', '--city_info', 'example_dir\\boston common\\tiles\\boston common_256_info.json', '--interactive', '--dump_percent', '0'] returned non-zero exit status 1.
Stdout: 
Stderr: INFO       Inferencing. Segmentation results will not be saved.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\Jennifer\tile2net\src\tile2net\__main__.py", line 6, in <module>
    argh.dispatch_commands([
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 358, in dispatch_commands
    dispatch(parser, *args, **kwargs)
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 183, in dispatch
    for line in lines:
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 294, in _execute_command
    for line in result:
  File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 247, in _call
    result = function(namespace_obj)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Jennifer\tile2net\src\tile2net\namespace.py", line 660, in wrapper
    return func(namespace, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Jennifer\tile2net\src\tile2net\tileseg\inference\__init__.py", line 745, in inference
    inference = Inference(args)
                ^^^^^^^^^^^^^^^
  File "D:\Jennifer\tile2net\src\tile2net\tileseg\inference\__init__.py", line 350, in __init__
    if args.eval == 'test':
       ^^^^^^^^^
AttributeError: 'Namespace' object has no attribute 'eval'

Mary-h86 commented 3 months ago

Thank you for bringing this up! The issue is now fixed in #57. Please do a git pull and re-run and let me know if you encounter any other problems!
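
As background on the `AttributeError` above: an `argparse.Namespace` only carries the attributes that were set on it, so reading `args.eval` fails when no `eval` option was ever defined or assigned. A minimal sketch of a defensive read is shown below. The helper `get_eval_mode` is hypothetical and illustrates the failure mode only; it is not the fix applied in #57.

```python
from argparse import Namespace


def get_eval_mode(args, default=None):
    """Read the optional 'eval' attribute without raising AttributeError."""
    return getattr(args, "eval", default)


# A namespace built without an 'eval' option, as in the traceback.
args_without_eval = Namespace(city_info="info.json")
# A namespace where the option was provided.
args_with_eval = Namespace(city_info="info.json", eval="test")

print(get_eval_mode(args_without_eval))  # falls back to the default
print(get_eval_mode(args_with_eval))
```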

Mary-h86 commented 3 months ago

@jenniw27 Let me know if the issue still persists!

Mary-h86 commented 3 months ago

I am closing this issue. Feel free to re-open it if you have any further questions.

VIDA-NYU / tile2net

Inference.ipynb - Error initialising torch.distributed using env:// #53