Closed jenniw27 closed 3 months ago
Hi Jennifer, thank you for your interest in Tile2Net. How many GPUs are you using? Could you run the following and report the output?
import torch
print(torch.cuda.device_count())
Thanks for reaching out! Here is the output below.
import torch
print(torch.cuda.device_count())
2
Thank you for the info and for posting the issue. As I suspected, the problem comes from having more than one GPU. With our default settings, that starts a distributed session, but environment variables such as RANK were not set because the process was not launched with torch.distributed.launch.
This is now fixed in #56.
Please do a git pull in your terminal to update Tile2Net with the latest fix. After updating, please re-run and let me know if you have any other issues.
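For anyone hitting the same error, here is a minimal sketch of the kind of guard that avoids it. It is an illustration only and not necessarily what #56 does. The `env://` rendezvous in torch.distributed reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment, so the sketch only starts a distributed session when a launcher has set all four. The function names `should_init_distributed` and `maybe_init_distributed` are hypothetical, and the `nccl` backend is an assumption.

```python
import os

# Variables the env:// rendezvous expects a launcher to have set.
REQUIRED_ENV = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def should_init_distributed(device_count, environ=None):
    """Return True only if several GPUs exist AND a launcher set the env vars."""
    environ = os.environ if environ is None else environ
    if device_count < 2:
        return False
    return all(name in environ for name in REQUIRED_ENV)


def maybe_init_distributed(device_count, environ=None):
    """Hypothetical helper: initialize torch.distributed only when it is safe."""
    if not should_init_distributed(device_count, environ):
        return False  # fall back to single-process inference
    # Imported lazily so the check above works even without torch installed.
    import torch.distributed as dist

    dist.init_process_group(backend="nccl", init_method="env://")  # backend is an assumption
    return True
```

With two GPUs and no launcher, `maybe_init_distributed(2)` returns `False` instead of raising the RANK error, so the caller can continue on a single device.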
Thanks for the update. Unfortunately, after a git pull, I am still facing the same RANK error with the inference.
Here is the new error message again. Let me know if there is any additional information I can provide to support the resolution of this issue.
INFO Running ['python', '-m', 'tile2net', 'inference', '--city_info', 'example_dir\\boston common\\tiles\\boston common_256_info.json', '--interactive', '--dump_percent', '0']
ERROR Command ['python', '-m', 'tile2net', 'inference', '--city_info', 'example_dir\\boston common\\tiles\\boston common_256_info.json', '--interactive', '--dump_percent', '0'] returned non-zero exit status 1.
Stdout:
Stderr: INFO Inferencing. Segmentation results will not be saved.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "D:\Jennifer\tile2net\src\tile2net\__main__.py", line 6, in <module>
argh.dispatch_commands([
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 358, in dispatch_commands
dispatch(parser, *args, **kwargs)
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 183, in dispatch
for line in lines:
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 294, in _execute_command
for line in result:
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 247, in _call
result = function(namespace_obj)
^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Jennifer\tile2net\src\tile2net\namespace.py", line 660, in wrapper
return func(namespace, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Jennifer\tile2net\src\tile2net\tileseg\inference\__init__.py", line 739, in inference
inference = Inference(args)
^^^^^^^^^^^^^^^
File "D:\Jennifer\tile2net\src\tile2net\tileseg\inference\__init__.py", line 353, in __init__
dist.init_process_group(backend='nccl', init_method='env://')
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 900, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\rendezvous.py", line 235, in _env_rendezvous_handler
rank = int(_get_env_or_raise("RANK"))
^^^^^^^^^^^^^^^^^^^^^^^^^
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\rendezvous.py", line 220, in _get_env_or_raise
raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
To add onto my previous thread, I also tried resolving using the torch.distributed.launch command and this is the output that I received. I am hoping this may provide a better indication of what the issue may be.
(testenv) PS D:\Jennifer\tile2net\src> python -m tile2net generate -l "42.35555189953313, -71.07168915322092, 42.35364837213307, -71.06437423368418" -o "D:\Jennifer\results\Example" -n "example" | python -m torch.distributed.launch tile2net inference
INFO Geocoding [42.3536483721, -71.0716891532, 42.3555518995, -71.0643742337], this may take awhile...
INFO Using Massachusetts as the source at location=[42.3536483721, -71.0716891532, 42.3555518995, -71.0643742337]
INFO Using base_tilesize=256 from source
INFO Stitching 12 tiles...
Downloading 0 files... : 0%| | 0/96 [00:00<?, ?it/s]
INFO All 96 tiles are on disk.
INFO All tiles already stitched.
INFO Dumping to D:\Jennifer\results\Example\example\tiles\example_256_info.json
NOTE: Redirects are currently not supported in Windows or MacOs.
d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
usage: tile2net [-h] {generate,inference} ...
tile2net: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 23600) of binary: d:\Jennifer\anaconda\envs\testenv\python.exe
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launch.py", line 196, in <module>
main()
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launch.py", line 192, in main
launch(args)
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launch.py", line 177, in launch
run(args)
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tile2net FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-19_22:21:26
host : DESKTOP-IJNR0H2.ad.unsw.edu.au
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 23600)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
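The `unrecognized arguments: --local-rank=0` line in the log above appears because torch.distributed.launch appends a `--local-rank` argument to the child command, while the FutureWarning recommends reading `LOCAL_RANK` from the environment instead. A minimal sketch of a parser that tolerates both is below. The helper name `resolve_local_rank` is hypothetical and is not part of Tile2Net.

```python
import argparse
import os


def resolve_local_rank(argv, environ=None):
    """Return (local_rank, remaining_argv).

    Accepts the --local-rank / --local_rank flag added by
    torch.distributed.launch, and otherwise falls back to the
    LOCAL_RANK environment variable that torchrun sets.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local-rank", "--local_rank", dest="local_rank",
                        type=int, default=None)
    # parse_known_args leaves every other argument untouched in `remaining`.
    known, remaining = parser.parse_known_args(argv)
    if known.local_rank is not None:
        return known.local_rank, remaining
    return int(environ.get("LOCAL_RANK", 0)), remaining
```

The remaining arguments can then be handed to the real command-line parser, which never sees the launcher-specific flag.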
@jenniw27 Thank you for reporting on the issue!
Please do another git pull and re-run. Let me know if the problem persists.
Thanks @Mary-h86. The RANK issue seems to be resolved; however, I am now facing a new issue. Many thanks for your help so far.
INFO Running ['python', '-m', 'tile2net', 'inference', '--city_info', 'example_dir\\boston common\\tiles\\boston common_256_info.json', '--interactive', '--dump_percent', '0']
ERROR Command ['python', '-m', 'tile2net', 'inference', '--city_info', 'example_dir\\boston common\\tiles\\boston common_256_info.json', '--interactive', '--dump_percent', '0'] returned non-zero exit status 1.
Stdout:
Stderr: INFO Inferencing. Segmentation results will not be saved.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "D:\Jennifer\tile2net\src\tile2net\__main__.py", line 6, in <module>
argh.dispatch_commands([
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 358, in dispatch_commands
dispatch(parser, *args, **kwargs)
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 183, in dispatch
for line in lines:
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 294, in _execute_command
for line in result:
File "d:\Jennifer\anaconda\envs\testenv\Lib\site-packages\argh\dispatching.py", line 247, in _call
result = function(namespace_obj)
^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Jennifer\tile2net\src\tile2net\namespace.py", line 660, in wrapper
return func(namespace, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Jennifer\tile2net\src\tile2net\tileseg\inference\__init__.py", line 745, in inference
inference = Inference(args)
^^^^^^^^^^^^^^^
File "D:\Jennifer\tile2net\src\tile2net\tileseg\inference\__init__.py", line 350, in __init__
if args.eval == 'test':
^^^^^^^^^
AttributeError: 'Namespace' object has no attribute 'eval'
Thank you for bringing this up! The issue is now fixed in #57.
Please do a git pull and re-run, and let me know if you encounter any other problems!
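As general background on the `AttributeError: 'Namespace' object has no attribute 'eval'` above: an argparse `Namespace` only has the attributes that were actually parsed or assigned, so reading an optional one directly can fail. A minimal sketch of a defensive lookup follows. It is a generic pattern and not necessarily how #57 resolves it, and the helper name `get_eval_mode` is hypothetical.

```python
from argparse import Namespace


def get_eval_mode(args, default=None):
    """Read args.eval if present, otherwise return a default instead of raising."""
    return getattr(args, "eval", default)


# A namespace built without an `eval` attribute, as in the traceback.
args = Namespace(city_info="example_info.json", interactive=True)

if get_eval_mode(args) == "test":
    print("running test-mode evaluation")
else:
    print("eval not set; skipping the test-mode branch")
```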
@jenniw27 Let me know if the issue still persists!
I am closing this issue, feel free to re-open if you have any further questions.
I have set up a tile2net project as detailed in the Installation instructions. After running inference.ipynb, the command raster.inference() generates the following error. I am running it in a local environment with Python 3.11.8, CUDA 11.7 and PyTorch 2.0.1.
Any support in resolving this issue would be much appreciated.