clvrai / skill-chaining

Adversarial Skill Chaining for Long-Horizon Robot Manipulation via Terminal State Regularization (CoRL 2021)
https://clvrai.com/skill-chaining

MPI fails when trainer has `--wandb True` #8

Open feup-jmc opened 2 years ago

feup-jmc commented 2 years ago

Good day,

Given how important wandb is for ablation studies, it would be very helpful to get it running without crashing the script. I understand from #1 that this does not seem to happen on your side; however, the problem is also not caused by MPI and wandb alone.

Running a test script like the following with `mpirun -n 1` works fine:

import json
import os

import wandb

wandb_entity = "my-entity"
wandb_project = "my-project"

# Config keys that should not be sent to wandb.
exclude = ["device"]

# open() does not expand "~" by itself, so expand it explicitly.
log_dir = os.path.expanduser("~/skill-chaining/log/table_lack_0825.gail.p0.123")

# Load the parameters saved by a previous run.
with open(os.path.join(log_dir, "params.json"), "r") as fp:
    cdict = json.load(fp)

wandb.init(
    resume="table_lack_0825.gail.p0.123",
    project=wandb_project,
    config={k: v for k, v in cdict.items() if k not in exclude},
    dir=log_dir,
    entity=wandb_entity,
    notes="",
    mode="online",
)

Running run.py under MPI with wandb enabled, however, crashes the script. It is not a resource issue, nor an error inherent to combining MPI and wandb:

$ mpirun -n 1 python -m run --algo gail --furniture_name table_lack_0825 --demo_path demos/table_lack/Sawyer_table_lack_0825_0 --num_connects 1 --run_prefix p0 --gpu 0 --wandb True --max_global_step 100000000 --wandb_entity my-entity --wandb_project my-project
pybullet build time: Apr 21 2022 20:41:06
[DEBUG] Wandb Init Before
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:228: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  interpolation: int = Image.BILINEAR,
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:295: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  interpolation: int = Image.NEAREST,
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:328: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  interpolation: int = Image.BICUBIC,
wandb: Currently logged in as: my-team (use `wandb login --relogin` to force relogin)
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[digi2:2953274] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Problem at: ~/skill-chaining/method/robot_learning/main.py 133 _make_log_files
Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 995, in init
    run = wi.init()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 648, in init
    backend.cleanup()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/backend/backend.py", line 246, in cleanup
    self.interface.join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 475, in join
    super().join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 653, in join
    _ = self._communicate_shutdown()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _communicate_shutdown
    _ = self._communicate(record)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 226, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 231, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 995, in init
    run = wi.init()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 648, in init
    backend.cleanup()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/backend/backend.py", line 246, in cleanup
    self.interface.join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 475, in join
    super().join()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 653, in join
    _ = self._communicate_shutdown()
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _communicate_shutdown
    _ = self._communicate(record)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 226, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 231, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "~/skill-chaining/run.py", line 44, in <module>
    SkillChainingRun(parser).run()
  File "~/skill-chaining/run.py", line 10, in __init__
    super().__init__(parser)
  File "~/skill-chaining/method/robot_learning/main.py", line 44, in __init__
    self._make_log_files()
  File "~/skill-chaining/method/robot_learning/main.py", line 133, in _make_log_files
    mode="online" if config.wandb else "disabled",
  File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1033, in init
    raise Exception("problem") from error_seen
Exception: problem

Any idea what could be causing this?
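
In case it helps narrow things down, here is a small diagnostic sketch. It rests on an unconfirmed assumption: that the wandb backend process inherits the environment variables `mpirun` exports to each rank, and that these play a part in the `orte_init` failure above. The prefixes `OMPI_` and `PMIX_` are my guess at what Open MPI sets, and both helper names are hypothetical (they are not part of skill-chaining, wandb, or Open MPI). The script only lists those variables and builds a copy of the environment without them; it does not change how wandb is started.

```python
import os

# Prefixes assumed to cover the variables Open MPI's mpirun exports to each
# rank (for example OMPI_COMM_WORLD_RANK). This is an assumption; adjust the
# tuple for your MPI build.
MPI_ENV_PREFIXES = ("OMPI_", "PMIX_")


def mpi_launcher_vars(env=None):
    """Return the entries of `env` that look like MPI launcher variables."""
    env = os.environ if env is None else env
    return {k: v for k, v in env.items() if k.startswith(MPI_ENV_PREFIXES)}


def env_without_mpi(env=None):
    """Return a copy of `env` with the MPI launcher variables removed."""
    env = os.environ if env is None else env
    return {k: v for k, v in env.items() if not k.startswith(MPI_ENV_PREFIXES)}


if __name__ == "__main__":
    # Under `mpirun -n 1 python this_script.py`, this prints whatever launcher
    # variables a child process would inherit from the rank.
    for name in sorted(mpi_launcher_vars()):
        print(name)
```

Comparing the output of this script under `mpirun -n 1` between the standalone test and the run.py environment would at least show whether the two cases start from the same environment.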