ARISE-Initiative / robomimic

robomimic: A Modular Framework for Robot Learning from Demonstration
MIT License

Video cannot be opened #179

Closed: noooob-coder closed this issue 3 months ago

noooob-coder commented 3 months ago

When I run python robomimic/scripts/train.py --config robomimic/exps/templates/bc.json --dataset datasets/lift/ph/low_dim_v141.hdf5 --debug, the following error occurs and the program cannot run properly. When I remove --debug, training runs normally, but the video in robomimic/bc_trained_models/test/20240723101813/videos cannot be opened.

Exception ignored in: <function MjRenderContext.__del__ at 0x7f5dbe1c1dc0>
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/robomimic_venv/lib/python3.8/site-packages/robosuite/utils/binding_utils.py", line 199, in __del__
    self.gl_ctx.free()
  File "/home/user/anaconda3/envs/robomimic_venv/lib/python3.8/site-packages/robosuite/renderers/context/egl_context.py", line 149, in free
    EGL.eglMakeCurrent(EGL_DISPLAY, EGL.EGL_NO_SURFACE, EGL.EGL_NO_SURFACE, EGL.EGL_NO_CONTEXT)
  File "src/errorchecker.pyx", line 58, in OpenGL_accelerate.errorchecker._ErrorChecker.glCheckError
OpenGL.raw.EGL._errors.EGLError: EGLError(
    err = EGL_NOT_INITIALIZED,
    baseOperation = eglMakeCurrent,
    cArguments = (
        <OpenGL._opaque.EGLDisplay_pointer object at 0x7f5dbb869f40>,
        <OpenGL._opaque.EGLSurface_pointer object at 0x7f5dbe544bc0>,
        <OpenGL._opaque.EGLSurface_pointer object at 0x7f5dbe544bc0>,
        <OpenGL._opaque.EGLContext_pointer object at 0x7f5dbe544340>,
    ),
    result = 0
)
Exception ignored in: <function EGLGLContext.__del__ at 0x7f5dbe1c1c10>
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/robomimic_venv/lib/python3.8/site-packages/robosuite/renderers/context/egl_context.py", line 155, in __del__
    self.free()
  File "/home/user/anaconda3/envs/robomimic_venv/lib/python3.8/site-packages/robosuite/renderers/context/egl_context.py", line 149, in free
    EGL.eglMakeCurrent(EGL_DISPLAY, EGL.EGL_NO_SURFACE, EGL.EGL_NO_SURFACE, EGL.EGL_NO_CONTEXT)
  File "src/errorchecker.pyx", line 58, in OpenGL_accelerate.errorchecker._ErrorChecker.glCheckError
OpenGL.raw.EGL._errors.EGLError: EGLError(
    err = EGL_NOT_INITIALIZED,
    baseOperation = eglMakeCurrent,
    cArguments = (
        <OpenGL._opaque.EGLDisplay_pointer object at 0x7f5dbb869f40>,
        <OpenGL._opaque.EGLSurface_pointer object at 0x7f5dbe544bc0>,
        <OpenGL._opaque.EGLSurface_pointer object at 0x7f5dbe544bc0>,
        <OpenGL._opaque.EGLContext_pointer object at 0x7f5dbe544340>,
    ),
    result = 0
)
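
For context, these messages come from GL-context destructors firing at interpreter shutdown, after the EGL display has already been torn down. On a headless machine, one common, robosuite-agnostic setup is to select the EGL backend before robosuite is imported. A minimal sketch, assuming robosuite 1.4+ on the official mujoco bindings (whether it silences the shutdown messages depends on your driver stack):

import os

# Select the EGL backend before robosuite/mujoco are imported.
# MUJOCO_GL is read by the official mujoco bindings (used by robosuite 1.4+);
# PYOPENGL_PLATFORM steers PyOpenGL to its EGL implementation.
os.environ.setdefault("MUJOCO_GL", "egl")
os.environ.setdefault("PYOPENGL_PLATFORM", "egl")

import robosuite  # noqa: E402  (import must happen after the env vars are set)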
amandlek commented 3 months ago

This error is normal and expected, and can safely be ignored. I'm guessing the run with the debug flag ran normally, and the run without the debug flag didn't finish enough rollouts during evaluation, so the video was malformed.
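
One quick way to check whether the file is actually malformed, rather than merely unsupported by your player, is to decode it with imageio, the library robomimic writes videos with. A minimal sketch; the directory is taken from the run reported above and the filename follows the Lift_epoch_N.mp4 pattern in the logs, so adjust both to your own output:

import imageio

# Try to decode the rollout video frame by frame. A readable file yields
# frames; a malformed one raises somewhere during open or iteration.
path = "robomimic/bc_trained_models/test/20240723101813/videos/Lift_epoch_1.mp4"
try:
    n_frames = sum(1 for _ in imageio.get_reader(path))
    print(f"decoded {n_frames} frames from {path}")
except Exception as e:
    print(f"could not decode {path}: {e}")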

noooob-coder commented 3 months ago

I ignored the EGL_NOT_INITIALIZED messages and resolved the OpenGL error. However, when running the command python robomimic/scripts/train.py --config robomimic/exps/templates/bc.json --dataset datasets/lift/ph/low_dim_v141.hdf5 --debug, training stops after two epochs, and the results are written to /tmp/tmp_trained_models. Why does the training process stop after two epochs? Thank you!

============= Training Dataset =============
SequenceDataset (
    path=datasets/lift/ph/low_dim_v141.hdf5
    obs_keys=('object', 'robot0_eef_pos', 'robot0_eef_quat', 'robot0_gripper_qpos')
    seq_length=1
    filter_key=none
    frame_stack=1
    pad_seq_length=True
    pad_frame_stack=True
    goal_mode=none
    cache_mode=all
    num_demos=200
    num_sequences=9666
)

**************************************************
Warnings generated by robomimic have been duplicated here (from above) for convenience. Please check them carefully.
ROBOMIMIC WARNING(
    No private macro file found!
    It is recommended to use a private macro file
    To setup, run: python /home/user/robomimic/robomimic/scripts/setup_macros.py
)
**************************************************

100%|##########| 3/3 [00:00<00:00, 23.01it/s]
Train Epoch 1
{
    "Cosine_Loss": 0.5587675174077352,
    "L1_Loss": 0.0957380086183548,
    "L2_Loss": 0.19192640483379364,
    "Loss": 0.19192640483379364,
    "Optimizer/policy0_lr": 0.0001,
    "Policy_Grad_Norms": 0.22354567569952147,
    "Time_Data_Loading": 4.790623982747396e-05,
    "Time_Epoch": 0.0021804253260294597,
    "Time_Log_Info": 2.5431315104166665e-06,
    "Time_Process_Batch": 0.00014754931131998698,
    "Time_Train_Batch": 0.001972810427347819
}
video writes to /tmp/tmp_trained_models/test/20240724152323/videos/Lift_epoch_1.mp4
rollout: env=Lift, horizon=10, use_goals=False, num_episodes=2
100%|##########| 2/2 [00:01<00:00,  1.32it/s]

Epoch 1 Rollouts took 0.7548685073852539s (avg) with results:
Env: Lift
{
    "Horizon": 10.0,
    "Return": 0.0,
    "Success_Rate": 0.0,
    "Time_Episode": 0.025162283579508463,
    "time": 0.7548685073852539
}
save checkpoint to /tmp/tmp_trained_models/test/20240724152323/models/model_epoch_1_Lift_success_0.0.pth

Epoch 1 Memory Usage: 1730 MB

100%|##########| 3/3 [00:00<00:00, 357.70it/s]
Train Epoch 2
{
    "Cosine_Loss": 0.5759696165720621,
    "L1_Loss": 0.0873916173974673,
    "L2_Loss": 0.1788101146618525,
    "Loss": 0.1788101146618525,
    "Optimizer/policy0_lr": 0.0001,
    "Policy_Grad_Norms": 0.02232090537078572,
    "Time_Data_Loading": 4.622141520182292e-05,
    "Time_Epoch": 0.00014543930689493815,
    "Time_Log_Info": 3.5842259724934896e-06,
    "Time_Process_Batch": 7.510185241699219e-06,
    "Time_Train_Batch": 8.204380671183268e-05
}
video writes to /tmp/tmp_trained_models/test/20240724152323/videos/Lift_epoch_2.mp4
rollout: env=Lift, horizon=10, use_goals=False, num_episodes=2
100%|##########| 2/2 [00:01<00:00,  1.42it/s]

Epoch 2 Rollouts took 0.7024716138839722s (avg) with results:
Env: Lift
{
    "Horizon": 10.0,
    "Return": 0.0,
    "Success_Rate": 0.0,
    "Time_Episode": 0.02341572046279907,
    "time": 0.7024716138839722
}

Epoch 2 Memory Usage: 1753 MB

finished run successfully!
amandlek commented 3 months ago

This is precisely what the --debug flag is supposed to do: quickly sanity-check a training run with 2 epochs. To train for longer, simply omit the --debug flag.
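
For reference, a sketch of turning the template into a full-length run by editing the JSON directly. The field names follow robomimic's config structure and the values are the usual defaults for Lift, but verify both against your installed version; bc_lift.json is just a hypothetical output name:

import json

# Read the BC template and write a full-length variant of it.
with open("robomimic/exps/templates/bc.json") as f:
    cfg = json.load(f)

cfg["train"]["num_epochs"] = 2000              # instead of the 2 debug epochs
cfg["experiment"]["rollout"]["n"] = 50         # evaluation episodes per round
cfg["experiment"]["rollout"]["horizon"] = 400  # long enough for Lift to succeed

with open("bc_lift.json", "w") as f:
    json.dump(cfg, f, indent=4)

# then:
# python robomimic/scripts/train.py --config bc_lift.json --dataset datasets/lift/ph/low_dim_v141.hdf5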