dimikout3 / MarsExplorer

59 stars · 6 forks

Testing Trained PPO #6

Closed t-woodw closed 2 years ago

t-woodw commented 2 years ago

Hello, I've trained a PPO using the built-in methodology. I have the results with the checkpoints (for 3000 steps) in my ray_results folder.

I'm trying to figure out how to actually test the trained model (with visual output). After looking through all the code, I think the rollout.py file might be how to do that (please correct me if I'm wrong).

Using GeneralExplorationPolicy as the base directory, I run this at the command line: $ python tests/rollout/rollout.py /home/theuser/ray_results/PPO_custom-explorer_2022-10-24_13-32-173jrhexyp/checkpoint_2991 --run PPO --env mars_explorer:exploConf-v01 --episodes 40 --video-dir /home/theuser/mars_ppo_vids

But I'm met with this error:

Traceback (most recent call last):
  File "tests/rollout/rollout.py", line 493, in <module>
    run(args, parser)
  File "tests/rollout/rollout.py", line 292, in run
    agent = cls(env=args.env, config=config)
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 106, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 477, in __init__
    super().__init__(config, logger_creator)
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/ray/tune/trainable.py", line 249, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 630, in setup
    self._init(self.config, self.env_creator)
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 133, in _init
    self.workers = self._make_workers(
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 701, in _make_workers
    return WorkerSet(
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 79, in __init__
    remote_spaces = ray.get(self.remote_workers(
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/ray/worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::RolloutWorker.foreach_policy() (pid=223386, ip=192.168.1.92)
  File "python/ray/_raylet.pyx", line 443, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 366, in __init__
    self.env = _validate_env(env_creator(env_context))
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 571, in <lambda>
    lambda env_context: gym.make(env, **env_context)
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/gym/envs/registration.py", line 145, in make
    return registry.make(id, **kwargs)
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/gym/envs/registration.py", line 90, in make
    env = spec.make(**kwargs)
  File "/home/theuser/miniconda3/envs/GEP/lib/python3.8/site-packages/gym/envs/registration.py", line 60, in make
    env = cls(**_kwargs)
TypeError: __init__() got an unexpected keyword argument 'initial'
(pid=223386) pygame 2.1.2 (SDL 2.0.16, Python 3.8.5)
(pid=223386) Hello from the pygame community. https://www.pygame.org/contribute.html
(pid=223388) pygame 2.1.2 (SDL 2.0.16, Python 3.8.5)
(pid=223388) Hello from the pygame community. https://www.pygame.org/contribute.html

I noticed this comment (# check why conf is not compatible will RLlib (it works on standalone gym)) in explorer.py. Does this have something to do with my issue, or am I way off on something?

Is there a workaround for this, so that I can see the trained model in action?
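In case it helps anyone reproducing this, here is my reading of the last frames of the traceback (a sketch, not a confirmed diagnosis): when the env is given to RLlib as a string, trainer.py falls back to gym.make(env, **env_context), so every key restored into env_config is splatted into the environment's __init__. A tiny stand-in (class and keys made up, not the repo's real API) shows the mechanism:

```python
# Purely illustrative reproduction of the TypeError above; the class
# and config keys are invented, not the repo's actual API.
class ExplorerLike:
    def __init__(self, conf=None):
        self.conf = conf or {}

restored_env_config = {"initial": [0, 0], "conf": {"size": [21, 21]}}

try:
    # what gym.make(env, **env_context) ends up doing with the restored config
    ExplorerLike(**restored_env_config)
except TypeError as exc:
    print(exc)  # the message names the unexpected keyword 'initial'

# Handing the dict over whole (as an explicit env_creator can) is fine:
env = ExplorerLike(conf=restored_env_config)
```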

t-woodw commented 2 years ago

I still haven't been able to figure out a solution to this error. I would be very appreciative to anyone that could point me in the right direction!

t-woodw commented 2 years ago

@dimikout3 Sorry to bother you, but are you familiar with this issue? If not, could you suggest the best way to visually test a trained model with this repo?

dimikout3 commented 2 years ago

Hi @t-woodw, to be honest I am not familiar with the issue above. I do not think that the comment is related to it. I will try to reproduce your error and come back with some help.

t-woodw commented 2 years ago

> Hi @t-woodw, to be honest I am not familiar with the issue above. I do not think that the comment is related to it. I will try to reproduce your error and come back with some help.

Thanks! If it helps at all, the model I’m trying to load was trained using the runner.py file with these args: $ python trainners/runner.py -c trainners/trainnerV2.json -r Level-1

I’ve been able to trace through the issue, and I see that the ‘initial’ key is probably coming from the param.pkl file when it is unpickled with cloudpickle in the env_create portion of the config. Could this have something to do with not registering the env in Ray (it looks like that step is skipped in rollout.py), or with a mismatch of envs between runner.py and rollout.py?
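For reference, this is roughly how I dug into the saved config (the helper is my own sketch, not repo code; plain pickle can usually read what cloudpickle wrote, and the path in the comment is illustrative):

```python
import pickle

def load_env_config(params_path):
    """Return the env_config section of a saved RLlib trainer config.
    Illustrative helper, not part of the repo."""
    with open(params_path, "rb") as f:
        config = pickle.load(f)
    return config.get("env_config", {})

# e.g.:
# print(load_env_config(
#     "/home/theuser/ray_results/PPO_custom-explorer_<run_id>/params.pkl"))
```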

Also, can you say whether or not rollout.py is the way to run a trained model with visual output?

Thanks for your help so far!

dimikout3 commented 2 years ago

> I’ve been able to trace through the issue, and I see that the ‘initial’ is probably coming from the param.pkl file when it is unwrapped with cloudpickle in the env_create portion of the config. Does this maybe have something to do with not registering the env in Ray (it looks like that step is skipped in rollout.py), or a mismatch of envs between runner.py and rollout.py?

Yes, I believe that the problem is somewhere there. Unfortunately, I cannot invest time in fixing this issue because I am working on a different project, but I can probably support you.

> Also, can you say whether or not rollout.py is the way to run a trained model with visual output?

Yes, this is how I did it.

t-woodw commented 2 years ago

> I’ve been able to trace through the issue, and I see that the ‘initial’ is probably coming from the param.pkl file when it is unwrapped with cloudpickle in the env_create portion of the config. Does this maybe have something to do with not registering the env in Ray (it looks like that step is skipped in rollout.py), or a mismatch of envs between runner.py and rollout.py?
>
> Yes, I believe that the problem is somewhere there, unfortunately I can not invest time in fixing this issue, because I am working on a different project, but I can probably support you.
>
> Also, can you say whether or not rollout.py is the way to run a trained model with visual output?
>
> yes, this is how I did it.

Do you, by any chance, still have a trained (with runner.py) and working model (with rollout.py) you could upload with which I could compare output? Also, if you do, could you provide the args you pass at command line when you run rollout.py?

Thanks for your help so far!

t-woodw commented 2 years ago

Well, after trying a considerable number of things over the last week, I happened upon a file in the utils folder called fix_conf.py that seems to be aimed at changing the conf in the param.pkl file of trained models. I ran it, and it updated the env config so that I stopped getting this error when using rollout.py. (Notably, adding the missing env-registration code from runner.py also fixed the issue, but I opted for the fix_conf.py solution.)

It would have been really great if there were something in the README, or a comment in the runner.py or rollout.py files, about this.
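I haven't read fix_conf.py line by line, but the kind of repair it appears to perform — dropping env-config keys that the environment's __init__ cannot accept — can be sketched like this (my own illustration, not the repo's actual code):

```python
import inspect

def strip_unsupported_keys(env_config, env_cls):
    # Keep only the keys that env_cls.__init__ actually declares.
    # Illustrative sketch, not the repo's fix_conf.py logic.
    accepted = set(inspect.signature(env_cls.__init__).parameters) - {"self"}
    return {k: v for k, v in env_config.items() if k in accepted}

class DummyEnv:  # stand-in for the real explorer env
    def __init__(self, conf=None):
        self.conf = conf

cleaned = strip_unsupported_keys({"initial": [0, 0], "conf": {}}, DummyEnv)
print(cleaned)  # {'conf': {}}
```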

Now to see if there’s a fix for converting numpy arrays to tensors in PyTorch with CUDA 11.7 (so I can use my 3080 instead of my 1080).
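(In case anyone hits the same conversion problem: a generic precaution, not a confirmed fix for the CUDA 11.7 case, is to force a contiguous float32 array and choose the device at runtime:)

```python
import numpy as np
import torch

def to_tensor(arr, device=None):
    # Illustrative helper: a contiguous float32 array avoids common
    # stride/dtype errors when converting numpy arrays to torch tensors.
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    return torch.as_tensor(np.ascontiguousarray(arr, dtype=np.float32),
                           device=device)

t = to_tensor(np.arange(6).reshape(2, 3))
```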