araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License
1.12k stars 206 forks source link

BadZipFile when running PPO2. #109

Closed lubosz closed 3 years ago

lubosz commented 3 years ago

Describe the bug I can't run anything using the ppo2 algo. The file that it tries to open trained_agents/ppo2/CartPole-v1.pkl is not a valid zip file. It produces a zipfile.BadZipFile: File is not a zip file exception.

a2c, acer and acktr seem work, even though the pkl files there are also invalid zip files. The problem seems to be that it wants to open the pkls as zip file for ppo2.

$ python enjoy.py
WARNING:tensorflow:From /usr/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Traceback (most recent call last):
  File "stable_baselines/common/base_class.py", line 657, in _load_from_file
    with zipfile.ZipFile(load_path, "r") as file_:
  File "/usr/lib/python3.8/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.8/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "enjoy.py", line 182, in <module>
    main()
  File "enjoy.py", line 101, in main
    model = ALGOS[algo].load(model_path, env=load_env)
  File "stable_baselines/common/base_class.py", line 938, in load
    data, params = cls._load_from_file(load_path, custom_objects=custom_objects)
  File stable_baselines/common/base_class.py", line 689, in _load_from_file
    data, params = BaseRLModel._load_from_file_cloudpickle(load_path)
  File "stable_baselines/common/base_class.py", line 622, in _load_from_file_cloudpickle
    data, params = cloudpickle.load(file_)
TypeError: an integer is required (got type bytes)

System Info Describe the characteristic of your environment:

araffin commented 3 years ago

Hello, Stable-Baselines does not support TF 2.0+... but issue might come from your python version: see https://github.com/DLR-RM/stable-baselines3/issues/171

The issue seems to come from cloudpickle due to a change in internal python3 API.

It should work with python 3.6 and tensorflow 1.8.0+

lubosz commented 3 years ago

Thanks for your reply.

Stable-Baselines does not support TF 2.0+... Actually you can pretty easily run most TF 1.X code by changing an import:

-import tensorflow as tf
+import tensorflow.compat.v1 as tf

I currently patched stable-baselines, since the original baselines TF2 branch doesn't seem to go anywhere. I also found that you have your own baselines fork, I will look into it.

but issue might come from your python version: see DLR-RM/stable-baselines3#171

I will try to run it with an older Python version.

The issue seems to come from cloudpickle due to a change in internal python3 API.

It's really weird that only ppo2 and sac seem to fail because of this and I am getting the extra zip exception, in addition to the one from cloudpickle.

It should work with python 3.6 and tensorflow 1.8.0+

araffin commented 3 years ago

Actually you can pretty easily run most TF 1.X code by changing an import:

nice, but this probably won't work with the contrib imports.

You might take a look at https://github.com/hill-a/stable-baselines/issues/366 and https://github.com/hill-a/stable-baselines/issues/984

you have your own baselines fork,

?

I am getting the extra zip exception

Probably because the model was saved using the old format: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/base_class.py#L676

See documentation: https://stable-baselines.readthedocs.io/en/master/guide/save_format.html

It's really weird that only ppo2 and sac seem to fail because of this and I am getting the extra zip exception, in addition to the one from cloudpickle.

Look at the linked issue and at https://github.com/DLR-RM/stable-baselines3/issues/172, cloudpickle fail for only some objects like learning rate schedules (defined as lambdas) which are not present in a2c, acer or acktr.

lubosz commented 3 years ago

Thanks for the info.

Probably because the model was saved using the old format: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/base_class.py#L676

This is the solution to my issue. Version incompatibility for playing back the recorded plks. I assume that it will work when I record new ones on my setup.

It is very unfortunate that baselines sees this extensive forking without a clear upstream and maintainership.

?

I was describing stable-baselines3 as "your" fork, since you were linking the issue from it. https://github.com/DLR-RM/stable-baselines3

This looks interesting, unfortunately this is again some fork and not a branch that will get upstreamed: https://github.com/Stable-Baselines-Team/stable-baselines-tf2

araffin commented 3 years ago

It is very unfortunate that baselines sees this extensive forking without a clear upstream and maintainership.

?

I was describing stable-baselines3 as "your" fork, since you were linking the issue from it. https://github.com/DLR-RM/stable-baselines3

Yep, it is not a fork but the next version of Stable-Baselines, same API but much cleaner codebase (and so much easier to maintain and extend). It is now actively developped, v1 of SB3 is coming soon.

This looks interesting, unfortunately this is again some fork and not a branch that will get upstreamed: https://github.com/Stable-Baselines-Team/stable-baselines-tf2

oh, it's not a fork but an experimental repo that I made while we were thinking about the future of SB (see https://github.com/hill-a/stable-baselines/issues/733)