chiamp / muzero-cartpole

Applying DeepMind's MuZero algorithm to the cart pole environment in gym

ERROR #2

Open fede72bari opened 1 year ago

fede72bari commented 1 year ago

Dear Chiamp,

Thank you for sharing your work with the community. I was trying your MuZero code and fixed a few outdated parts related to the removed Monitor wrapper. Once those were fixed, I ran into this error:

=========== TESTING ===========
Total reward: 9.0
C:\Users\Federico\AppData\Roaming\Python\Python310\site-packages\gym\wrappers\record_video.py:41: UserWarning: WARN: Overwriting existing videos at D:\Dropbox\PROJECTS\MLTrading\muzero-cartpole\video folder (try specifying a different `video_folder` for the `RecordVideo` wrapper if this is not desired)
  logger.warn(
Total reward: 10.0
Total reward: 8.0
Total reward: 8.0
Total reward: 10.0
Total reward: 10.0
Total reward: 10.0
Total reward: 10.0
Total reward: 10.0
Total reward: 9.0

Iteration: 1    Total reward: 35.0      Time elapsed: 0.33502347469329835 minutes
Traceback (most recent call last):
  File "D:\Dropbox\PROJECTS\MLTrading\muzero-cartpole\main.py", line 301, in <module>
    self_play(network_model,config)
  File "D:\Dropbox\PROJECTS\MLTrading\muzero-cartpole\main.py", line 48, in self_play
    train(network_model,replay_buffer,optimizer,config)
  File "D:\Dropbox\PROJECTS\MLTrading\muzero-cartpole\main.py", line 184, in train
    loss += (1/num_unroll_steps) * ( mean_squared_error(true_reward,pred_reward) + mean_squared_error(true_value,pred_value) + binary_crossentropy(true_policy,pred_policy) ) # take the average loss among all unroll steps
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\keras\losses.py", line 2176, in binary_crossentropy
    backend.binary_crossentropy(y_true, y_pred, from_logits=from_logits),
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\keras\backend.py", line 5680, in binary_crossentropy
    return tf.nn.sigmoid_cross_entropy_with_logits(
ValueError: `logits` and `labels` must have the same shape, received ((1, 2) vs (2,)).

Did it happen to you too?

By the way, is it possible to skip the rendering through a config parameter?

Thank you.

chiamp commented 1 year ago

Hi @fede72bari, thank you very much for trying out my code and for bringing these bugs to my attention!

By the way, is it possible to skip the rendering through a config parameter?

To skip the rendering, set config['test']['record'] to False.
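
For example (a one-line sketch, assuming config refers to the config dictionary defined in main.py):

    config['test']['record'] = False  # do not record game renders during testing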

ValueError: logits and labels must have the same shape, received ((1, 2) vs (2,)).

This error is caused by a shape mismatch between true_policy (shape (1, 2)) and pred_policy (shape (2,)), found here.

Since this code was working 2 years ago when I first created this project, I imagine TensorFlow has since changed how either the binary_crossentropy function or Model inference behaves (i.e. either binary_crossentropy previously accepted shape arguments (1, 2) and (2,), or Model inference previously output a shape of (1, 2) for pred_policy).
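
As a rough illustration (the shapes below are taken from the error message; the actual fix on master may differ), one way to align the two is to add a batch dimension to pred_policy before computing the loss:

    import numpy as np
    from tensorflow.keras.losses import binary_crossentropy

    true_policy = np.array([[0.7, 0.3]], dtype=np.float32)  # shape (1, 2); values are illustrative
    pred_policy = np.array([0.6, 0.4], dtype=np.float32)    # shape (2,); values are illustrative

    pred_policy = np.expand_dims(pred_policy, axis=0)       # shape (1, 2), now matching true_policy
    loss = binary_crossentropy(true_policy, pred_policy)    # no longer raises the shape ValueError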

I've pushed an update that defaults to not recording game renders and fixes the shape mismatch error. Feel free to pull from master or update your code locally to reflect these changes.

Another thing: depending on your TensorFlow version, you may need to change from tensorflow.keras.optimizers import Adam to from tensorflow.keras.optimizers.legacy import Adam (found here).
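
A version-tolerant sketch of that import (assuming the legacy module only exists in newer TensorFlow releases and that the optimizer is built from the config['train'] entries; the exact code in main.py may differ):

    # Fall back to the regular Adam when the legacy module is unavailable (older TF versions).
    try:
        from tensorflow.keras.optimizers.legacy import Adam
    except ImportError:
        from tensorflow.keras.optimizers import Adam

    optimizer = Adam( learning_rate=config['train']['learning_rate'],
                      beta_1=config['train']['beta_1'],
                      beta_2=config['train']['beta_2'] )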

Let me know if this works for you and if you have any other questions!

fede72bari commented 1 year ago

Dear Marcus,

Thank you very much; with those two modifications all the errors went away. I launched the main.py script, but in the end it has not learned anything. I copy the final part of the log below:

Iteration: 693  Total reward: 10.0      Time elapsed: 275.46939491033555 minutes
Iteration: 694  Total reward: 9.0       Time elapsed: 275.70319045384724 minutes
Iteration: 695  Total reward: 9.0       Time elapsed: 275.9384833137194 minutes
Iteration: 696  Total reward: 9.0       Time elapsed: 276.1845866282781 minutes
Iteration: 697  Total reward: 8.0       Time elapsed: 276.3916021664937 minutes
Iteration: 698  Total reward: 9.0       Time elapsed: 276.6371001879374 minutes
Iteration: 699  Total reward: 9.0       Time elapsed: 276.86878376404445 minutes
Iteration: 700  Total reward: 10.0      Time elapsed: 277.1124445478121 minutes

=========== TESTING ===========
Total reward: 9.0
Total reward: 10.0
Total reward: 10.0
Total reward: 10.0
Total reward: 10.0
Total reward: 9.0
Total reward: 9.0
Total reward: 10.0
Total reward: 10.0
Total reward: 8.0

=========== TESTING ===========
Total reward: 10.0
Total reward: 9.0
Total reward: 9.0
Total reward: 9.0
Total reward: 9.0
Total reward: 8.0
Total reward: 9.0
Total reward: 10.0
Total reward: 9.0
Total reward: 10.0

Should I change any hyperparameters, or should the defaults give a good result anyway? Thanks.

chiamp commented 1 year ago

Hi @fede72bari, could you copy and paste the full config dictionary you're using?

One thing you can try is to increase the number of training games in config['self_play']['num_games'], although it should be able to get a reward of more than 10 after 700 games.
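
For example:

    config['self_play']['num_games'] = 5000  # the default in this repo is 700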

fede72bari commented 1 year ago

I am using the default one; I haven't touched anything. Here it is:

    config = { 'env': { 'env_name': env_attributes[env_key_name]['env_name'], # this string gets passed on to the gym.make() function to make the gym environment
                        'state_shape': env_attributes[env_key_name]['state_shape'], # used to define input shape for representation function
                        'action_size': env_attributes[env_key_name]['action_size'] }, # used to define output size for prediction function
               'model': { 'representation_function': { 'num_layers': 2, # number of hidden layers
                                                       'num_neurons': 256, # number of hidden units per layer
                                                       'activation_function': 'relu', # activation function for every hidden layer
                                                       'regularizer': L2(1e-3) }, # regularizer for every layer
                          'dynamics_function': { 'num_layers': 2,
                                                 'num_neurons': 256,
                                                 'activation_function': 'relu',
                                                 'regularizer': L2(1e-3) },
                          'prediction_function': { 'num_layers': 2,
                                                   'num_neurons': 256,
                                                   'activation_function': 'relu',
                                                   'regularizer': L2(1e-3) },
                          'hidden_state_size': 256 }, # size of hidden state representation
               'mcts': { 'num_simulations': 1e2, # number of simulations to conduct, every time we call MCTS
                         'c1': 1.25, # for regulating MCTS search exploration (higher value = more emphasis on prior value and visit count)
                         'c2': 19625 }, # for regulating MCTS search exploration (higher value = lower emphasis on prior value and visit count)
               'self_play': { 'num_games': 700, # number of games the agent plays to train on
                              'discount_factor': 1.0, # used when backpropagating values up mcts, and when calculating bootstrapped value during training
                              'save_interval': 100 }, # how often to save network_model weights and replay_buffer
               'replay_buffer': { 'buffer_size': 1e3, # size of the buffer
                                  'sample_size': 1e2 }, # how many games we sample from the buffer when training the agent
               'train': { 'num_bootstrap_timesteps': 500, # number of timesteps in the future to bootstrap true value
                          'num_unroll_steps': 1e1, # number of timesteps to unroll to match action trajectories for each game sample
                          'learning_rate': 1e-3, # learning rate for Adam optimizer
                          'beta_1': 0.9, # parameter for Adam optimizer
                          'beta_2': 0.999 }, # parameter for Adam optimizer
               'test': { 'num_test_games': 10, # number of times to test the agent using greedy actions
                         'record': False }, # True if you want to record the game renders, False otherwise
               'seed': 0
               }

I will now try increasing it to 5000 and let you know. Did you also use MuZero for other, more complicated tasks?

Federico.

fede72bari commented 1 year ago

Good morning Marcus,

the task will finish on my slow, old PC tomorrow. Right now I have reached slightly more than 2k iterations and nothing has changed; here is a copy of the last part of the log. Have you ever tried running the script with the latest TensorFlow/Keras versions?

=========== TESTING ===========
Total reward: 10.0
Total reward: 9.0
Total reward: 9.0
Total reward: 9.0
Total reward: 9.0
Total reward: 10.0
Total reward: 9.0
Total reward: 10.0
Total reward: 9.0
Total reward: 10.0

Iteration: 2101 Total reward: 10.0      Time elapsed: 853.9610003113746 minutes
Iteration: 2102 Total reward: 10.0      Time elapsed: 854.2539603511492 minutes
Iteration: 2103 Total reward: 8.0       Time elapsed: 854.5072117606799 minutes
Iteration: 2104 Total reward: 11.0      Time elapsed: 854.8174656788508 minutes
Iteration: 2105 Total reward: 10.0      Time elapsed: 855.0993361989657 minutes
Iteration: 2106 Total reward: 9.0       Time elapsed: 855.3646882335345 minutes
Iteration: 2107 Total reward: 9.0       Time elapsed: 855.629625248909 minutes
Iteration: 2108 Total reward: 9.0       Time elapsed: 855.9043658852577 minutes
Iteration: 2109 Total reward: 10.0      Time elapsed: 856.1888570070266 minutes
Iteration: 2110 Total reward: 9.0       Time elapsed: 856.4804415186246 minutes
Iteration: 2111 Total reward: 8.0       Time elapsed: 856.7644188721974 minutes
Iteration: 2112 Total reward: 8.0       Time elapsed: 857.0464346925418 minutes
Iteration: 2113 Total reward: 9.0       Time elapsed: 857.3334466497104 minutes
Iteration: 2114 Total reward: 8.0       Time elapsed: 857.6344233671824 minutes
Iteration: 2115 Total reward: 8.0       Time elapsed: 857.9098409533501 minutes
Iteration: 2116 Total reward: 9.0       Time elapsed: 858.146869846185 minutes

fede72bari commented 1 year ago

Hi again, here is the conclusion of the run with 5000 iterations:

Iteration: 4990 Total reward: 8.0       Time elapsed: 1580.9848891019822 minutes
Iteration: 4991 Total reward: 10.0      Time elapsed: 1581.337004073461 minutes
Iteration: 4992 Total reward: 10.0      Time elapsed: 1581.6365315318108 minutes
Iteration: 4993 Total reward: 10.0      Time elapsed: 1581.9259099324545 minutes
Iteration: 4994 Total reward: 10.0      Time elapsed: 1582.1918109059334 minutes
Iteration: 4995 Total reward: 9.0       Time elapsed: 1582.4435933113098 minutes
Iteration: 4996 Total reward: 11.0      Time elapsed: 1582.756149895986 minutes
Iteration: 4997 Total reward: 8.0       Time elapsed: 1583.2015126506487 minutes
Iteration: 4998 Total reward: 9.0       Time elapsed: 1583.4943644444147 minutes
Iteration: 4999 Total reward: 9.0       Time elapsed: 1583.8216611981393 minutes
Iteration: 5000 Total reward: 9.0       Time elapsed: 1584.0659927924473 minutes

=========== TESTING ===========
Total reward: 9.0
Total reward: 9.0
Total reward: 8.0
Total reward: 10.0
Total reward: 10.0
Total reward: 8.0
Total reward: 9.0
Total reward: 10.0
Total reward: 9.0
Total reward: 9.0

=========== TESTING ===========
Total reward: 8.0
Total reward: 9.0
Total reward: 10.0
Total reward: 11.0
Total reward: 9.0
Total reward: 10.0
Total reward: 8.0
Total reward: 10.0
Total reward: 9.0
Total reward: 9.0

I think there could be a silly "bug" related to the new TensorFlow versions; could that be it?

chiamp commented 1 year ago

Hmm, I'm not sure what it could be if you are running the code unaltered from master. Out of curiosity, could you try changing from tensorflow.keras.optimizers import Adam to from tensorflow.keras.optimizers.legacy import Adam here, and then re-run main.py?

Also, what versions of tensorflow, numpy and gym are you using?

Did you also use MuZero for other, more complicated tasks?

I've only tried it on the cartpole environment.

fede72bari commented 1 year ago

Sure, I am happy to collaborate. Here is the list of packages from the Anaconda environment in which the script was run:


**gym                       0.21.0                   pypi_0    pypi**
gym-anytrading            1.3.2                    pypi_0    pypi
gym-notices               0.0.8              pyhd8ed1ab_0    conda-forge
**gymnasium                 0.26.3          py310haa95532_0**
gymnasium-notices         0.0.1              pyh1a96a4e_0    conda-forge
[...]
nevergrad                 0.6.0                    pypi_0    pypi
**numpy                     1.24.2                   pypi_0    pypi**
oauthlib                  3.2.2                    pypi_0    pypi
openssl                   3.1.0                hcfcfb64_3    conda-forge
opt-einsum                3.3.0                    pypi_0    pypi
[...]
**tensorflow                2.11.0                   pypi_0    pypi**
tensorflow-estimator      2.11.0                   pypi_0    pypi
tensorflow-intel          2.11.0                   pypi_0    pypi
tensorflow-io-gcs-filesystem 0.30.0                   pypi_0    pypi
tensorflow-probability    0.19.0                   pypi_0    pypi
termcolor                 2.2.0                    pypi_0    pypi

Concerning the other issue, I use the legacy version, otherwise it crashes:

#from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.legacy import Adam

It could be interesting to see whether you get the same troubles if you use the same versions of the mentioned libraries (which should be the latest, or nearly the latest). Maybe I can try to update Gym; I think the latest version should be 0.23, but I am not sure.

PS: Gym versions 0.23 and 0.26 give trouble since "Monitor" has been removed, so I assume you tested the model with a version earlier than 0.23.

chiamp commented 1 year ago

Could you try running it on Python 3.7.4 with the following versions: numpy==1.16.4, tensorflow==2.3.0, gym==0.17.1?

Concerning the other issue, I use the legacy version, otherwise it crashes:

What is the error message when it crashes?

It could be interesting to see whether you get the same troubles if you use the same versions of the mentioned libraries (which should be the latest, or nearly the latest). Maybe I can try to update Gym; I think the latest version should be 0.23, but I am not sure.

What version of Python are you using, btw? I will try to run it with your versions when I can find the time. Sorry for the late reply!

fede72bari commented 1 year ago

Dear Marcus,

sorry for the delay, busy days. Here is the error I get in my environment when using the old (non-legacy) optimizer:

2023-05-22 21:46:39.021753: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

=========== TESTING ===========
Total reward: 9.0
Total reward: 10.0
Total reward: 10.0
Total reward: 10.0
Total reward: 10.0
Total reward: 9.0
Total reward: 9.0
Total reward: 9.0
Total reward: 10.0
Total reward: 10.0

Iteration: 1    Total reward: 9.0       Time elapsed: 0.08143569231033325 minutes
Traceback (most recent call last):
  File "D:\Dropbox\PROJECTS\MLTrading\muzero-cartpole\main.py", line 297, in <module>
    self_play(network_model,config)
  File "D:\Dropbox\PROJECTS\MLTrading\muzero-cartpole\main.py", line 48, in self_play
    train(network_model,replay_buffer,optimizer,config)
  File "D:\Dropbox\PROJECTS\MLTrading\muzero-cartpole\main.py", line 190, in train
    optimizer.apply_gradients( zip( grads[1], network_model.dynamics_function.trainable_variables ) )
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\keras\optimizers\optimizer_experimental\optimizer.py", line 1140, in apply_gradients
    return super().apply_gradients(grads_and_vars, name=name)
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\keras\optimizers\optimizer_experimental\optimizer.py", line 634, in apply_gradients
    iteration = self._internal_apply_gradients(grads_and_vars)
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\keras\optimizers\optimizer_experimental\optimizer.py", line 1166, in _internal_apply_gradients
    return tf.__internal__.distribute.interim.maybe_merge_call(
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\tensorflow\python\distribute\merge_call_interim.py", line 51, in maybe_merge_call
    return fn(strategy, *args, **kwargs)
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\keras\optimizers\optimizer_experimental\optimizer.py", line 1216, in _distributed_apply_gradients_fn
    distribution.extended.update(
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 2637, in update
    return self._update(var, fn, args, kwargs, group)
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 3710, in _update
    return self._update_non_slot(var, fn, (var,) + tuple(args), kwargs, group)
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 3716, in _update_non_slot
    result = fn(*args, **kwargs)
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 595, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\keras\optimizers\optimizer_experimental\optimizer.py", line 1213, in apply_grad_to_update_var
    return self._update_step(grad, var)
  File "C:\Users\Federico\anaconda3\envs\mltrading_base\lib\site-packages\keras\optimizers\optimizer_experimental\optimizer.py", line 216, in _update_step
    raise KeyError(
KeyError: 'The optimizer cannot recognize variable dense_4/kernel:0. This usually means you are trying to call the optimizer to update different parts of the model separately. Please call `optimizer.build(variables)` with the full list of trainable variables before the training loop or use legacy optimizer `tf.keras.optimizers.legacy.{self.__class__.__name__}.'
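
If I read the error message correctly, the non-legacy optimizer wants to be built with the full list of trainable variables before the training loop. An untested sketch of that workaround (assuming network_model exposes representation_function and prediction_function the same way as dynamics_function in main.py) would be:

    # Build the optimizer once with every trainable variable it will ever update,
    # so the later apply_gradients calls on the separate sub-networks are recognized.
    all_variables = ( network_model.representation_function.trainable_variables
                      + network_model.dynamics_function.trainable_variables
                      + network_model.prediction_function.trainable_variables )
    optimizer.build(all_variables)

For now I have simply kept the legacy import.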

Concerning the downgrade, for instance to numpy==1.16.4, I tried, but there is a chain effect: to change the numpy version I would first have to change the Python version, and if I change the Python version, all the other dependent packages block the change.