DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

"can't convert np.ndarray of type numpy.object_" on first reset() - something changed with last release? #1534

Closed fede72bari closed 1 year ago

fede72bari commented 1 year ago

❓ Question

I am using this SB3 release "pip install git+https://github.com/DLR-RM/stable-baselines3", re-installed at every new session of a Kaggle notebook and recently updated in my PC environment as well. Suddenly I get the following error when the first episode ends and the reset function is called by SB3 to start the second one:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[25], line 35
     27 model = MaskablePPO("MlpPolicy", 
     28             env, 
     29             verbose=0, 
   (...)
     32             learning_rate=0.0003,
     33             ent_coef = 0.5)
     34            #tensorboard_log="/ppo_cartpole_tensorboard/")
---> 35 model.learn(total_timesteps=total_steps)#, progress_bar=True)

File /opt/conda/lib/python3.10/site-packages/sb3_contrib/ppo_mask/ppo_mask.py:526, in MaskablePPO.learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, use_masking, progress_bar)
    523 callback.on_training_start(locals(), globals())
    525 while self.num_timesteps < total_timesteps:
--> 526     continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, self.n_steps, use_masking)
    528     if continue_training is False:
    529         break

File /opt/conda/lib/python3.10/site-packages/sb3_contrib/ppo_mask/ppo_mask.py:330, in MaskablePPO.collect_rollouts(self, env, callback, rollout_buffer, n_rollout_steps, use_masking)
    324 for idx, done in enumerate(dones):
    325     if (
    326         done
    327         and infos[idx].get("terminal_observation") is not None
    328         and infos[idx].get("TimeLimit.truncated", False)
    329     ):
--> 330         terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
    331         with th.no_grad():
    332             terminal_value = self.policy.predict_values(terminal_obs)[0]

File /opt/conda/lib/python3.10/site-packages/stable_baselines3/common/policies.py:268, in BaseModel.obs_to_tensor(self, observation)
    265     # Add batch dimension if needed
    266     observation = observation.reshape((-1, *self.observation_space.shape))
--> 268 observation = obs_as_tensor(observation, self.device)
    269 return observation, vectorized_env

File /opt/conda/lib/python3.10/site-packages/stable_baselines3/common/utils.py:483, in obs_as_tensor(obs, device)
    475 """
    476 Moves the observation to the given device.
    477 
   (...)
    480 :return: PyTorch tensor of the observation on a desired device.
    481 """
    482 if isinstance(obs, np.ndarray):
--> 483     return th.as_tensor(obs, device=device)
    484 elif isinstance(obs, dict):
    485     return {key: th.as_tensor(_obs, device=device) for (key, _obs) in obs.items()}

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Since I have not touched the involved part of the code of my custom environment in the last weeks and the provided data is always the same, I suspect that something could have changed in the last version of SB3, if any was released in the last few days. In fact, before the last two days I never had a similar issue, and the code ran without error until the learning task reached the step limit. I have checked for "numpy.object_" in the observation data at every _get_obs function call with this code:

        # Note: dtype is a property of the whole array, not of single
        # elements, so this checks the array itself for dtype object
        if np.issubdtype(concatenated.dtype, object):
            # Print every element with its indices to spot the non-numeric ones
            for row, col in np.ndindex(concatenated.shape):
                element = concatenated[row, col]
                print(f"Element '{element}' of type {type(element).__name__} at row {row}, column {col}.")
            print('observation array:')
            print(concatenated)
            print('observation array type:')
            print(type(concatenated))
            print('self:')
            print(self)

but this condition is never reached.
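
For context, dtype=object usually appears when an array mixes types or contains ragged rows. A minimal illustration (not the actual environment code) that reproduces the exact TypeError above:

    import pandas as pd
    import torch as th

    # A DataFrame with mixed column types converts to an object-dtype array
    arr = pd.DataFrame({"price": [1.0, 2.0], "label": ["a", "b"]}).to_numpy()
    print(arr.dtype)  # object

    # torch then refuses the conversion, exactly as in the traceback
    th.as_tensor(arr)  # TypeError: can't convert np.ndarray of type numpy.object_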

I do not have enough expertise to inspect the SB3 code, but after reading the error log I wonder whether this segment of code

    324 for idx, done in enumerate(dones):
    325     if (
    326         done
    327         and infos[idx].get("terminal_observation") is not None
    328         and infos[idx].get("TimeLimit.truncated", False)
    329     ):
--> 330         terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]

refers to the info returned by the environment in addition to the observations, or whether it is something different. The doubt arises because I used a numpy array as the data structure for the returned observations, but I kept the freedom (maybe wrong or risky, but certainly convenient for other uses) to return a pandas DataFrame as extra info. Could this be the problem, and in that case are there any constraints/requirements on the info data?


qgallouedec commented 1 year ago

Since I have not touched the involved part of the code of my custom environment

Can you provide minimal code to reproduce this? Please refer to the custom env issue template. Also, please provide the system info (instructions are also in the issue template).

fede72bari commented 1 year ago

I cannot release the entire environment yet, but possibly I can publish the reset(), _get_obs(), _get_info() and maybe the init functions, provided that is OK with the policies here, since my work concerns a sensitive topic (araffin, in case, is it OK if I publish part of the code even if some names can suggest the topic I am working on?). In the meantime, does anybody know whether in the last few days there was any change to the part of the code that manages the terminal state of an episode during the learning process and the related observation data? Secondly, does anybody know whether here

terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]

the info array refers to the info returned by the step() and reset() functions of the environment, or whether it is something else? Can the returned info be a pandas DataFrame?

Thank you again; I await opinions on whether it is OK to publish part of an environment built for finance purposes.

qgallouedec commented 1 year ago

I cannot release the entire environment yet

And we certainly don't want you to. We just need a minimal example to reproduce the error.

the info array refers to the info returned by the step() and reset() functions

It is indeed the info returned by step and reset. It must have type Dict[str, Any], so yes, it can contain pandas objects.
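
For illustration, a minimal sketch of a valid info dict (the values here are hypothetical):

    from typing import Any, Dict
    import pandas as pd

    # Keys must be strings; values can be any Python object,
    # including pandas Timestamps and DataFrames
    info: Dict[str, Any] = {
        "datetime": pd.Timestamp("2021-08-02 01:31:00"),
        "state": pd.DataFrame({"n_open_positions": [0, 1]}),
    }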

Once again, I advise you to follow the custom env issue template, as a lot of information is still missing (in addition to that already mentioned): system info, check_env output, etc. And I may not be able to help you if you don't provide this information.

fede72bari commented 1 year ago

Dear Quentin,

I apologize for the late answer, but I needed time to crosscheck everything, above all because I found that the issue was not occurring in every condition. Right now I am perfectly able to reproduce it, and I found which parameter and which value trigger the error, even if I don't know why. But let's proceed in order; I will try to answer your request for more contextual information here. If there is anything else I should report, please let me know:

/opt/conda/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py:97: UserWarning: Discrete action space with a non-zero start is not supported by Stable-Baselines3. You can use a wrapper or update your action space.
  warnings.warn(

actually, this is a little bit cryptic to me. May I ask what it refers to?

/opt/conda/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py:238: UserWarning: Your observation has an unconventional shape (neither an image, nor a 1D vector). We recommend you to flatten the observation to have only a 1D vector or use a custom policy to properly process the data.
  warnings.warn(

this may be related to the fact that I am returning a number of observation rows equal to the buffer size. I thought, but I could be mistaken, that I had to manage in the environment the collection of enough observations to correctly fill the buffer. Shouldn't I do that?

AssertionError: The observation returned by the `reset()` method does not match the data type (cannot cast) of the given observation space Box(-100000000.0, 100000000.0, (512, 208), float32). Expected: float32, actual dtype: object

Printing the type of observations I get:

obs type: <class 'numpy.ndarray'>, which appears to be different from "dtype: object"

!pip install git+https://github.com/DLR-RM/stable-baselines3
!pip install git+https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

and I am using Gymnasium, not Gym

# RL ENVS
import gymnasium as gym
from gymnasium import spaces
from gymnasium.utils import seeding

After this installation, here are the results of printing the version numbers of the main libraries:

import sys
import numpy as np
import pandas as pd
import stable_baselines3

print('gym v:' + gym.__version__)
print('numpy v:' + str(np.__version__))
print('pandas v:' + str(pd.__version__))
print('Python version:' + str(sys.version))
print('stable_baselines3 v:' + str(stable_baselines3.__version__))

gym v:0.28.1
numpy v:1.23.5
pandas v:1.5.3
Python version:3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
stable_baselines3 v:2.0.0a11

The problem came out with reset() once the first episode was done; the strange thing was that the environment's init function uses the very same reset function run by the agent at the end of each episode, so the trouble is not triggered at init, but after an episode is done. I first checked the data returned by the reset function as info, but they appear to be perfectly equal. My info structure is a dictionary with 4 fields: the current datetime, the current OHLC prices, the current mask_array, and a DataFrame with all the previous and current states (to be used for debugging and statistics).

    def _get_info(self):

        return {"datetime": self.datetime_index[self.step_counter],  #self.indicators.index[self.step_counter].strftime("%Y-%m-%d %H:%M:%S"),
                "prices": self.prices.iloc[self.step_counter], 
                "mask": self.mask_array,
                "state": self.state}

Here are some runtime checks on the returned info structure:

RESET, NEW EPISODE
obs shape: (512, 208)
obs type: <class 'numpy.ndarray'>
info.datetime : 2021-08-02 01:31:00-05:00
info.prices : open     4412.00
close    4412.75
high     4413.00
low      4411.75
Name: 2021-08-02 01:31:00-05:00, dtype: float64
info.mask shape: (5,)
info.state shape: (512, 27)

Episode number: 2
    episode_trades_n: 1696
        positive_episode_trades_n: 458
        negative_episode_trades_n: 456
        flat_episode_trades_n: 885
    episode_steps: 2560
    Rewards: -1.2000000000000002
    self.equity: 199596.0

[...other episodes...]

**terminated: False**
**truncated: True**

Episode number: 2
    episode_trades_n: 13381
        positive_episode_trades_n: 3435
        negative_episode_trades_n: 3508
        flat_episode_trades_n: 6653
    episode_steps: 19969
    Rewards: -0.3
    self.equity: 196722.75
RESET, NEW EPISODE
obs shape: (512, 208)
obs type: <class 'numpy.ndarray'>
info.datetime : 2021-08-02 01:31:00-05:00
info.prices : open     4412.00
close    4412.75
high     4413.00
low      4411.75
Name: 2021-08-02 01:31:00-05:00, dtype: float64
info.mask shape: (5,)
info.state shape: (512, 27)

The issue was not triggered every time and now I know why. The episode could end for two reasons:

  1. because the last step reached the end of the training data: truncated = True
  2. or because the agent has lost too much money (the limit can be defined at environment instantiation): terminated = True

Well, the issue was not occurring most of the time because it is triggered only when the truncated variable is True. In the first tests the parameters induced the episode to close because of too high a loss; right now the losses are limited (hoping one day to have gains) and the model reaches the end of the training data.
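
In Gymnasium terms, the two endings correspond to the two flags returned by step(). A minimal sketch of the logic just described (min_equity and data are hypothetical names):

    # Episode end inside step(): either the agent lost too much money
    # (terminated) or the last step reached the end of the training data
    # (truncated)
    terminated = self.equity < self.min_equity
    truncated = self.step_counter >= len(self.data) - 1
    return observation, reward, terminated, truncated, info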

So to generate the trouble again, it should be enough to let an episode end by reaching the end of the training data (truncated = True), provided that I am passing as observation a 2D array whose first dimension, the number of rows, is equal to the buffer size; that structure could itself be wrong and part of the problem.

I don't know if this is enough to inspire a possible understanding.

qgallouedec commented 1 year ago

actually, this is a little bit cryptic to me. May I ask what it refers to?

Yep, see #913. The easiest way is probably to create a wrapper to shift the action value.
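
A minimal sketch of such a wrapper (ShiftActionWrapper is a hypothetical name): it exposes a zero-based Discrete(n) to the agent and shifts the chosen action back before passing it to the wrapped env.

    import gymnasium as gym
    from gymnasium import spaces

    class ShiftActionWrapper(gym.ActionWrapper):
        def __init__(self, env):
            super().__init__(env)
            # The agent now picks actions in [0, n); the env keeps its start
            self.start = env.action_space.start
            self.action_space = spaces.Discrete(env.action_space.n)

        def action(self, action):
            # Shift the zero-based action back into the env's range
            return action + self.start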

this may be related to the fact that I am returning a number of observation rows equal to the buffer size. I thought, but I could be mistaken, that I had to manage in the environment the collection of enough observations to correctly fill the buffer. Shouldn't I do that?

I'm not sure I understand. Here we're dealing with the environment, which is independent of the model, and therefore of the buffer. I assume you're using a 2D (or n-D) array as an observation. If so, it's highly inadvisable to do this with SB3. Once again, the simplest thing to do is to wrap your environment with a wrapper that flattens the observation, for example: https://gymnasium.farama.org/api/wrappers/observation_wrappers/#gymnasium.wrappers.FlattenObservation
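
Usage is a one-liner (CustomEnv stands in for your environment):

    from gymnasium.wrappers import FlattenObservation

    # SB3 then sees a 1D Box observation instead of a 2D one
    env = FlattenObservation(CustomEnv())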

Printing the type of observations I get:

No, you've printed the Python type with the built-in type function. Use instead the dtype attribute of the array.
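
The difference in a minimal example (the None entry forces dtype=object):

    import numpy as np

    obs = np.array([1.0, None])
    print(type(obs))  # <class 'numpy.ndarray'>: the Python type, always the same
    print(obs.dtype)  # object: the element dtype that SB3 and PyTorch check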

At this point, there are already a lot of things wrong with your environment that are bound to create problems with SB3. So I suggest you work on them, and correct your environment. If the problem persists, I invite you to work on a minimal code to reproduce the error.

fede72bari commented 1 year ago

Thank you Quentin, I agree, even if the only remaining trouble in my code seems to be the one linked to the dimensions of the returned observations. I had already managed the issue of the non-zero start of the discrete space by subtracting the correct value at each step. The only doubt I have now is what the SB3 model expects when the hyperparameter batch_size is set to a value N higher than 1: does it expect as observation a mono-dimensional array (just one "row" of observed values), or does it expect a number of rows equal to N? In the first case the buffer should be managed internally by the model, I think; in the second case it is passed from the environment at each step/reset call, which was my first solution, probably not needed and not correct.

qgallouedec commented 1 year ago

You have to understand that the model and the environment are really independent. If your environment is correctly built (read: if it passes the env checker tests), then it can be used with SB3, regardless of any hyperparameters, especially batch size.

when the hyperparameter batch_size is set to a value N higher than 1

And in SB3, there's no difference between a batch size of 1 or greater.

does it expect as observation a mono-dimensional array (just one "row" of observed values)

The answer is in the output of the env checker:

  • second warning
/opt/conda/lib/python3.10/site-packages/stable_baselines3/common/env_checker.py:238: UserWarning: Your observation has an unconventional shape (neither an image, nor a 1D vector). We recommend you to flatten the observation to have only a 1D vector or use a custom policy to properly process the data.
  warnings.warn(

I'm a bit worried that this issue is losing readability for other users who might face the same problem. So I suggest that, if you still have the problem initially mentioned, you share minimal code so that we can explain where the problem comes from. If you encounter other problems, please open a new dedicated issue.

fede72bari commented 1 year ago

Dear Quentin, I again agree with you that it is important to clarify the context by providing part of the code. It is perfectly clear to me that the environment and the model are two independent entities. But buffering the experiences is something in the middle that stores the outputs of the environment as inputs for the model. A previous small experience of mine with TF-Agents by TensorFlow, where the buffer must be defined externally and passed to the agent at instantiation, together with not finding something similar in the SB3 PPO model definition, unconsciously led me to the, probably wrong, conclusion that "somebody", the environment, should have buffered the last N observations for the model, where N was the size of the batch (batch_size = N).

Consequently, this is the code that returned, at each step and each reset, the last N observations (the current one plus the past N-1 ones) and that triggered the issue originally described in this post, but only when the event truncated = True occurs; it doesn't appear at each step, nor does it occur when the end state is due to the condition terminated = True.

    def _get_obs(self):

        # return a batch of indicators and volume of open positions

        # Retrieve last sampling_window states and indicators
        last_states = self.state.iloc[self.step_counter-self.sampling_window+1:self.step_counter+1, :] 
        last_indicators = self.indicators[self.step_counter-self.sampling_window+1:self.step_counter+1, :]

        # Keep just n_open_positions from last_states as feature for the NN 
        reduced_states = last_states['n_open_positions'].to_numpy()  
        reduced_states = reduced_states.reshape(-1, 1)

        # indicators and reduced state columns concatenation 
        concatenated = np.concatenate((last_indicators,reduced_states),axis=1)

        return concatenated #concatenated.to_numpy(dtype=np.float32)

In this case the function _get_obs(), used in both the reset() and step() functions, returns a batch of N observations, the N-th of which is the most recent one. This was the version that triggered the issue both at learning time and with the checker.

Instead, the issue disappears when returning just the last observation array, as coded below:

    def _get_obs(self):

        # return just the last group of indicators and volume of open positions
        last_states = self.state.iloc[self.step_counter]
        reduced_states = last_states['n_open_positions']
        reduced_states = np.array([reduced_states])

        last_indicators = self.indicators[self.step_counter]

        # indicators and reduced state columns concatenation
        concatenated = np.concatenate([last_indicators, reduced_states])

        return concatenated

This last code works correctly both in the learning process and with the checker. I am still wondering whether the minibatch of observations coming from the current and past N-1 steps is correctly buffered and managed by the model itself, as at this point I guess it is.

qgallouedec commented 1 year ago

the environment, should have buffered the last N observations for the model, where N was the size of the batch (batch_size = N).

Ok, I understand, but there's a vocabulary problem, because the batch size is actually the number of interactions sampled by the model at learning time. What you're describing sounds more like the number of previous observations stacked.

As I still don't have complete minimal code, it's difficult to help you. I just have the impression that the observation size is not consistent from one step to the next. It seems to me that the size in the second dimension depends on the number of timesteps that have already elapsed. This is a major problem you need to solve. As it looks like you're trying to integrate some sort of "near past" into the returned observation, I'd advise you to have a look at Recurrent PPO in the SB3 contrib repo. You can also check out the VecFrameStack wrapper, which stacks the last n frames of the wrapped env. (It works with image observations, but it's easy to adapt to vector obs.)
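
Such shape inconsistency is easy to produce with the slicing pattern of the earlier _get_obs(): a negative iloc start is interpreted from the end of the frame, so the slice silently shrinks early in an episode (illustrative values):

    import numpy as np
    import pandas as pd

    state = pd.DataFrame(np.zeros((10, 3)))
    sampling_window, step_counter = 5, 2  # early in the episode

    # The start index is negative -> pandas counts it from the end of the frame
    window = state.iloc[step_counter - sampling_window + 1:step_counter + 1]
    print(window.shape)  # (0, 3) instead of the expected (5, 3)

As for VecFrameStack, usage looks like this (CustomEnv is a placeholder and n_stack=4 an arbitrary choice):

    from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

    # Stack the last 4 observations outside the env, instead of building
    # the history inside _get_obs()
    venv = DummyVecEnv([lambda: CustomEnv()])
    venv = VecFrameStack(venv, n_stack=4)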

Finally, I implore you not to feed this issue any further until you've converged on minimal code to reproduce your error. This issue should be of use to everyone, and I'm simply assisting you in your debugging.

fede72bari commented 1 year ago

Thanks. In my opinion the issue can be closed, since its origin has been identified and can be summarized as follows: the environment was returning as observation a 2D array stacking the current and the previous N-1 steps (intended to fill the model's buffer) instead of the single current observation; returning only the last observation array made the error disappear.

Should there be a further need to investigate why the issue occurs only when the truncated = True event happens and not in the other cases, it should be possible to reproduce it with any environment by modifying the function that returns the observation along the lines of my first example, which returns the observations of the current and previous N-1 steps.