DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

I have seen the problem in issue #425, but I have new questions about the off-policy algorithm. #988

Closed Zero1366166516 closed 2 years ago

Zero1366166516 commented 2 years ago

Important Note: We do not do technical support, nor consulting and don't answer personal questions per email. Please post your question on the RL Discord, Reddit or Stack Overflow in that case.

📚 Documentation

The problem with the off-policy network has been bothering me for several days. Following the example, I created a class (CustomCNN) as the feature extractor and defined policy_kwargs = dict(features_extractor_class=CustomCNN, net_arch=dict(qf=[256, 256], pi=[256, 256])). A CNN is used as the feature extractor; the code is shown below.

The following is the class I modified from the example, because I want a feature extractor that extracts features from a time series using a CNN:

class CustomCNN(BaseFeaturesExtractor):
    """
    :param observation_space: (gym.Space)
    :param features_dim: (int) Number of features extracted.
        This corresponds to the number of units for the last layer.
    """

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 1):
        super(CustomCNN, self).__init__(observation_space, features_dim)
        # We assume CxHxW images (channels first)
        # Re-ordering will be done by pre-preprocessing or wrapper
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv1d(self.features_dim, n_input_channels, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Conv1d(n_input_channels, self.features_dim, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, self.features_dim), nn.Tanh())

Now the problem is in the forward function. On the first sampling, the observation has shape [1, 1, 13]; on the second, it has shape [1, 128, 13]. So I added a check: if the shape changes, redefine the nn.Sequential.

def forward(self, observations: th.Tensor) -> th.Tensor:
    n_flatten = np.array(observations).shape[1]
    features_dim = np.array(observations).shape[0]
    print(features_dim, n_flatten)
    if features_dim != 1:
        self.cnn = nn.Sequential(
            nn.Conv1d(features_dim, n_flatten, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Conv1d(n_flatten, features_dim, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.Tanh())
    return self.linear(self.cnn(observations))

The error traceback is:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.1.3\plugins\python-ce\helpers\pydev\pydevd.py", line 1491, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.1.3\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents + "\n", file, 'exec'), glob, loc)
  File "C:/Users/Administrator/PycharmProjects/demo/utils/models.py", line 487, in <module>
    trained_sac = agent.train_model(
  File "C:/Users/Administrator/PycharmProjects/demo/utils/models.py", line 409, in train_model
    model = model.learn(total_timesteps=total_timesteps, tb_log_name=tb_log_name)
  File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\sac.py", line 292, in learn
    return super(SAC, self).learn(
  File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\common\off_policy_algorithm.py", line 366, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\sac.py", line 206, in train
    actions_pi, log_prob = self.actor.action_log_prob(replay_data.observations)
  File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\policies.py", line 180, in action_log_prob
    mean_actions, log_std, kwargs = self.get_action_dist_params(obs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\stable_baselines3\sac\policies.py", line 163, in get_action_dist_params
    latent_pi = self.latent_pi(features)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
    input = module(input)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x128 and 1x256)

I know there is a fully connected layer after the feature extractor, but why does [1, 1, 13] work while [1, 128, 13] raises an error? I'm puzzled. Also, could you add a more detailed introduction to the documentation about the feature extractor and the fully connected layers? I don't know if my analysis is correct; please take a look. Thank you very much! I'll send you some more code of the model:

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    net_arch=dict(qf=[256, 256], pi=[256, 256])
)

def get_model(
    self,
    model_name: str,
    policy: str = "MlpPolicy",
    # policy: str = "MultiInputPolicy",
    policy_kwargs: dict = policy_kwargs,
    model_kwargs: dict = None,
    verbose: int = 1
) -> Any:
    # print("set Debug!")
    if model_name not in MODELS:
        raise NotImplementedError("NotImplementedError")
    if model_kwargs is None:
        model_kwargs = MODEL_KWARGS[model_name]
    if "action_noise" in model_kwargs:
        n_actions = self.env.action_space.shape[-1]                          
        model_kwargs["action_noise"] = NOISE[model_kwargs["action_noise"]](
            mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)
        )
    print(model_kwargs)   
    model = MODELS[model_name](          
        policy=policy,
        env=self.env,
        tensorboard_log="{}/{}".format(config.TENSORBOARD_LOG_DIR, model_name),
        verbose=verbose,
        policy_kwargs=policy_kwargs,
        **model_kwargs
    )
    return model
def train_model(
    self, model: Any, tb_log_name: str, total_timesteps: int = 5000
    ) -> Any:
    """train model"""
    model = model.learn(total_timesteps=total_timesteps, tb_log_name=tb_log_name)
    return model

if name == "main": from pull_data import Pull_data from preprocessors import FeatureEngineer, split_data from utils import config import time

pull data

#df = Pull_data(config.SSE_50[:2], save_data=False).pull_data()
df = Pull_data(config.SSE_50[:2]).pull_data()
df = FeatureEngineer().preprocess_data(df)
df = split_data(df, '2009-01-01', '2019-01-01')
print(df.head())
# 
stock_dimension = len(df.tic.unique()) # 2
state_space = 1 + 2*stock_dimension + \
    len(config.TECHNICAL_INDICATORS_LIST)*stock_dimension # 23 
print("stock_dimension: {}, state_space: {}".format(stock_dimension, state_space))
env_kwargs = {
    #"stock_dim": stock_dimension,
    "hmax": 100, 
    "initial_amount": 1e6, 
    "buy_cost_pct": 0.001,
    "sell_cost_pct": 0.001,
    #"reward_scaling": 1e-4,
    #"state_space": state_space,
    #"action_space": stock_dimension,
    #"tech_indicator_list": config.TECHNICAL_INDICATORS_LIST
}
# test env
e_train_gym = StockLearningEnv(df=df, **env_kwargs)
## multi-step test
observation = e_train_gym.reset()      
count = 0
for t in range(10):
    action = e_train_gym.action_space.sample()  
    observation, reward, done, info = e_train_gym.step(action)  
    if done:
        break
    count+=1
    time.sleep(0.2)      
print("observation: ", observation)
print("action: ", action)
print("reward: {}, done: {},info: {}".format(reward, done, info))
# test model
env_train, _ = e_train_gym.get_sb_env()
print(type(env_train))
##register_policy('CustomPolicy', CustomPolicy)
##register_policy('CustomActorCriticPolicy', CustomActorCriticPolicy)
agent = DRL_Agent(env= env_train)
SAC_PARAMS = {
    "batch_size": 128,
    "buffer_size": 1000000,
    "learning_rate": 0.0001,
    "learning_starts": 100,
    "ent_coef": "auto_0.1"
}
model_sac = agent.get_model("sac", model_kwargs=SAC_PARAMS)
trained_sac = agent.train_model(
    model=model_sac,
    tb_log_name='sac', 
    total_timesteps= 50000
)


qgallouedec commented 2 years ago

As explained in the issue template and three times in #982, we can't help you if you don't provide a well-formatted, minimal code example that reproduces the error you encounter. Are you having trouble understanding what this means?

Zero1366166516 commented 2 years ago

Sorry, I only registered recently and didn't understand the rules of GitHub.

Zero1366166516 commented 2 years ago

I'll try my best to provide a good minimal reproducible code example. Thank you for your selfless help! Following the example, I created a class (CustomCNN) as the feature extractor and defined:

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    net_arch=dict(qf=[256, 256], pi=[256, 256])
)

The following is the class I modified from the example, because I want a feature extractor that extracts features from a time series using a CNN:

def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 1):
    super(CustomCNN, self).__init__(observation_space, features_dim)
    # We assume CxHxW images (channels first)
    # Re-ordering will be done by pre-preprocessing or wrapper
    n_input_channels = observation_space.shape[0]
    self.cnn = nn.Sequential(
        nn.Conv1d(self.features_dim, n_input_channels, kernel_size=1, stride=1, padding=0),
        nn.ReLU(),
        nn.Conv1d(n_input_channels, self.features_dim, kernel_size=1, stride=1, padding=0),
        nn.ReLU(),
        nn.Flatten(),
    )
    with th.no_grad():
        n_flatten = self.cnn(
            th.as_tensor(observation_space.sample()[None]).float()
        ).shape[1]
    self.linear = nn.Sequential(nn.Linear(n_flatten, self.features_dim), nn.Tanh())

Now the problem is in the forward function. On the first sampling, the observation has shape [1, 1, 13]; on the second, it has shape [1, 128, 13]. So I added a check: if the shape changes, redefine the nn.Sequential.

def forward(self, observations: th.Tensor) -> th.Tensor:
    n_flatten = np.array(observations).shape[1]
    features_dim = np.array(observations).shape[0]
    print(features_dim, n_flatten)
    if features_dim != 1:
        self.cnn = nn.Sequential(
            nn.Conv1d(features_dim, n_flatten, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Conv1d(n_flatten, features_dim, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.Tanh())
    return self.linear(self.cnn(observations))

The following error occurred:

return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x128 and 1x256)

I know there is a fully connected layer after the feature extractor, but why does [1, 1, 13] work while [1, 128, 13] raises an error? I'm puzzled. Also, could you add a more detailed introduction to the documentation about the feature extractor and the fully connected layers? I don't know whether what I'm writing now meets the requirements; sorry for causing you trouble.

Zero1366166516 commented 2 years ago

I want to use the MlpPolicy policy network with a CNN as the feature extractor. I don't know whether this approach is feasible, or whether I must customize the policy network to achieve this.

qgallouedec commented 2 years ago

I want to use the MlpPolicy policy network with a CNN as the feature extractor.

I think you don't have a clear understanding of policies in SB3: the feature extractor is the first stage of any policy. You should read the documentation: SB3 Policy
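
Roughly, the wiring looks like this (a minimal, self-contained sketch rather than code from this issue: the toy extractor, the placeholder environment "Pendulum-v1" and the layer sizes are illustrative only). The custom extractor is passed via policy_kwargs, and MlpPolicy builds its fully connected pi/qf layers on top of the extractor's output:

import gym
import torch as th
import torch.nn as nn
from stable_baselines3 import SAC
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class LinearExtractor(BaseFeaturesExtractor):
    """Toy feature extractor: flatten the observation, then apply one linear layer."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 16):
        super().__init__(observation_space, features_dim)
        n_input = int(th.prod(th.as_tensor(observation_space.shape)))
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(n_input, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.net(observations)

policy_kwargs = dict(
    features_extractor_class=LinearExtractor,
    features_extractor_kwargs=dict(features_dim=16),
    # for off-policy algorithms, net_arch is a dict with actor (pi) and critic (qf) layer sizes
    net_arch=dict(qf=[256, 256], pi=[256, 256]),
)

# "Pendulum-v1" is only a placeholder environment for this sketch
model = SAC("MlpPolicy", "Pendulum-v1", policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=1_000)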

I don't know whether what I'm writing now meets the requirements; sorry for causing you trouble.

It doesn't. You need to provide code that I can just copy-paste and run to reproduce the error. WARNING: this code has to be MINIMAL: if one line can be removed without removing the error, your code is not minimal.

qgallouedec commented 2 years ago

I also understand that you want to implement a feature extractor with Conv1D. If that's the case, you have to check the other issues that discuss this topic.
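
For what it's worth, a rough sketch of such a Conv1d extractor (assuming observations of shape (n_channels, seq_len); the channel widths and kernel sizes are illustrative only, not taken from this issue). The key point is that the layers are built once in __init__ from observation_space.shape, so the batch dimension never enters the layer sizes:

import gym
import torch as th
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class Conv1dExtractor(BaseFeaturesExtractor):
    """Applies Conv1d along the time axis of a (n_channels, seq_len) observation."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 64):
        super().__init__(observation_space, features_dim)
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv1d(n_input_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with one dummy forward pass on a single sample
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        # observations has shape (batch_size, n_channels, seq_len); the batch size varies
        # (1 during rollout collection, batch_size when sampling the replay buffer),
        # so the layers are never rebuilt here.
        return self.linear(self.cnn(observations))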

Zero1366166516 commented 2 years ago

Thank you for your help. I'm writing a DRL algorithm for stock portfolio returns. The idea is to try MLP, CNN and LSTM as feature extractors and compare which works best for financial time series.

I have found the cause of the problem with the custom CNN feature extractor above, and it has been solved. Thank you again for your help.

Excuse me, could you add an example of an LSTM feature extractor to the documentation?

qgallouedec commented 2 years ago

Please provide the fix so that other people can benefit from it.

qgallouedec commented 2 years ago

Excuse me, could you add an example of an LSTM feature extractor to the documentation?

If you think that the documentation can be improved, for example by adding more examples, feel free to open a PR.

Zero1366166516 commented 2 years ago

OK, I'll send the modified code. This is my rewrite based on the example program: a custom CNN feature extractor used with the MlpPolicy policy network.

class CustomCNN(BaseFeaturesExtractor):
    """
    :param observation_space: (gym.Space)
    :param features_dim: (int) Number of features extracted.
        This corresponds to the number of units for the last layer.
    """

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 1):
        super(CustomCNN, self).__init__(observation_space, features_dim)
        # We assume CxHxW images (channels first)
        # Re-ordering will be done by pre-preprocessing or wrapper
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv1d(self.features_dim, n_input_channels, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Conv1d(n_input_channels, self.features_dim, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]
            #print("n_flatten", n_flatten)
            ##print("cnn", self.cnn)

        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.Tanh())
    def forward(self, observations: th.Tensor) -> th.Tensor:
        with th.no_grad():
            n_flatten = np.array(observations).shape[-1]
            features_dim = np.array(observations).shape[-2]
            #print(features_dim, n_flatten, np.array(observations).shape)
            i = 0
            j = 0
            if features_dim != 1:
                self.cnn = nn.Sequential(
                    nn.Conv1d(features_dim, n_flatten, kernel_size=1, stride=1, padding=0),
                    nn.ReLU(),
                    nn.Conv1d(n_flatten, features_dim, kernel_size=1, stride=1, padding=0),
                    nn.ReLU(),
                    nn.Flatten(),
                )
                self.linear = nn.Sequential(nn.Linear(n_flatten, 1), nn.Tanh())
                i += 1
            else:
                j += 1
                self.cnn = nn.Sequential(
                    nn.Conv1d(features_dim, n_flatten, kernel_size=1, stride=1, padding=0),
                    nn.ReLU(),
                    nn.Conv1d(n_flatten, features_dim, kernel_size=1, stride=1, padding=0),
                    nn.ReLU(),
                    nn.Flatten(),
                )
                self.linear = nn.Sequential(nn.Linear(n_flatten, 1), nn.Tanh())
        return self.linear(self.cnn(observations))

The following is the code in the main program:

    policy_kwargs = dict(
        features_extractor_class=CustomCNN,
        net_arch=dict(qf=[128, 128], pi=[256, 256])
    )

    def get_model(
        self,
        model_name: str,
        policy: str = "MlpPolicy",
        # policy: str = "MultiInputPolicy",
        policy_kwargs: dict = policy_kwargs,
        model_kwargs: dict = None,
        verbose: int = 1
    ) -> Any:

        # print("set Debug!")

        if model_name not in MODELS:
            raise NotImplementedError("NotImplementedError")

        if model_kwargs is None:
            model_kwargs = MODEL_KWARGS[model_name]

        if "action_noise" in model_kwargs:
            n_actions = self.env.action_space.shape[-1]                         
            model_kwargs["action_noise"] = NOISE[model_kwargs["action_noise"]](
                mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)
            )
        print(model_kwargs)    
        model = MODELS[model_name](       
            policy=policy,
            env=self.env,
            tensorboard_log="{}/{}".format(config.TENSORBOARD_LOG_DIR, model_name),
            verbose=verbose,
            policy_kwargs=policy_kwargs,
            **model_kwargs
        )
        return model
araffin commented 2 years ago

Excuse me, could you add an example of an LSTM feature extractor to the documentation?

As stated in the documentation, only RecurrentPPO (from SB3 contrib) has LSTM support.
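
For reference, a minimal usage sketch (sb3-contrib must be installed separately, and "Pendulum-v1" is only a placeholder environment):

from sb3_contrib import RecurrentPPO

# "MlpLstmPolicy" is the recurrent counterpart of MlpPolicy
model = RecurrentPPO("MlpLstmPolicy", "Pendulum-v1", verbose=1)
model.learn(total_timesteps=5_000)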

Closing as the original question was answered.


The following is an automated answer:

As you seem to be trying to apply RL to stock trading, I must also warn you about it. Here is a recommendation from a former professional trader:

Retail trading, retail trading with ML, and retail trading with RL are bad ideas for almost everyone to get involved with.