[Question] After train, how to test our own environment? #306

Open zhou-ting-hub opened 6 months ago

zhou-ting-hub commented 6 months ago


Thank you for your work. I successfully ran the following training code:

cd examples
python --algo CPO --env Custom0-v0

How should I test the trained model next?


Method1: run the eval command, which we modified to omnisafe eval ./examples/runs/CPO-{Custom0-v0}

Method2: run the file ./examples/

LOG_DIR = /examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21

So which one is right, or what is the right way to test the trained model? What is the difference between Method1 and Method2? Is the trained model saved in examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21? Thank you~


Gaiejj commented 6 months ago

Yes, the location where the experimental results are saved is examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21. In fact, both methods of evaluating trained policies are fine. The omnisafe eval command provides more extensive command line information, making it more suitable for beginners to use simply; while examples/ offers an interface at the code level for evaluation, facilitating customization for users. For example, using loop logic (e.g. for xxx) to iterate and evaluate all results in a specific folder. If you encounter difficulties in the process of using these two methods for evaluation, feel free to continue providing feedback.
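
The loop-style evaluation mentioned above could be sketched roughly as follows. This is a minimal sketch, not OmniSafe's own code: it assumes each run directory keeps its checkpoints as `.pt` files in a `torch_save` subfolder, and `evaluate_fn` is a hypothetical callback standing in for whichever evaluation call you use.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Iterable


def find_checkpoints(log_dir: str) -> list[str]:
    """Return the checkpoint file names under <log_dir>/torch_save, sorted by name."""
    save_dir = Path(log_dir) / "torch_save"  # assumed directory layout
    if not save_dir.is_dir():
        return []
    return sorted(p.name for p in save_dir.iterdir() if p.is_file() and p.suffix == ".pt")


def evaluate_all(log_dirs: Iterable[str], evaluate_fn: Callable[[str, str], None]) -> int:
    """Call evaluate_fn(log_dir, model_name) for every checkpoint and return the count."""
    count = 0
    for log_dir in log_dirs:
        for model_name in find_checkpoints(log_dir):
            evaluate_fn(log_dir, model_name)  # hypothetical hook: plug in your evaluator here
            count += 1
    return count
```

In practice, `evaluate_fn` would wrap the evaluator calls from the script under ./examples/, so that one loop covers every saved model in a results folder.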

zhou-ting-hub commented 6 months ago

Thank you. To test, I have also tried the file ./examples/ and, that has been solved, but I have three more problems:

Q1: When I use Method2: run the file ./examples/

LOG_DIR = /examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21

(1) For the same LOG_DIR, why are the results different? In theory, evaluating the same model should give the same results. (2) Does the evaluation use the current environment, or only the model saved in LOG_DIR?

Q2: To run the file ./examples/, we want to use the GPU, and the environment has the GPU build of torch installed, but the error is as follows:


Q3: In, in def step,

(1) For the cost function, our goal is to satisfy the equation limits self.P_d+self.P_EL+self.P_EB+self.P_ES=self.P_FC+self.P_PV+self.P_buy

Why is self.P_error still very large after training to convergence? In theory, I think the cost should tend to 0. Otherwise, how should the cost be designed?
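
For reference, one way to express the balance equation above as a non-negative cost is the absolute gap between its two sides. This is only a sketch under the assumption that P_error is meant to measure that gap; the function name and the plain-float arguments are hypothetical and do not come from the environment code.

```python
def power_balance_error(
    p_d: float, p_el: float, p_eb: float, p_es: float,
    p_fc: float, p_pv: float, p_buy: float,
) -> float:
    """Absolute mismatch of P_d + P_EL + P_EB + P_ES = P_FC + P_PV + P_buy."""
    left_side = p_d + p_el + p_eb + p_es   # left-hand side of the equation
    right_side = p_fc + p_pv + p_buy       # right-hand side of the equation
    return abs(left_side - right_side)     # 0 exactly when the balance holds
```

A cost of this form is 0 only when the equality holds exactly, so with continuous actions a learned policy will usually leave a small residual rather than reach 0.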

(2) For terminated and truncated, should stepping stop once truncated is true?

 terminated=torch.as_tensor( self.current_step == self.iterations)
 truncated=torch.as_tensor(self.current_step > 92)
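
In Gymnasium-style environments, an episode ends as soon as either flag is true, and the caller then resets the environment. The sketch below mimics the two conditions above with plain integers; `episode_end_step` and `time_limit` are hypothetical stand-ins for `self.iterations` and the constant 92.

```python
def rollout_length(episode_end_step: int, time_limit: int, hard_cap: int = 10_000) -> int:
    """Count the steps taken until terminated or truncated becomes true."""
    step = 0
    while step < hard_cap:  # safety cap so the sketch always halts
        step += 1
        terminated = step == episode_end_step  # natural end of the task
        truncated = step > time_limit          # time limit exceeded
        if terminated or truncated:
            break
    return step
```

Note that with the strict `>` comparison the truncation fires on step 93, not on step 92, so if `self.iterations` is larger than 93 the terminated branch is never reached.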


Gaiejj commented 6 months ago

Q1 Yes, when we evaluate the trained policy, we only import the trained policy and use a randomly initialized environment. This will cause the results of each evaluation to be inconsistent. If you need the results of each evaluation to be consistent, you need to make the following change in omnisafe/

SEED=5 # for example
from import seed_all

Then in the method __load_model_and_env, after making the env by self._env = make(**env_kwargs), add:


Please note, that to ensure the rigor of the evaluation, use a different random seed for the evaluation than the one used during training.
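
The effect of fixing the seed can be illustrated without OmniSafe. In this minimal sketch, `sample_initial_state` is a hypothetical stand-in for an environment's randomized reset, and the two seed constants are example values only.

```python
from __future__ import annotations

import random

TRAIN_SEED = 0  # example value, e.g. the seed used during training
EVAL_SEED = 5   # example value, deliberately different for evaluation


def sample_initial_state(seed: int, dim: int = 3) -> list[float]:
    """Draw a pseudo-random initial state that depends only on the seed."""
    rng = random.Random(seed)  # private generator, no global state involved
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]
```

Calling `sample_initial_state(EVAL_SEED)` twice returns the same state, which is why a seeded evaluation is repeatable, while `TRAIN_SEED` and `EVAL_SEED` lead to different states.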

Q2 Please additionally specify the GPU id. e.g. cuda:0


zhou-ting-hub commented 5 months ago

THANK YOU VERY MUCH! But we face a new problem when we run or. Why are the training results the same after we modify our environment, given the same random seed setting in CPO.yaml? For example, after we modified the reward function in the environment, the training results were the same as before; after we modified the ranges of some variables in the environment, the training results differed only slightly in range and followed the same trend as before. In short, training did not learn the modified environment but kept the original decision-making pattern. Does this depend on the seed? Should we change the random seed, given that the seed is set to a fixed constant in CPO.yaml?

Gaiejj commented 5 months ago

I believe this is due to the environment's random seed mechanism. The environment currently supported by OmniSafe is Safety-Gymnasium, which is based on Gymnasium, commonly used in the reinforcement learning community. In Gymnasium's random seed mechanism, the environment generates a series of random numbers from the initial random seed and uses them as seeds for subsequent resets, instead of using the same seed for every reset. For more details please refer to here:
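
The mechanism described above can be sketched as follows. This is a simplified illustration, not Gymnasium's actual implementation; `SeedChain` is a hypothetical class, and the returned integer stands for the per-episode randomness.

```python
from __future__ import annotations

import random


class SeedChain:
    """An initial seed creates one generator; later resets keep drawing from it."""

    def __init__(self) -> None:
        self._rng = random.Random()

    def reset(self, seed: int | None = None) -> int:
        if seed is not None:
            # Only an explicit seed restarts the random stream.
            self._rng = random.Random(seed)
        # Each reset consumes the stream, so episodes differ from each other
        # but the whole sequence is reproducible from the initial seed.
        return self._rng.randrange(2**31)
```

Two chains that both start with `reset(seed=0)` produce the same sequence of episodes, which is why a fixed initial seed gives identical runs even though individual episodes differ.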

zhou-ting-hub commented 5 months ago

Thank you, but in our environment, we reference the seed setting in as follows:

    def reset(
        self,
        seed: int | None = None,
        options: dict[str, Any] | None = None,
    ) -> tuple[torch.Tensor, dict]:
        if seed is not None:
            self.set_seed(seed)  # only runs when a seed is passed in
        obs = torch.as_tensor(self._observation_space.sample())
        self._count = 0
        return obs, {}

    def set_seed(self, seed: int) -> None:
        random.seed(seed)

Also there is seed=0 in CPO.yaml; how does it work? Is the if seed is not None: branch skipped because seed is None in def reset? Does random.seed(seed) mean the same thing? When we change it to random.randint(1,10), it does not work. So what should we modify, the seed setting in our environment or somewhere else, to obtain different random results?


Gaiejj commented 5 months ago

You can set a simple seed logic, such as adding 10 each time. That is, when you first reset the environment, the seed you pass in is 0, and for subsequent resets, you only need to pass in None, allowing the environment to automatically reset with seeds 10, 20, 30, and so on. Similarly, when your initial seed is 5, the environment will automatically reset in the order of 15, 25, 35, and so on. This logic can be easily implemented in the reset function.
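
The seed logic described above might look like this in isolation. It is a minimal sketch: `IncrementingSeedEnv` is a hypothetical class that keeps only the seed bookkeeping, and a real environment would also build and return an observation.

```python
from __future__ import annotations

import random


class IncrementingSeedEnv:
    """Keeps only the seed bookkeeping of an environment's reset()."""

    SEED_STEP = 10  # increment applied on every reset without an explicit seed

    def __init__(self) -> None:
        self._current_seed = 0
        self._rng = random.Random(self._current_seed)

    def reset(self, seed: int | None = None) -> int:
        if seed is not None:
            self._current_seed = seed             # explicit seed: start from it
        else:
            self._current_seed += self.SEED_STEP  # otherwise advance by 10
        self._rng = random.Random(self._current_seed)
        return self._current_seed
```

Starting with `reset(seed=0)` and then calling `reset()` repeatedly gives seeds 0, 10, 20, 30; starting with `reset(seed=5)` gives 5, 15, 25, 35. The sequence is still fully determined by the initial seed, so two runs with the same initial seed remain identical.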

zhou-ting-hub commented 5 months ago

There are two places related to the seed, so which seed actually takes effect?

  1. In our environment, def reset() and def set_seed() are modified as follows:

    def __init__(self):
        self._initial_seed = 0  
        self._current_seed = self._initial_seed  
    def reset(
        seed: int | None = None,
        options: dict[str, Any] | None = None,
     ) -> tuple[torch.Tensor, dict]:
        if seed is not None:
            self._current_seed = seed            
            self._current_seed += 10
        obs = torch.as_tensor(self._observation_space.sample())
        self._count = 0
        return obs, {}
    def set_seed(self, seed: int) -> None:

    Following your advice, we made the modification above. As a result, the seed differs between episodes (for example seed=10, 20, 30, 40, 50 for five episodes), but on the next run the seeds are again seed=10, 20, 30, 40, 50, so the training result is still 'Episode reward: 17159.260873794556', the same as with seed=None or other numbers. So we think the seed setting in def reset() has no effect.

The following call chain is related to def set_seed(), but we could not find a solution:

(1) In omnisafe\algorithms\, self._init_env()
(2) In omnisafe\algorithms\on_policy\base\, self._env: OnPolicyAdapter = OnPolicyAdapter(
(3) In omnisafe\adapter\, super().__init__(env_id, num_envs, seed, cfgs)
(4) In omnisafe\adapter\, self._env.set_seed(seed)

  2. In omnisafe\algorithms\, it uses cfgs.seed, which is seed: 0 in CPO.yaml:
        assert hasattr(cfgs, 'seed'), 'Please specify the seed in the config file.'
        self._seed: int = int(cfgs.seed) + distributed.get_rank() * 1000

    we modified the above as:

        assert hasattr(cfgs, 'seed'), 'Please specify the seed in the config file.'

    or deleted the seed_all(self._seed) call, as follows:

        assert hasattr(cfgs, 'seed'), 'Please specify the seed in the config file.'
        self._seed: int = int(cfgs.seed) + distributed.get_rank() * 1000

We find that the training results can then differ. We think the second place is the one that takes effect, that is, the seed setting in CPO.yaml. Is that right?
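
The seed formula quoted above, and the reason a fixed config seed reproduces a run, can be sketched as follows. `process_seed` mirrors the quoted line, while `simulate_run` is a hypothetical stand-in for anything that draws random numbers after the global seed is set.

```python
from __future__ import annotations

import random


def process_seed(cfg_seed: int, rank: int = 0, stride: int = 1000) -> int:
    """Mirror of int(cfgs.seed) + distributed.get_rank() * 1000."""
    return int(cfg_seed) + rank * stride


def simulate_run(seed: int, steps: int = 5) -> list[float]:
    """Hypothetical stand-in for a training run driven by one seeded generator."""
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(steps)]
```

With the same config seed and rank, `simulate_run` returns the same numbers on every call; changing the config seed changes them.
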
Gaiejj commented 5 months ago

I think I need to clarify the meaning of the seed mechanism:

zhou-ting-hub commented 5 months ago

Thank you, I understand what you mean.

zhou-ting-hub commented 4 months ago

The training reward should increase, cost should decrease.

Q1: Why is the reward on a downward trend? Our goal is to minimize economic costs, so we set the reward to a negative value, such as reward=-(self.price_e*self.P_buy+self.price_q*self.Q_buy)*1e-4

Q2: Does CPO only support one constraint? Our setting is cost=torch.as_tensor(max(max(0,self.Q_buy-Max_Q_buy),0-self.Q_buy)+max(max(0,self.P_buy-Max_P_buy),0-self.P_buy)), which contains two constraints. Is this related to the decrease in rewards?

Thank you for your reply!

Gaiejj commented 4 months ago

I'm sorry, but I'm not an expert in applying SafeRL to trading transactions. You need to focus on whether maximizing reward and minimizing cost can coexist simultaneously. For instance, in the Safety-Gymnasium supported by OmniSafe, specifically in SafetyPointGoal1-v0, maximizing reward (reaching the goal) and minimizing cost (avoiding collisions) can coexist, meaning the agent can choose a safe path to the goal. If the environment is designed to meet this condition, then it might be because the default parameters of CPO are not well suited to your task, and you can use examples/benchmarks/ to search for the optimal hyperparameters.

OmniSafe's CPO currently does not support multiple constraints. You can try to handle this by summing up the two cost functions or taking their average, depending on their actual meanings.
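
Folding the two constraints into a single scalar cost, by sum or by average, could be sketched as follows. The helper names are hypothetical and plain floats are used instead of tensors; the violation term follows the max(max(0, x - upper), 0 - x) form from the question.

```python
def bound_violation(value: float, lower: float, upper: float) -> float:
    """How far value lies outside [lower, upper]; 0.0 when it is inside."""
    return max(0.0, value - upper, lower - value)


def combined_cost(
    q_buy: float, p_buy: float, max_q_buy: float, max_p_buy: float, average: bool = False
) -> float:
    """Merge both bound violations into one scalar cost (sum or mean)."""
    costs = [
        bound_violation(q_buy, 0.0, max_q_buy),
        bound_violation(p_buy, 0.0, max_p_buy),
    ]
    total = sum(costs)
    return total / len(costs) if average else total
```

If the two quantities have very different scales, the sum is dominated by the larger one, so rescaling each violation before combining may be worth considering.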

zhou-ting-hub commented 4 months ago

Some problems about the reward and cost learning curves:

We run the, with the number of epochs set to 4000; the results of agent.plot() and agent.render() are saved in D:\3omnisafe-main\examples\runs\CPO-{Custom0-v0}\seed-001-2024-04-26-17-29-33.


Q1: In def plot() of C:\Users\zyt\.conda\envs\omnisafegpu\Lib\site-packages\omnisafe\, the plot results zyt.png is saved in D:\3omnisafe-main\examples\runs\CPO-{Custom0-v0}\seed-001-2024-04-26-17-29-33\zyt.png

We find that the convergence curves in TensorBoard are the same as those in progress.csv, which reflects the training process, as follows:




In conclusion, the convergence curves in TensorBoard match progress.csv, which reflects the training process, but differ from zyt.png above. So what does the zyt.png produced by agent.plot() reflect? Is it not the convergence curve?

Q2: In def render() of C:\Users\zyt\.conda\envs\omnisafegpu\Lib\site-packages\omnisafe\, the render results are saved in D:\3omnisafe-main\examples\runs\CPO-{Custom0-v0}\seed-001-2024-04-26-17-29-33\video,

Because we deleted the mp4 output in in Lib\site-packages\gymnasium\utils due to an error, we only obtain result.txt in video\epoch1000\epoch2000\epoch3000\epoch4000. What does the mp4 refer to? Is it necessary?



At the same time, we obtain 'myplot_multitimegpu.csv', which holds the def render() results in our environment.


In conclusion, we set 'save_model_freq': 1000 in CPO.yaml, so the trained models are saved in torch_save\epoch1000\epoch2000\epoch3000\ and, correspondingly, the render results are saved in video\epoch1000\epoch2000\epoch3000\epoch4000\result.txt. We find that the render results in video\epoch4000\result.txt are terrible. Why has the training curve in TensorBoard (or in progress.csv) converged while the reward and cost values in video\epoch4000\result.txt are much worse than the training curve?

For example, the cost on the training curve in TensorBoard (or in progress.csv) has converged to 60, but the cost in the render result.txt is as large as 17870 (correspondingly, the scheduling decisions for energy power in myplot_multitimegpu.csv exceed the limits set in the cost function; the sum of the exceedances in one epoch is 17870). Likewise, the reward on the training curve has converged to -432, but the reward in the render result.txt is as low as -648.

When we changed 'save_model_freq' to 100, the render results were saved in video\epoch100\200...\3900\4000\result.txt, and the results in video\epoch4000\result.txt were still terrible, so the problem does not appear to be related to 'save_model_freq'.

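The core problem is that running agent.learn() in train_from_custom_dict.py trains very well: we have seen the cost decrease and converge to 0, and the reward also rises and converges well. But when we run agent.render() in train_from_custom_dict.py, the results in video\epoch4000\result.txt and myplot_multitimegpu.csv are both poor and do not match the converged training curves; the decision variables clearly exceed the cost constraints by a large margin.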
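
To put a number on the gap described above, one option is to average the last few rows of a metric column in progress.csv and compare that value with result.txt. This is a minimal sketch; `tail_mean` is a hypothetical helper, and the column name is passed in by the caller because the exact header (for example something like `Metrics/EpCost`) is an assumption that should be checked against your own CSV.

```python
import csv


def tail_mean(csv_path: str, column: str, last_n: int = 10) -> float:
    """Average the last `last_n` non-empty values of one CSV column."""
    with open(csv_path, newline="") as handle:
        values = [
            float(row[column])
            for row in csv.DictReader(handle)
            if row.get(column) not in (None, "")  # skip missing or blank cells
        ]
    if not values:
        raise ValueError(f"column {column!r} has no values in {csv_path}")
    tail = values[-last_n:]
    return sum(tail) / len(tail)
```

Keep in mind that training metrics are typically averaged over many stochastic episodes, while a single rendered episode is one sample, so the comparison is more meaningful over several evaluation episodes.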





zhou-ting-hub commented 4 months ago

Looking forward to your reply about the above problems, thank you~

Gaiejj commented 4 months ago

I apologize for the late reply. I will address your questions one by one: (1) The curve data in zyt.png is consistent with that in TensorBoard. As for the reason for the discrepancy in visual presentation, it is because TensorBoard automatically ignores excessively high outlier values. For instance, in the TensorBoard chart you showed, the EpCost scale is around 60-180; whereas in zyt.png, the EpCost scale is 0-14000. You only need to use the following code to set the display range for the axis in line 160 of omnisafe/utils/

sub_figures[1].set_ylim(COST_LOWER, COST_UPPER)
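
One way to pick values for COST_LOWER and COST_UPPER is to clip at a percentile so that a few outliers do not stretch the axis. This is a minimal sketch using a nearest-rank percentile; `robust_ylim` is a hypothetical helper, not part of OmniSafe or Matplotlib.

```python
from __future__ import annotations

from typing import Sequence


def robust_ylim(
    values: Sequence[float], lower_pct: float = 0.0, upper_pct: float = 95.0
) -> tuple[float, float]:
    """Return (lower, upper) axis limits taken at the given percentiles."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)

    def nearest_rank(pct: float) -> float:
        # Map the percentile onto an index of the sorted list.
        index = round(pct / 100.0 * (len(ordered) - 1))
        return ordered[index]

    return nearest_rank(lower_pct), nearest_rank(upper_pct)
```

For a cost series that mostly lies between 60 and 180 but contains one spike of 14000, the default 95th percentile keeps the upper limit within the bulk of the data.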

(2) I noticed that your main concern is why the agent's performance during render() is inconsistent with training. Addressing your three questions, here are my explanations:

a. The original design intention of render() is to visualize evaluation results, so it serves both as evaluation and visualization. Evaluation includes two aspects: 1. The agent generates actions using a deterministic strategy, not a random strategy. 2. The random seed in the agent's evaluation environment is different from that in the training environment.

b. If your evaluation results are very inconsistent with training, you might consider changing the deterministic strategy to a stochastic strategy, like:

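act = self._actor.predict(
    obs.reshape(-1, obs.shape[-1]),  # to make sure the shape is (1, obs_dim)
    deterministic=False,  # use the stochastic strategy instead of the deterministic one
).reshape(-1)  # to make sure the shape is (act_dim,)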

or carefully check whether the environment imported during render() is consistent with the training environment.
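
The difference between the two strategies can be sketched for a one-dimensional Gaussian policy. This is an illustration only, assuming the policy outputs a mean and a standard deviation; `select_action` is a hypothetical function, not the OmniSafe actor.

```python
import random


def select_action(mean: float, std: float, deterministic: bool, rng: random.Random) -> float:
    """Return the policy mean, or a sample drawn around it."""
    if deterministic:
        return mean              # evaluation-style action: always the mean
    return rng.gauss(mean, std)  # training-style action: mean plus exploration noise
```

Training returns are collected with sampled actions, so if the mean action behaves very differently from the samples (for example after clipping to action bounds), deterministic evaluation can diverge from the training curve.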

zhou-ting-hub commented 3 months ago

Thank you. For (2)b, we made that change in def render() and def evaluate() in as follows, but when we run evaluator.render(num_episodes=1) and evaluator.evaluate(num_episodes=1) in, the results are still very inconsistent with training.

And for (2)a, you said "The random seed in the agent's evaluation environment is different from that in the training environment." We set seed=0 in, the same as seed: 0 in CPO.yaml used in training, but when we run evaluator.render(num_episodes=1) and evaluator.evaluate(num_episodes=1) in, the results are both terrible.
