RLGC-Project / RLGC

An open-source platform for applying Reinforcement Learning for Grid Control (RLGC)

DQN can't find a good policy #11

Open frostyduck opened 4 years ago

frostyduck commented 4 years ago

Following your advice, I switched from the OpenAI Baselines DQN to Stable-Baselines for the Kundur system training.

# TensorFlow 1.x and Stable-Baselines imports; CustomDQNPolicy,
# SaveOnBestTrainingRewardCallback, storedData, savedModel and model_name
# are defined elsewhere in my script.
import tensorflow as tf
from stable_baselines import DQN

def main(learning_rate, env):
    # reset the default TF graph before building a new model
    tf.reset_default_graph()
    graph = tf.get_default_graph()

    model = DQN(CustomDQNPolicy, env, learning_rate=learning_rate, verbose=0)
    callback = SaveOnBestTrainingRewardCallback(check_freq=1000, storedData=storedData)
    time_steps = 900000
    model.learn(total_timesteps=int(time_steps), callback=callback)

    print("Saving final model to: " + savedModel + "/" + model_name + "_lr_%s_90w.pkl" % str(learning_rate))
    model.save(savedModel + "/" + model_name + "_lr_%s_90w.pkl" % str(learning_rate))

However, after 900,000 training steps the DQN agent still cannot find a good policy. Please see the average reward progress plot:

https://www.dropbox.com/preview/DQN_adaptivenose.png?role=personal

I used the following environment settings:

case_files_array.append(folder_dir +'/testData/Kundur-2area/kunder_2area_ver30.raw')
case_files_array.append(folder_dir+'/testData/Kundur-2area/kunder_2area.dyr')
dyn_config_file = folder_dir+'/testData/Kundur-2area/json/kundur2area_dyn_config.json'
rl_config_file = folder_dir+'/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json'

My hypothesis is that in the baseline scenario kunder_2area_ver30.raw (without additional system loading), the short circuit might not lead to loss of stability during the simulation. Therefore, the DQN agent (perhaps) converges to a "no action" policy so as not to incur the actionPenalty = 2.0. According to the reward progress plot, during training the agent never finds a policy better than a mean reward of 603.05, and in testing a mean_reward of 603.05 corresponds to the "no action" policy (please see the figure below):

https://www.dropbox.com/preview/no%20actions%20case.png?role=personal

However, this is only my hypothesis; I may be wrong. I plan to try scenarios with increased loading so that loss of stability reliably occurs during the simulation.

Originally posted by @frostyduck in https://github.com/RLGC-Project/RLGC/issues/9#issuecomment-642406121
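One way to test the "no action" hypothesis would be to load a saved checkpoint with Stable-Baselines and count how many non-zero actions a deterministic rollout takes. This is a minimal sketch, not from the thread: it assumes the checkpoint path from the log output later in this thread, that `env` is the already-constructed RLGC Gym environment, and that action index 0 means "no action".

from stable_baselines import DQN

# Assumptions: the .pkl path is the one produced by model.save() (taken from the
# later log output), `env` is the RLGC Gym environment used for training,
# and action index 0 corresponds to "no action".
model = DQN.load("./previous_model/kundur2area_multistep_581to585_bus2_90w_lr_0.0001_90w.pkl")

obs = env.reset()
done = False
total_reward = 0.0
nonzero_actions = 0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    if int(action) != 0:
        nonzero_actions += 1
    obs, reward, done, info = env.step(action)
    total_reward += reward

print("episode return: %.2f, non-zero actions: %d" % (total_reward, nonzero_actions))

If the episode return lands near the plateau value while the non-zero action count stays at zero, that would support the "no action" interpretation.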

qhuang-pnl commented 4 years ago

Sorry, I cannot open the figures in your Dropbox; you probably did not make them publicly accessible. If possible, please post them directly here or send them to my email: qiuhua dot huang at pnnl dot gov.

Is this result based on only one random seed? You should also try different random seeds; it can make a huge difference.

qhuang-pnl commented 4 years ago

I went through your code and results. The input and configuration files (.raw and .json) and the NN structure are different from our original testing code: https://github.com/RLGC-Project/RLGC/blob/master/src/py/trainKundur2areaGenBrakingAgent.py

I would suggest changing them to match our original testing code, because we don't know the performance of other combinations/settings.

And at least 3 random seeds should be tried.

frostyduck commented 4 years ago

Thank you! I initially tried the code with your original settings (raw and json files, NN structure), and only then began to change them. However, I will carefully repeat the runs with the original settings.

And at least 3 random seeds should be tried.

Do you mean trying different values of np.random.seed()?

qhuang-pnl commented 4 years ago

Set the 'seed' parameter of the DQN class according to https://stable-baselines.readthedocs.io/en/master/modules/dqn.html
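For example (a minimal sketch, not from the thread; `env`, `CustomDQNPolicy`, the seed values, and the save paths are assumptions based on the earlier snippets):

from stable_baselines import DQN

# Train the same configuration with several random seeds and save each run
for seed in [1, 2, 3]:
    model = DQN(CustomDQNPolicy, env, learning_rate=1e-4, verbose=0, seed=seed)
    model.learn(total_timesteps=900000)
    model.save("kundur2area_dqn_seed_%d" % seed)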

frostyduck commented 4 years ago

I have repeated the training of the Kundur system using the original settings. I used Stable-Baselines (DQN agent) instead of OpenAI Baselines.

import tensorflow as tf
from stable_baselines import DQN
from stable_baselines.deepq.policies import FeedForwardPolicy

case_files_array.append(folder_dir + '/testData/Kundur-2area/kunder_2area_ver30.raw')
case_files_array.append(folder_dir + '/testData/Kundur-2area/kunder_2area.dyr')
dyn_config_file = folder_dir + '/testData/Kundur-2area/json/kundur2area_dyn_config.json'
rl_config_file = folder_dir + '/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json'

class CustomDQNPolicy(FeedForwardPolicy):
    # two hidden layers of 128 units, plain MLP feature extraction
    def __init__(self, *args, **kwargs):
        super(CustomDQNPolicy, self).__init__(*args, **kwargs,
                                              layers=[128, 128],
                                              layer_norm=False,
                                              feature_extraction="mlp")

def main(learning_rate, env):
    tf.reset_default_graph()
    graph = tf.get_default_graph()
    model = DQN(CustomDQNPolicy, env, learning_rate=learning_rate, verbose=0, seed=5)
    callback = SaveOnBestTrainingRewardCallback(check_freq=1000, storedData=storedData)
    time_steps = 900000
    model.learn(total_timesteps=int(time_steps), callback=callback)

However, I got the same result. For some reason, the DQN agent cannot get past the mean-reward barrier of approximately 603.

https://photos.app.goo.gl/SSJyQQsA3vDhz1nt7

I also decided to run your full original testing code with the OpenAI Baselines DQN model. However, I got the same "~603 problem" policy.

Case id: 0, Fault bus id: Bus3, fault start time: 1,000000, fault duration: 0,585000

--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 3.27e+03 |
| mean 100 episode reward | -940     |
| steps                   | 9e+05    |
--------------------------------------

Restored model with mean reward: -602.8
Saving final model to: ./previous_model/kundur2area_multistep_581to585_bus2_90w_lr_0.0001_90w.pkl
total running time is -99249.84962964058
Java server terminated with PID: 12763
Finished!!

frostyduck commented 4 years ago

Sorry for the slow response. Do you still need help on this issue?

Yes, I still need your help on this issue.

qhuang-pnl commented 4 years ago

Hi,

I believe we did not correctly commit one RL training configuration file. Please use this updated one: https://github.com/RLGC-Project/RLGC/blob/master/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json

The settings correspond to our paper.
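As a quick sanity check that the run actually picks up the updated file (a minimal sketch, not from the thread; `folder_dir` is assumed to be set as in the training script, and the key names depend on the actual JSON content):

import json

rl_config_file = folder_dir + '/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json'

with open(rl_config_file) as f:
    rl_config = json.load(f)

# Print the full configuration so the fault and reward settings can be compared
# against the updated file in the repository
print(json.dumps(rl_config, indent=2, sort_keys=True))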

frostyduck commented 4 years ago

Hi,

I believe we did not correctly commit one RL training configuration file. Please use this updated one: https://github.com/RLGC-Project/RLGC/blob/master/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json

The settings correspond to our paper.

I tried training the DQN agent with these settings; however, I ran into the "~603 problem" policy again.

Case id: 0, Fault bus id: Bus3, fault start time: 1,000000, fault duration: 0,583000
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 3.36e+03 |
| mean 100 episode reward | -709     |
| steps                   | 9e+05    |
--------------------------------------
Restored model with mean reward: -603.0
Saving final model to: ./previous_model/kundur2area_multistep_581to585_bus2_90w_lr_0.0001_90w.pkl
total running time is 15627.888870954514

I think that, with your simulation settings, the duration of the short circuits is not long enough to cause loss of stability (penalty = -1000). Therefore, the agent chooses a no-action policy, which is probably what this "~603 problem" reflects. Perhaps, in this case, the RL agent has no incentive to find a better policy that reduces the negative rewards.
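One way to check this hypothesis (a minimal sketch, not from the thread) is to roll out a fixed no-action policy and watch the per-step rewards: if the -1000 instability penalty never appears and the episode return settles near -603, the plateau is consistent with a no-action optimum. It assumes `env` is the RLGC Gym environment used for training, that action index 0 means "no braking", and that the instability penalty shows up as a single large negative step reward.

# Assumptions: `env` is the RLGC Gym training environment and action 0 = "no braking"
for ep in range(5):
    obs = env.reset()
    done = False
    episode_return = 0.0
    saw_instability_penalty = False
    while not done:
        obs, reward, done, info = env.step(0)  # always take the "no action" action
        episode_return += reward
        if reward <= -1000:  # assumed to indicate the loss-of-stability penalty
            saw_instability_penalty = True
    print("episode %d: return = %.2f, instability penalty seen: %s"
          % (ep, episode_return, saw_instability_penalty))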

frostyduck commented 3 years ago

@qhuang-pnl, I have probably tracked down why the agent cannot get past the reward boundary of about -602. The issue is that, during training and testing in the environment (the Kundur system), short circuits are not actually simulated; I checked this. In other words, the agent learns purely on the normal operating conditions of the system, and in that case the optimal policy is to never apply the dynamic brake, i.e. the action is always 0.

I'm guessing this has something to do with the PowerDynSimEnvDef modifications: initially you used PowerDynSimEnvDef_v2, while I am now working with PowerDynSimEnvDef_v7.
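One way to narrow this down (a sketch under stated assumptions, not from the thread) would be to build the environment from both module versions with identical inputs and compare zero-action rollouts. It assumes both modules expose a `PowerDynSimEnv` class and that the modules are importable from the script's path; `make_env` is a hypothetical helper standing in for whatever constructor call the training script already uses (case files, config JSONs, jar path, port).

from PowerDynSimEnvDef_v2 import PowerDynSimEnv as EnvV2
from PowerDynSimEnvDef_v7 import PowerDynSimEnv as EnvV7

def no_action_returns(env, n_episodes=3):
    """Roll out a fixed no-action policy and collect the episode returns."""
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, done, info = env.step(0)
            total += reward
        returns.append(total)
    return returns

# make_env is hypothetical: it should pass the same arguments the training script
# already uses (case_files_array, dyn_config_file, rl_config_file, jar file, port).
for env_cls in (EnvV2, EnvV7):
    env = make_env(env_cls)
    print(env_cls.__module__, no_action_returns(env))
    env.close()

If the two versions give noticeably different returns for the same zero-action policy, that would point to the environment modifications rather than the agent.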