And these are my hyperparameters:
parser.add_argument("--memory_size", type=int, default=20000)
parser.add_argument("--random_action", type=int, default=1000)  # No seeding needed for IL (use 1000 for RL)
parser.add_argument("--min_samples_to_start", type=int, default=1000)
parser.add_argument("--alpha_init", type=float, default=0.5)
parser.add_argument("--soft_update_rate", type=float, default=0.005)
parser.add_argument("--mini_batch_size", type=int, default=128)
parser.add_argument("--save_period", type=int, default=200)
parser.add_argument("--gamma", type=float, default=0.99)
parser.add_argument("--lambda", type=float, default=0.95)
parser.add_argument("--actor_lr", type=float, default=3e-5)
parser.add_argument("--q_lr", type=float, default=3e-5)
parser.add_argument("--actor_train_epoch", type=int, default=1)
Hi, I observed similar behaviors. In your code you set
reward = (current_Q - y)[is_expert]
and compute the chi2 regularization only for the expert "reward" as
chi2_loss = 1 / (4 * 0.5) * (reward**2).mean()
which, in my experience, leads to divergence. The reason is that these "rewards" are in fact very large. If you look at the iq.py file, you will see that the authors compute the chi2 regularization on both the policy's and the expert's "reward". With that I do not have the divergence problem, but I am still not able to get good policies, though.
Another thing to point out is that the authors do not update alpha.
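To make this concrete, here is a minimal sketch of the critic loss I am describing (single Q-network, fixed alpha; current_Q, y, and is_expert as in the snippets above; this is not the authors' exact code):

import torch

def iq_critic_loss(current_Q, y, is_expert, alpha=0.5):
    # current_Q: Q(s, a) for a mixed batch of expert and policy samples
    # y:         bootstrap target (e.g. gamma * V(s') with done masking)
    # is_expert: boolean mask selecting the expert samples in the batch
    reward = current_Q - y  # implicit "reward"

    # First term: maximize the implicit reward on expert samples only
    loss = -reward[is_expert].mean()

    # chi2 regularization over BOTH expert and policy samples; computing it
    # only on the expert samples is what diverged for me
    chi2_loss = 1 / (4 * alpha) * (reward ** 2).mean()

    # (The full IQ-Learn "value" loss also has a value term over policy/initial
    # states, omitted here to keep the sketch focused on the chi2 part.)
    return loss + chi2_loss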
@Altriaex Thanks for your reply! Actually, I compute my reward using both the expert and learner data sets. In the first loss term, I set my reward as
reward = (current_Q - y)[is_expert]
and then, corresponding loss function is defined as:
loss = -(reward)
In the chi2 regularization, I again set my reward as
reward = (current_Q - y)
and the corresponding chi2_loss is defined as:
chi2_loss = 1 / (4 * 0.5) * (reward**2).mean()
which already uses both the expert and learner data sets.
Should I change my first reward from (current_Q - y)[is_expert] to (current_Q - y) for all loss terms, or apply (current_Q - y) only in the chi2_loss?
I will try without updating alpha. And if you have any loss plots for your own custom environment, could you share them?
Many thanks.
I myself still cannot make this algorithm work, so I also don't know what the best thing to do is.
Thanks. I will try without training alpha and let you know the results :)
@Div99 Hi, is the divergence of the critic function a normal phenomenon in IQ-Learn, or am I using the code in a wrong way? Thanks in advance :)
Hi, sorry for the delay in replying. I have observed that for continuous spaces you need to add the chi2 regularization on both the policy and the expert samples. The reason is that you have a separate policy network in the continuous setting, and without also regularizing the policy samples, we can learn large negative rewards for the policy that diverge toward negative infinity, preventing the method from converging.
For IQ-Learn on continuous spaces, I recommend setting method.regularize=True to enable the above behavior. Also try training with a single Q-network (instead of a double critic), and try disabling alpha training and playing with small alpha values like 1e-3 or 1e-2. If you are using the original code in the repo, you can try one of the settings from our Mujoco experiments script run_mujoco.sh.
For using automatic alpha training, you can see this issue: https://github.com/Div99/IQ-Learn/issues/5
In general, we want the imitation policy to have very low entropy compared to SAC, and setting entropy_target = -4 * dim(A) works well on most Mujoco environments when learning the alpha.
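For reference, here is a minimal sketch of the two alpha options above (SAC-style PyTorch; log_prob and action_dim are placeholders rather than variables from the repo):

import torch

# Option 1: disable alpha training and use a small fixed value
alpha = 1e-2  # try 1e-3, 1e-2, ...

# Option 2: learn alpha against a low-entropy target, e.g. -4 * dim(A)
action_dim = 6
entropy_target = -4 * action_dim          # much lower entropy than SAC's usual -dim(A)
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob):
    # log_prob: policy log-probabilities on the sampled batch (a tensor)
    alpha_loss = -(log_alpha.exp() * (log_prob + entropy_target).detach()).mean()
    alpha_optim.zero_grad()
    alpha_loss.backward()
    alpha_optim.step()
    return log_alpha.exp().detach()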
No, the critic should not diverge if the method is working well. It likely indicates a bug in the code or a wrong hyperparameter setting.
@Div99 Thanks for your reply, I was waiting for you! I am running my custom code in a vision-based collision-avoidance environment. The policy network takes visual inputs and produces a collision-free trajectory (3 points in 2D space, so dim(A) = 6). The policy and critic networks follow the same structure as the one used in the Atari example.
I have tried with the following settings:
method.loss = "value"
method.regularize = True
After training, I got the two loss plots: the first one is the actor loss and the other one is the q loss. I am still suffering from q-function divergence. When I printed the q values, I could see some large negative values, which result in a huge q loss. I need to check again whether I am using your method correctly by running your original code. Or can you guess any potential cause of the divergence?
We use a critic learning rate of 3e-4, so that could be one source of divergence.
I would also recommend trying a higher alpha like 1e-2 or 1e-1 if the above fix doesn't help. There could also be an issue with how the expert data is generated and whether it matches exactly with the policy data (obs normalization, etc.)
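As a quick sanity check for the obs-normalization point, something like this sketch can help (expert_obs and policy_obs are placeholder arrays, not names from the repo):

import numpy as np

def compare_obs_stats(expert_obs, policy_obs):
    # Simple sanity check: the expert observations and the observations the
    # policy collects should come from the same distribution / scaling.
    for name, obs in (("expert", expert_obs), ("policy", policy_obs)):
        obs = np.asarray(obs, dtype=np.float32)
        print(f"{name}: min={obs.min():.3f} max={obs.max():.3f} "
              f"mean={obs.mean():.3f} std={obs.std():.3f}")
    # If e.g. expert images are in [0, 1] but policy images are in [0, 255],
    # the critic sees two different distributions and can easily diverge.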
@mw9385 What about printing out your rewards? If you include the chi2 term, in theory you should have very small rewards, which should help prevent divergence.
For me, it turns out that the key is to use a single Q-function as the critic, as opposed to SAC's double-Q solution.
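To illustrate the difference, here is a rough sketch of the two target-value computations (the q, q1, q2 callables are placeholders, not the repo's API):

import torch

def target_value_single(q, next_obs, next_action, alpha, next_log_prob):
    # Single critic: V(s') = Q(s', a') - alpha * log pi(a'|s')
    return q(next_obs, next_action) - alpha * next_log_prob

def target_value_double(q1, q2, next_obs, next_action, alpha, next_log_prob):
    # SAC-style double critic: take the min of the two Q estimates.
    # In IQ-Learn this min may bias the implicit reward Q - gamma*V downwards,
    # which is one possible explanation for the divergence seen here.
    q_min = torch.min(q1(next_obs, next_action), q2(next_obs, next_action))
    return q_min - alpha * next_log_prob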
Great! Glad the single q network worked. It's not clear why the double q trick works for SAC but not here; maybe the min prevents learning the correct rewards.
When I use double q networks, the reward values are in the [-1, 1] range, which is not that high. I will try with a single Q-function! Thank you so much.
I will try a critic learning rate of 3e-4 with a single Q-network, and set my initial alpha value to 1e-2. I will let you know my results. I will also check whether my network inputs are correctly normalized.
@Altriaex @Div99 Hi, I have tried with a single Q critic and it works: I didn't see any divergence of the critic loss. I also ran the original code in the repo, and my loss shows similar behavior. The reason for the divergence is that the critic produces negative outputs (meaning the critic thinks the current states and actions are bad), and as training goes on the q values become more and more negative, resulting in divergence. Using a single critic removes this bug.
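In case it helps others, here is a small sketch of the logging I used to catch this, assuming the same current_Q / y names as above:

import torch

def log_q_stats(current_Q, y, step):
    # Watch the implicit reward Q - y drift: if the mean keeps getting more
    # negative over training, the critic is heading toward divergence.
    reward = (current_Q - y).detach()
    print(f"step {step}: Q mean {current_Q.mean().item():.3f}, "
          f"Q min {current_Q.min().item():.3f}, "
          f"implicit reward mean {reward.mean().item():.3f}")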
Many thanks :)
@Div99 Sorry, I have to reopen the issue, because the loss function seems to be very unstable: it fluctuates with a large magnitude. My hyperparameters are the ones listed above.
I have solved this issue by tuning hyperparameters. Closing this issue.
Hi, thank you for providing us with this wonderful code. I am trying to adopt the IQ method in my custom environment. However, I am facing a diverging critic loss. I tried to copy and paste the original code from GitHub, but this keeps happening. Is this a normal event when the IQ imitation learning method is combined with SAC, or am I using it in a wrong way? I uploaded my code with this post, along with my loss plot.