YangRui2015 / RORL

Code for NeurIPS 2022 paper "Robust Offline Reinforcement Learning via Conservative Smoothing"
MIT License

Some problems about the implementation of RORL #3

Closed: awecefil closed this issue 6 months ago

awecefil commented 7 months ago

Hi,

I am currently encountering some issues while trying to implement RORL; here are the problems:

  1. The training time for RORL seems to be quite long (due to the additional computation of the three extra losses). Therefore, I'd like to ask: among the gym environments you have tested, which one converges fastest? That way I can more quickly check whether the results I'm reproducing match those in the paper. (If you could provide training curves for each environment (training epoch vs. D4RL score), that would be great.)
  2. Do you have any recommendations on how to set the hyperparameters for Q smooth loss, OOD loss, and policy smooth loss? (e.g., based on the range of state and action?)
  3. I noticed that the actor loss (policy gradient + entropy, i.e., the original SAC actor loss) increases over the course of training (due to the increasing Q value), while the policy smooth loss remains very small (e.g., 0.000025). The two terms seem disproportionate in scale. Would this imbalance make the policy smooth loss ineffective in influencing the actor's update?
  4. If I understand correctly, the purpose of Equation 4 is to reduce the critic's Q value for OOD states/actions by subtracting the uncertainty $u(\hat{s}, \hat{a})$. However, after the subtraction only $u$ itself should remain (since $(Q_{\phi} - u) - Q_{\phi}$ leaves only $-u$). Why subtract both terms instead of just keeping $u$? Is it to emphasize setting a "lower Q" as the update target?

Sorry for the multiple questions. Thank you for your help.

YangRui2015 commented 7 months ago

Thank you for your interest in our work. I hope the following responses can address your questions:

  1. Based on my experience, halfcheetah-medium-v2 converges faster. However, I recommend using walker2d-medium-v2, since a well-implemented approach exhibits notable improvements over the baselines there. Additionally, some training curves are provided in Figure 19 in the Appendix.

  2. We have provided the hyperparameters in Table 4. Furthermore, you can find tips for hyperparameter tuning in the README of this repository and in Appendix D of our paper. The most crucial hyperparameters in the OOD loss are determined by the data quality and coverage (i.e., whether we need to enforce a sufficiently large OOD penalty), rather than by the ranges of states (normalized to a normal distribution) and actions (the same range for all tasks).

  3. We observed that a larger policy smoothing loss can lead to poor performance, so the trade-off was determined empirically. We have found that a policy smoothing loss on this scale is still effective.

  4. Equation 4 is utilized to establish a smaller Q target for OOD state-action pairs with larger uncertainty. It bears similarity to the pseudo-target in PBRL (https://arxiv.org/abs/2202.11566). Like the conventional TD target, it is detached from gradients, so you cannot directly remove $Q_{\phi}$ from the loss function (see the sketch below).
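
To make point 4 concrete, here is a minimal PyTorch-style sketch of an uncertainty-penalized pseudo-target for OOD pairs. It is only an illustration of the idea: the function name, `q_ensemble`, and `beta_ood` are my own placeholders, not the exact code or hyperparameters from this repository.

```python
import torch

def ood_critic_loss(q_ensemble, ood_states, ood_actions, beta_ood=1.0):
    # Predictions from each ensemble member: shape (num_heads, batch, 1)
    q_preds = torch.stack([q(ood_states, ood_actions) for q in q_ensemble], dim=0)

    # Uncertainty u(s_hat, a_hat): standard deviation across the ensemble
    u = q_preds.std(dim=0, keepdim=True)

    # Pseudo-target: current Q minus an uncertainty penalty, detached exactly
    # like a conventional TD target, so Q_phi does not cancel out of the loss
    target = (q_preds - beta_ood * u).detach()

    # Push each Q head down toward its penalized target
    return ((q_preds - target) ** 2).mean()
```

Because `target` is detached, gradients only flow through the first `q_preds` term, so minimizing this loss lowers $Q_{\phi}(\hat{s}, \hat{a})$ by roughly $\beta \cdot u(\hat{s}, \hat{a})$; the loss does not simply collapse to the uncertainty term.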

Please feel free to reach out if you require further clarifications or assistance.

awecefil commented 6 months ago

Hi, thank you for your responses to my questions. Questions 1, 3, and 4 are now clear to me (question 4 is easy to understand once I think of it as a conventional TD-style implementation).

For question 2, I would like to understand your response a bit more clearly. You mention that

The most crucial hyperparameters in the OOD loss are determined by the data quality and coverage (i.e., whether we need to enforce a sufficiently large OOD penalty)

So if we monitor the OOD loss during offline training, we must ensure that the OOD loss is not too small, right? (I have found that in the environment used in my own project, the OOD loss can be very small, less than 10e-7.)

Besides, I have a few more questions:

  1. I noticed that in sac.py, roughly between lines 190 and 200, behavior cloning is performed when `_num_train_steps < policy_eval_start`. I'd like to ask how effective this part was in your experiments. The reason I ask is that I'm actually using TD3 + RORL for offline training, but I find that this approach alone doesn't yield satisfactory results: the policy fails to learn behaviors that are close to the offline data. However, when I switch to TD3+BC + RORL, it learns faster (due to the supervised property of BC), and RORL can further improve the performance of TD3+BC (a rough sketch of the actor objective I mean is given after this list).
  2. Continuing from question 1, have you tried fine-tuning a model trained offline with BC in your RO2O experiments? I found that while BC can effectively improve offline learning performance, its overly conservative nature is not conducive to online fine-tuning. However, if BC is removed during online fine-tuning, the change in the loss leads to a rapid decline in performance in the early stages of fine-tuning, approaching that of a random policy, which defeats the original purpose of offline training.
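
For reference, the TD3+BC-style actor objective I mean in point 1 looks roughly like the sketch below (my own simplified version with made-up names, not code from either repository):

```python
import torch.nn.functional as F

def td3_bc_actor_loss(actor, critic, states, dataset_actions, alpha=2.5):
    # Deterministic policy action and its Q value
    pi = actor(states)
    q = critic(states, pi)

    # Adaptive weight from the TD3+BC paper: lambda = alpha / mean(|Q|)
    lam = alpha / q.abs().mean().detach()

    # Maximize Q while staying close to the dataset actions via a supervised BC term
    return -lam * q.mean() + F.mse_loss(pi, dataset_actions)
```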

If there's anything unclear about my questions, please let me know. Thank you.

YangRui2015 commented 6 months ago

For your first question, the OOD loss typically does not remain too small (its scale is between 1.0 and 0.01) and tends to decrease over time (though it may increase during the first several steps). When the policy performs well within the dataset distribution, the OOD loss may indeed be small. A more effective signal to watch is the Q value together with policy performance: if the Q value is excessively large and policy performance is poor, you need to adjust the hyperparameters to increase the OOD loss; conversely, if the Q value diverges negatively, the OOD loss should be set to a smaller value (see the rough heuristic sketched below).
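
Put as a hypothetical rule of thumb in code (every name and threshold below is a placeholder of mine, not a value from the paper), the adjustment described above looks like:

```python
def adjust_ood_weight(mean_q, eval_score, beta_ood,
                      q_high=500.0, q_low=-100.0, score_low=30.0):
    # Thresholds are illustrative; tune them for your own task and reward scale
    if mean_q > q_high and eval_score < score_low:
        # Q values are blowing up while the policy is weak: strengthen the OOD penalty
        return beta_ood * 2.0
    if mean_q < q_low:
        # Q values are diverging negatively: the OOD penalty is too strong
        return beta_ood * 0.5
    return beta_ood
```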

For the other questions:

  1. You can verify that the hyperparameter 'policy_eval_start' is set to 0 in our experiments. Although we did not use this technique in most of the experiments, we observed its effectiveness in AntMaze tasks, where it provides a useful initial policy for later optimization in a sparse-reward setting (see the sketch after this list). Additionally, I recommend the paper https://arxiv.org/pdf/2303.14716.pdf, which proposes TD3+BC with ensemble Q functions, similar to your approach.
  2. We did not attempt fine-tuning of a model trained with BC offline in the RO2O experiments, which you can explore. It is possible that this approach could assist in mitigating performance drops during fine-tuning.
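
For point 1, the warm-start switch works roughly as in the sketch below. This is a simplified paraphrase of the logic around 'policy_eval_start' in sac.py; the `actor.log_prob` helper and the argument names are my own shorthand, not the literal code.

```python
def actor_loss(actor, critic, states, dataset_actions, alpha_entropy,
               num_train_steps, policy_eval_start):
    # Sample actions from the current policy together with their log-probabilities
    new_actions, log_pi = actor(states)

    if num_train_steps < policy_eval_start:
        # Behavior-cloning warm start: maximize the likelihood of the dataset
        # actions (plus the entropy term) instead of using the Q function
        bc_log_prob = actor.log_prob(states, dataset_actions)
        return (alpha_entropy * log_pi - bc_log_prob).mean()

    # Standard SAC-style actor loss once the warm-start phase is over
    q_new = critic(states, new_actions)
    return (alpha_entropy * log_pi - q_new).mean()
```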

awecefil commented 6 months ago

I really appreciate your responses, and thank you for releasing the source code of RORL. This is great work.