Closed nil123532 closed 1 month ago

Hi,

I've noticed in the code that the GRU is always reset after each round of training or inference. Wouldn't this negate the temporal capabilities of the GRU, effectively reducing it to a standard neural network?
Thanks for your question. Since the task identity is not given to the model in advance, the problem becomes a POMDP. Adding a context variable (such as a GRU hidden state, as described in the paper) addresses this, letting us treat the problem as an MDP again. The task is inferred from trajectory information, i.e., the history of states, actions, and rewards (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_t, a_t, r_t). The GRU's role is to consume this sequence and learn to identify the task from it. It therefore only needs the current trajectory to infer the current task, not information spanning multiple trajectories, which is why a reset at the start of each trajectory makes sense. You can also use other sequence models, such as a transformer, to encode a trajectory and infer task information, in which case you don't need to worry about resets (check out our paper where we use a transformer: https://proceedings.mlr.press/v232/caccia23a).
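For concreteness, here is a minimal sketch of what such a context encoder might look like (this assumes PyTorch; `ContextEncoder`, its dimensions, and all names below are hypothetical illustrations, not code from the repository):

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Hypothetical GRU that infers a task embedding from a trajectory."""

    def __init__(self, state_dim, action_dim, context_dim):
        super().__init__()
        # Input at each step is the concatenated (s_t, a_t, r_t).
        self.gru = nn.GRU(state_dim + action_dim + 1, context_dim,
                          batch_first=True)
        self.hidden = None  # None means "start of a fresh trajectory"

    def reset(self):
        # Called at the start of each trajectory: the task must be
        # re-inferred from scratch, so the history is discarded.
        self.hidden = None

    def forward(self, state, action, reward):
        # One transition, shaped (batch=1, seq_len=1, input_dim).
        step = torch.cat([state, action, reward], dim=-1).view(1, 1, -1)
        out, self.hidden = self.gru(step, self.hidden)
        return out.view(-1)  # context embedding z_t for the current step
```

Within a trajectory the hidden state is carried forward, so z_t depends on the whole history (s_0, a_0, r_0, ..., s_t, a_t, r_t); across trajectories it is reset, which is exactly the behavior you observed.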
Hope this helps.
Thank you! This was super helpful!
Just a quick follow-up question: during TD3-context training, do you need to train the same transformer/RNN across all tasks, or can each task have its own transformer/RNN?
Happy you found it useful. Just one for all tasks; that is the beauty of the context variable. You don't need to worry about how many tasks you have, you only need one encoder for all of them.
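As a sketch of what that looks like in a training loop (again with hypothetical names, reusing the `ContextEncoder` sketched above, and random tensors standing in for a real rollout):

```python
import torch

# A single shared encoder serves every task (dimensions are made up).
encoder = ContextEncoder(state_dim=8, action_dim=2, context_dim=32)

for task_id in range(5):          # works for any number of tasks
    encoder.reset()               # new trajectory: re-infer the task
    for t in range(10):           # dummy rollout with random transitions
        s, a, r = torch.randn(8), torch.randn(2), torch.randn(1)
        z = encoder(s, a, r)      # context embedding, shape (32,)
    # z now summarizes this trajectory; it would be fed to the
    # TD3 actor and critic alongside the raw state.
```

The encoder's parameters are what get trained across all tasks; only its hidden state is task- and trajectory-specific.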
Wow! Nice find! Thank you so much for answering my questions!