Closed Liuxueyi closed 1 week ago
Hi Liuxueyi, thank you for your interest in our research, and I apologize for the delayed response.
The convolution layers in DC are designed to capture the relationships between each modality (RTG, state, action) and the prior tokens (RTG, state, action) across timesteps using learned convolution filters. Since RTG, state, and action tokens may depend on prior tokens in different ways, DC uses three distinct filters—an RTG filter, a state filter, and an action filter—to model these relationships independently. These filters let the model capture modality-specific patterns in how each token depends on its history, considering not only its own previous tokens but also those of the other modalities.

Although the filters are applied separately for each modality, their outputs are not independent: after the convolution operation, the outputs from all three filters are combined in later layers, allowing the model to learn the interactions and dependencies between the modalities over time.

This design captures local temporal relationships within each modality while also enabling DC to understand how the modalities influence one another, leading to more informed and coordinated decision-making.
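To make the idea above concrete, here is a minimal pure-Python sketch, not the actual DC implementation: tokens are interleaved as [R_1, s_1, a_1, R_2, s_2, a_2, ...] (scalars for simplicity), and each token is produced by applying its modality's causal filter to the window of preceding tokens, so every output mixes information from all three modalities. The names `FILTERS`, `causal_conv`, `WINDOW`, and the filter weights are illustrative assumptions; in the real model the filters are learned and operate on embedding vectors.

```python
WINDOW = 6  # causal window: two timesteps * three modalities (illustrative)

# Hypothetical per-modality filter weights (learned in the real model).
# Each filter spans the full interleaved window, so an RTG token can
# attend to prior states and actions, not just prior RTGs.
FILTERS = {
    "rtg":    [0.1, 0.0, 0.2, 0.0, 0.3, 0.4],
    "state":  [0.0, 0.2, 0.1, 0.3, 0.0, 0.4],
    "action": [0.2, 0.1, 0.0, 0.0, 0.3, 0.4],
}
MODALITIES = ["rtg", "state", "action"]  # interleaving order of the tokens

def causal_conv(tokens):
    """Apply the modality-specific filter to each token's causal window."""
    out = []
    for i in range(len(tokens)):
        # Pick the filter by the token's modality (position mod 3).
        filt = FILTERS[MODALITIES[i % 3]]
        # Left-pad with zeros so the window always has WINDOW entries
        # and never looks at future tokens (causality).
        window = [0.0] * max(0, WINDOW - i - 1) \
            + tokens[max(0, i - WINDOW + 1): i + 1]
        out.append(sum(w * t for w, t in zip(filt, window)))
    return out
```

In a full model each per-modality convolution output would then pass through shared layers (e.g. an MLP over the mixed sequence), which is where the cross-modality interactions described above are learned.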
Hello! Thank you for releasing the code. I have a question from reading it: the three sequential blocks containing the convolution layers process the states, actions, and returns concurrently, and the paper says they are designed for the different modalities respectively. Could you please explain in more detail how the convolution layers capture the MDP features?