Closed Liuxueyi closed 1 week ago
Hi Liuxueyi, thank you for your interest in our research, and I apologize for the delayed response.
The convolution layers in DC are designed to capture the relationships between each modality (RTG, state, action) and the prior tokens (RTG, state, action) across timesteps using learned convolution filters. Since RTG, state, and action tokens may depend on prior tokens in different ways, DC uses three distinct filters—an RTG filter, a state filter, and an action filter—to model these relationships independently. These filters let the model capture modality-specific patterns in how each token depends on its history, considering not only its own previous tokens but also those of the other modalities.

Although the filters are applied separately for each modality, their outputs are not independent: after the convolution operation, the outputs from all three filters are combined in later layers, allowing the model to learn the interactions and dependencies between the modalities over time.

This design captures local temporal relationships within each modality while also enabling DC to understand how the modalities influence one another, leading to more informed and coordinated decision-making.
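To make the idea above concrete, here is a minimal pure-Python sketch, not the actual DC implementation: tokens are interleaved as [R_1, s_1, a_1, R_2, s_2, a_2, ...] (scalars for simplicity), and each token is produced by applying its modality's causal filter to the window of preceding tokens, so every output mixes information from all three modalities. The names `FILTERS`, `causal_conv`, `WINDOW`, and the filter weights are illustrative assumptions; in the real model the filters are learned and operate on embedding vectors.

```python
WINDOW = 6  # causal window: two timesteps * three modalities (illustrative)

# Hypothetical per-modality filter weights (learned in the real model).
# Each filter spans the full interleaved window, so an RTG token can
# attend to prior states and actions, not just prior RTGs.
FILTERS = {
    "rtg":    [0.1, 0.0, 0.2, 0.0, 0.3, 0.4],
    "state":  [0.0, 0.2, 0.1, 0.3, 0.0, 0.4],
    "action": [0.2, 0.1, 0.0, 0.0, 0.3, 0.4],
}
MODALITIES = ["rtg", "state", "action"]  # interleaving order of the tokens

def causal_conv(tokens):
    """Apply the modality-specific filter to each token's causal window."""
    out = []
    for i in range(len(tokens)):
        # Pick the filter by the token's modality (position mod 3).
        filt = FILTERS[MODALITIES[i % 3]]
        # Left-pad with zeros so the window always has WINDOW entries
        # and never looks at future tokens (causality).
        window = [0.0] * max(0, WINDOW - i - 1) \
            + tokens[max(0, i - WINDOW + 1): i + 1]
        out.append(sum(w * t for w, t in zip(filt, window)))
    return out
```

In a full model each per-modality convolution output would then pass through shared layers (e.g. an MLP over the mixed sequence), which is where the cross-modality interactions described above are learned.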
Hello! Thank you for releasing the code. I have a question from reading it: the three sequential blocks containing the convolution layers process the states, actions, and returns concurrently, and the paper says they are designed for the different modalities respectively. Could you please explain in more detail how the convolution layers capture the MDP features?