ZhicengShi / mclSTExp

mclSTExp: Multimodal Contrastive Learning for Spatial Gene Expression Prediction Using Histology Images

the role of positional embeddings #2

Closed NBitBuilder closed 4 days ago

NBitBuilder commented 1 week ago

Thank you for sharing the code! It’s a valuable contribution to the community, as it will help reproduce the HE+ST experiments.

I do have a concern regarding the use of positional information. Intuitively, positional data gives the model more contextual knowledge, so the notion of context is central. However, since this model is trained at the patch-spot level and only a single coordinate per patch-spot is fed into the network, I'm curious why it still works effectively. When we revisit positional embeddings in models like ViT or similar networks, multiple entries (tokens or instances) of the image/slide are provided together with their coordinates, and this mechanism seems to be a key reason positional encoding works in those cases. Yet this code lacks such mechanics. Why does it still function successfully?

Actually, Figure 1 in the paper led me to believe that the model is trained with a large crop containing multiple patch-spots, and that positional encoding is applied to the entire crop. However, the implementation seems to contradict this approach. Could you clarify this?

NBitBuilder commented 1 week ago

Also, I'm concerned with the attention module for gene expression encoding, specifically the following code from model.py. Since the input is [batch_size, gene_expression_dim], you add a new dimension at dim=0, so the attention score is computed across gene expressions from different spots. What's the point of allowing such information flow among spot gene expressions, which are randomly sampled without any contextual relations?

```python
spot_features = spot_features.unsqueeze(dim=0)
spot_embeddings = self.spot_encoder(spot_features)
spot_embeddings = self.spot_projection(spot_embeddings)
spot_embeddings = spot_embeddings.squeeze(dim=0)
```
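To make the dimension handling concrete, here is a minimal, self-contained sketch (not the repo's actual encoder; the layer sizes and the use of `nn.MultiheadAttention` with `batch_first=True` are my assumptions) showing that after `unsqueeze(dim=0)` the whole mini-batch of spots becomes one sequence, so attention scores are indeed computed between spots:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, gene_dim = 4, 16          # hypothetical sizes, not the repo's
spot_features = torch.randn(batch_size, gene_dim)

attn = nn.MultiheadAttention(embed_dim=gene_dim, num_heads=4, batch_first=True)

# unsqueeze(dim=0): (batch_size, gene_dim) -> (1, batch_size, gene_dim).
# With batch_first=True this reads as (batch=1, seq_len=batch_size, embed_dim),
# i.e. the mini-batch of spots is treated as ONE sequence, so the attention
# weights form a (batch_size x batch_size) matrix across spots.
x = spot_features.unsqueeze(dim=0)
out, weights = attn(x, x, x)

print(weights.shape)              # torch.Size([1, 4, 4]): pairwise spot scores
print(out.squeeze(dim=0).shape)   # torch.Size([4, 16]): back to per-spot embeddings
```

In other words, the "batch" axis of the sampled spots is repurposed as the sequence axis, which is exactly why information flows between randomly sampled spots.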

ZhicengShi commented 1 week ago

Thank you for your insightful questions! Let me address them in detail:

On the effectiveness of positional information: You mentioned that the model only feeds the single coordinate of each patch-spot into the network and still works effectively. This is indeed related to how we understand "context." While models like ViT use multi-entry (token or patch) positional information, in our model, even a single patch-spot coordinate can provide spatial constraints for the network. In spatial transcriptomics, each spot represents a specific spatial point. Although the model is trained at the patch-spot level, positional information helps the network establish local spatial relationships. In other words, a single coordinate serves as an "anchor" for each spot, allowing the model to make inferences based on its relative position.
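One way to picture the "anchor" idea: each spot's (x, y) grid coordinate can be mapped to a learned embedding that is added to the spot's feature vector, so spots at nearby grid positions receive similar positional signals. The following is a minimal sketch under my own assumptions (learned per-axis embedding tables combined additively; the table sizes and names are illustrative, not the repo's exact implementation):

```python
import torch
import torch.nn as nn

class CoordAnchor(nn.Module):
    """Adds a learned positional embedding for each spot's integer grid coordinate.

    max_x / max_y bound the ST grid and feat_dim is the spot feature size;
    all three are illustrative hyperparameters, not the paper's values.
    """
    def __init__(self, max_x: int, max_y: int, feat_dim: int):
        super().__init__()
        self.x_embed = nn.Embedding(max_x, feat_dim)
        self.y_embed = nn.Embedding(max_y, feat_dim)

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats: (n_spots, feat_dim); coords: (n_spots, 2) integer grid positions.
        # Adding the two axis embeddings "anchors" each feature vector to its spot.
        return feats + self.x_embed(coords[:, 0]) + self.y_embed(coords[:, 1])

feats = torch.randn(3, 8)
coords = torch.tensor([[0, 0], [0, 1], [5, 7]])
anchored = CoordAnchor(max_x=64, max_y=64, feat_dim=8)(feats, coords)
print(anchored.shape)  # torch.Size([3, 8])
```

Even with a single coordinate per spot, two spots that share an x or y index receive correlated positional signals, which is the sense in which one coordinate can still impose a spatial constraint.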

On the discrepancy between Figure 1 and the implementation: Figure 1 may give the impression that the model is trained with large crops containing multiple patch-spots, with positional encoding applied to the entire crop. In practice, however, the code encodes individual patch-spots. This design choice simplifies the model without sacrificing spatial information: the model extracts local features at the patch-spot level and then uses positional information to help combine and predict features, so the resulting embeddings still reflect the slide-level layout. This simplifies positional encoding while still capturing spatial context effectively.

On the gene expression flow in the attention module: Regarding your concern about the attention module, particularly why information flow between spot gene expressions is ensured when they are randomly sampled and lack explicit contextual relationships, this can be seen as a form of feature sharing. In spatial transcriptomics, although gene expression from different spots is sampled independently, they originate from the same tissue environment. By applying attention across different spots, the model can capture more global features. This way, even if some spots have noisy or insufficient gene expression, the model can supplement and correct this using information from other spots. The attention mechanism thus enhances the network’s robustness to sparse data, even without explicit contextual relationships between the spots. Of course, we also experimented with applying attention only between spots with explicit contextual relationships, but the results were slightly less effective.
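As a toy illustration of the feature-sharing argument (purely illustrative, not the paper's experiment): if several spots carry noisy measurements of a shared tissue-level signal, any attention step that mixes information across spots reduces the per-spot noise. The extreme case of uniform attention weights is simple averaging:

```python
import torch

torch.manual_seed(0)

signal = torch.ones(8)                    # shared tissue-level signal (assumed)
spots = signal + 0.5 * torch.randn(6, 8)  # 6 spots with independent noise

# Uniform attention across spots is equivalent to averaging; a learned soft
# attention pattern interpolates between "keep my own spot" and this extreme.
mixed = spots.mean(dim=0, keepdim=True).expand_as(spots)

per_spot_err = (spots - signal).pow(2).mean()
mixed_err = (mixed - signal).pow(2).mean()
print(per_spot_err.item(), mixed_err.item())  # mixing shrinks the error
```

This is only a caricature of the mechanism, but it conveys why cross-spot attention can correct noisy or sparse expression even when the sampled spots share no explicit spatial context.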

I hope these explanations help clarify the design logic behind the implementation. Once again, thank you for your valuable questions!
