caoyunkang / AdaCLIP

[ECCV2024] The Official Implementation for ''AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection''

About code of HybridSemanticFusion #9

Closed NEU-rzh closed 2 days ago

NEU-rzh commented 2 weeks ago

Wonderful work! I have a question regarding the implementation of the cluster-center calculation. In the original paper, the HSF module is described as follows:

The HSF module follows a three-step paradigm: ① Cluster patch embeddings into K groups using KMeans. ② Compute the anomaly scores of individual clusters by averaging the scores of the corresponding positions in the anomaly map M. ③ Select the cluster with the highest anomaly scores, calculate its centroids, and aggregate them into the final semantic-rich image embedding FI, which encapsulates semantic information about the most abnormal clusters.
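For reference, read literally, step ③ would look something like the sketch below. This is my own illustration of the paper's wording, not the repository code; all names are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def paper_style_hsf(patch_embeddings, anomaly_scores, k_clusters=8):
    """Literal reading of the paper's HSF: pick the single most anomalous
    cluster and return its centroid as the image embedding.

    patch_embeddings: (L, C) array of patch embeddings
    anomaly_scores:   (L,) anomaly score per patch, taken from the map M
    """
    labels = KMeans(n_clusters=k_clusters, n_init="auto").fit_predict(patch_embeddings)
    # ② average the anomaly scores inside each cluster
    cluster_scores = np.array(
        [anomaly_scores[labels == c].mean() for c in range(k_clusters)])
    # ③ take the centroid of the most anomalous cluster
    worst = int(cluster_scores.argmax())
    centroid = patch_embeddings[labels == worst].mean(axis=0)
    return centroid / np.linalg.norm(centroid)  # L2-normalize, as the code does with F.normalize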

However, in the code implementation, after performing KMeans clustering, the code computes the center of each cluster and then averages all of these centers to produce the final semantic-rich image embedding. This seems to imply that, regardless of the initial clustering, the final result might be roughly the same, since it is an average of averages.

Here is the relevant part of the code:

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans


class HybridSemanticFusion(nn.Module):
    def __init__(self, k_clusters):
        super(HybridSemanticFusion, self).__init__()
        self.k_clusters = k_clusters
        self.n_aggregate_patch_tokens = k_clusters * 5
        self.cluster_performer = KMeans(n_clusters=self.k_clusters, n_init="auto")

    # @torch.no_grad()
    def forward(self, patch_tokens: list, anomaly_maps: list):
        # Average the per-layer anomaly maps and take the anomaly probability
        anomaly_map = torch.mean(torch.stack(anomaly_maps, dim=1), dim=1)
        anomaly_map = torch.softmax(anomaly_map, dim=2)[:, :, 1]  # (B, L)

        # Extract the most abnormal patch tokens from each layer
        selected_abnormal_tokens = []
        k = min(anomaly_map.shape[1], self.n_aggregate_patch_tokens)
        top_k_indices = torch.topk(anomaly_map, k=k, dim=1).indices  # (B, k) indices of the most anomalous patches
        for layer in range(len(patch_tokens)):
            selected_tokens = patch_tokens[layer]. \
                gather(dim=1, index=top_k_indices.unsqueeze(-1).
                       expand(-1, -1, patch_tokens[layer].shape[-1]))
            selected_abnormal_tokens.append(selected_tokens)

        # Concatenate the per-layer tokens along the channel dimension for clustering
        stacked_data = torch.cat(selected_abnormal_tokens, dim=2)

        batch_cluster_centers = []
        # Perform K-Means clustering per image and extract the centroids
        for b in range(stacked_data.shape[0]):
            # Cluster label for each abnormal token
            cluster_labels = self.cluster_performer.fit_predict(
                stacked_data[b, :, :].detach().cpu().numpy())

            # Compute the center of each cluster by pooling the matching tokens from all layers
            cluster_centers = []
            for cluster_id in range(self.k_clusters):
                collected_cluster_data = []
                for abnormal_tokens in selected_abnormal_tokens:
                    cluster_data = abnormal_tokens[b, :, :][cluster_labels == cluster_id]
                    collected_cluster_data.append(cluster_data)
                collected_cluster_data = torch.cat(collected_cluster_data, dim=0)
                cluster_center = torch.mean(collected_cluster_data, dim=0, keepdim=True)
                cluster_centers.append(cluster_center)

            # Average the cluster centers into a single embedding
            cluster_centers = torch.cat(cluster_centers, dim=0)
            cluster_centers = torch.mean(cluster_centers, dim=0)
            batch_cluster_centers.append(cluster_centers)

        batch_cluster_centers = torch.stack(batch_cluster_centers, dim=0)
        batch_cluster_centers = F.normalize(batch_cluster_centers, dim=1)

        return batch_cluster_centers
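For reference, the module would be invoked roughly as follows. The shapes and layer count here are my own assumptions, for illustration only.

# Hypothetical shapes for illustration: B images, L patches, C channels, 4 layers.
B, L, C = 2, 1369, 768
patch_tokens = [torch.randn(B, L, C) for _ in range(4)]   # per-layer patch embeddings
anomaly_maps = [torch.randn(B, L, 2) for _ in range(4)]   # per-layer 2-class anomaly logits

hsf = HybridSemanticFusion(k_clusters=20)
image_embedding = hsf(patch_tokens, anomaly_maps)
print(image_embedding.shape)  # torch.Size([2, 768]): one L2-normalized embedding per image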
caoyunkang commented 1 week ago

Thank you for your question!

I apologize for any confusion caused by the differences between our implementation and what is described in the paper. In our approach, we:

  1. Extract the most abnormal tokens.
  2. Group these tokens into clusters.
  3. Calculate the average center of each cluster and fuse them together.

We use a hyperparameter, k_clusters, to control the number of abnormal tokens extracted. Our goal is to gather information from multiple potentially abnormal clusters rather than focusing on just the most abnormal one, as we found this method to provide more stability.

This approach can be thought of as a weighted average function with adaptive weights, rather than a simple average of all abnormal tokens. We welcome any further discussion on this topic and would appreciate suggestions for improvements.
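To make the weighted-average reading concrete, here is a toy check (my own illustration; the numbers are arbitrary). Averaging the K cluster centers effectively weights each selected token by 1/(K·|its cluster|), so tokens in small clusters contribute more than tokens in large clusters, and the result generally differs from the plain mean of all tokens:

import torch

K = 20
tokens = torch.randn(100, 8)                                      # 100 selected abnormal tokens, dim 8
labels = torch.cat([torch.arange(K), torch.randint(0, K, (80,))])  # ensure no cluster is empty

# Mean of cluster means: token i effectively gets weight 1 / (K * |cluster(i)|)
centers = torch.stack([tokens[labels == c].mean(0) for c in range(K)])
mean_of_means = centers.mean(0)

sizes = torch.bincount(labels, minlength=K).float()
weights = 1.0 / (K * sizes[labels])                               # per-token adaptive weight
weighted = (weights.unsqueeze(1) * tokens).sum(0)

print(torch.allclose(mean_of_means, weighted, atol=1e-5))         # True: identical by construction
print(torch.allclose(mean_of_means, tokens.mean(0), atol=1e-5))   # almost surely False: clusters are not equal-sized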

Best regards,

NEU-rzh commented 1 week ago


Thank you for your detailed response to my previous question. I appreciate the clarification on the implementation of the Hybrid Semantic Fusion (HSF) module in AdaCLIP. However, I still have a concern that I hope you can help me understand better: the difference between averaging the cluster centers and directly averaging all the samples.

As I understand it after reading the code implementation, the process involves:

  1. Selecting 100 abnormal patch tokens.
  2. Clustering these tokens into 20 clusters.
  3. Calculating the center of each cluster by averaging the tokens within each cluster.
  4. Averaging these 20 cluster centers to get a final representation.

My question is: would this final representation be significantly different from directly averaging the 100 abnormal tokens? Intuitively, both methods should yield a similar result, since both involve averaging, albeit at different levels of aggregation. I am trying to understand whether there is a substantial difference in the outcome, or whether one method provides certain benefits over the other.

Looking forward to your insights on this matter.

Best regards,

caoyunkang commented 1 week ago

> My question is, would this final representation be significantly different from directly averaging the 100 abnormal tokens?

Thank you for your insightful question! To be honest, I haven't tried directly averaging all the abnormal tokens, but intuitively, the difference might not significantly impact the final performance. It could be worth running a quick experiment to validate this.
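If someone wants to run that experiment, a minimal sketch might look like the following. The function name and random data are placeholders; in practice one would plug in the tokens actually selected inside HybridSemanticFusion.

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def compare_aggregations(selected_tokens, k_clusters=20):
    """selected_tokens: (N, C) most-abnormal patch tokens from one image."""
    labels = KMeans(n_clusters=k_clusters, n_init="auto").fit_predict(
        selected_tokens.detach().cpu().numpy())
    labels = torch.as_tensor(labels)
    centers = torch.stack(
        [selected_tokens[labels == c].mean(0) for c in range(k_clusters)])
    via_clusters = F.normalize(centers.mean(0), dim=0)   # mean of cluster centers
    direct = F.normalize(selected_tokens.mean(0), dim=0)  # plain mean of all tokens
    return F.cosine_similarity(via_clusters, direct, dim=0).item()

# A cosine close to 1.0 would support the intuition that the two aggregations rarely differ.
print(compare_aggregations(torch.randn(100, 768)))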

I should also mention that the design of the HSF is somewhat heuristic. The primary contributions of AdaCLIP lie in its hybrid multimodal prompts. While a more refined image-level fusion method would certainly benefit anomaly detection, AdaCLIP does not fully address this aspect.

NEU-rzh commented 2 days ago

Thank you very much for your insightful reply. It offers a practical perspective on the implementation. I think we can close this issue, as it has been satisfactorily resolved.

Thank you once again for your time and expertise. I look forward to the possibility of more discussions and learning opportunities.