apostolikas / Language-Specific-Subnetworks

This repo investigates the cross-lingual sharing mechanism of multilingual models through their subnetworks
MIT License

KNN + Visualization #9

Closed. gergopool closed this issue 1 year ago.

gergopool commented 1 year ago

[?] Flatten the 75 masks and do some clustering on them. I am not sure how we could quantify the results, though.

However, if we take the 75x144 matrix and reduce it with PCA, or preferably t-SNE, to 75x2, then we can visualize it on the 2D plane. Languages could have different colours and tasks different markers. Hopefully, we would see a plot where similar tasks and languages stand close to each other.
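
Something like the sketch below could work; the mask array and the language/task labels are placeholders I made up for illustration, not the repo's actual data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(75, 144))  # placeholder for the real flattened masks

# Assumed layout purely for illustration: 5 languages x 3 tasks x 5 seeds = 75 runs
languages = np.repeat(["lang1", "lang2", "lang3", "lang4", "lang5"], 15)
tasks = np.tile(np.repeat(["task1", "task2", "task3"], 5), 5)

# t-SNE down to 75x2 (sklearn.decomposition.PCA would slot in the same way)
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(masks.astype(float))

markers = dict(zip(["task1", "task2", "task3"], "os^"))
colors = dict(zip(["lang1", "lang2", "lang3", "lang4", "lang5"], "rgbmc"))
for (x, y), lang, task in zip(coords, languages, tasks):
    plt.scatter(x, y, color=colors[lang], marker=markers[task])
plt.title("Masks in 2D: colour = language, marker = task")
plt.show()
```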

vasilisvyth commented 1 year ago

Regarding clustering: since not every experiment has the same number of masks, we need to use all 144 mask positions and just put 0 in the ones that were pruned. We can try hierarchical clustering with 5 clusters (= the number of languages) and see whether we get one cluster per language. Then, since we have a hierarchy of clusters, we would want to see whether the different language clusters are grouped by language family. Since we need some kind of distance to apply clustering, I guess this approach could be more useful with the heads' importance scores.
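
A rough sketch of this idea with scipy, assuming a hypothetical (75, 144) array of head importance scores (zeros at pruned positions) and made-up language labels:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
scores = rng.random((75, 144))  # placeholder importance scores, 0 at pruned heads
languages = np.repeat(["lang1", "lang2", "lang3", "lang4", "lang5"], 15)  # assumed labels

# Agglomerative clustering, then cut the tree into 5 clusters (= number of languages)
Z = linkage(scores, method="average", metric="euclidean")
labels = fcluster(Z, t=5, criterion="maxclust")

# Is each cluster dominated by one language?
for k in range(1, 6):
    print(k, np.unique(languages[labels == k], return_counts=True))

# scipy.cluster.hierarchy.dendrogram(Z) would show whether the language
# clusters merge in an order that follows language families
```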

gergopool commented 1 year ago

I'm not sure what you mean by not having the same number of masks. We have exactly 75 masks, each 12x12.

While you expect 5 clusters of languages at the first stage, it is equally likely that the hierarchy starts with 3 clusters of tasks instead. We cannot tell which of these forms the larger or the smaller grouping. I think all we can expect is that masks made on the same language and the same task should be close. So it would be nice if we could see 15 groups of points, but with the current Jaccard similarity scores I doubt that would happen. I think any distance metric would be fine: Euclidean, cosine, 1 - Jaccard.
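
For reference, all three distances can be computed on the flattened masks with scipy; a small sketch with a placeholder mask array:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(75, 144)).astype(bool)  # placeholder masks

# pdist with metric="jaccard" already returns 1 - Jaccard similarity for boolean rows
jaccard = squareform(pdist(masks, metric="jaccard"))
cosine = squareform(pdist(masks.astype(float), metric="cosine"))
euclidean = squareform(pdist(masks.astype(float), metric="euclidean"))
print(jaccard.shape)  # (75, 75), usable by any clustering that accepts a distance matrix
```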

On the other hand, KNN on the importance scores is brilliant. Although it doesn't show how the final, pruned networks are similar in architecture, it can show that different languages trigger the same neurons. If a K-means clustering finds the right 15 groups, we can show that the importance scores are very similar, even if the masks are not. Or we could show a 15x15 heatmap of distances / cosine similarities across the importance scores of tasks and languages, and a 5x5 heatmap within the same task and language; we might then see a major difference between the two tables.
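
A hedged sketch of both checks; the score array and the seed ordering are placeholder assumptions, not the repo's actual data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
scores = rng.random((75, 144))  # placeholder importance scores

# Check 1: does K-means recover the 15 (language, task) groups from the 75 runs?
labels = KMeans(n_clusters=15, n_init=10, random_state=0).fit_predict(scores)

# Check 2: 15x15 cosine-similarity heatmap across (language, task) pairs,
# assuming the 5 seeds of each pair are contiguous in the array
mean_scores = scores.reshape(15, 5, 144).mean(axis=1)
sim = cosine_similarity(mean_scores)
plt.imshow(sim, cmap="viridis")
plt.colorbar(label="cosine similarity")
plt.title("Importance-score similarity across (language, task) pairs")
plt.show()
```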