datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

email domain analysis: list top/bottom working groups by PCA dimension #438

Open sbenthall opened 3 years ago

sbenthall commented 3 years ago

Illustrate the principal components with top/bottom working group tags.

Christovis commented 3 years ago

Could you explain this a bit more? I don't know what is meant by top/bottom and tags. Is what you describe here partially covered by the Multi-dimensional scaling section of ./bigbang/examples/organizations/Full Archive Study.ipynb?

Thx :-)

sbenthall commented 3 years ago

Each working group can be seen as a document. For each email sent to the working group, consider the domain of the sender's email address as a word.
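To make the "document" framing concrete, here is a minimal sketch with made-up data. The `emails` list and the `domain_counts` helper are hypothetical names for illustration, not part of BigBang:

```python
from collections import Counter

# Hypothetical toy data: (working group, sender address) pairs.
emails = [
    ("wg-a", "alice@example.com"),
    ("wg-a", "bob@example.org"),
    ("wg-a", "carol@example.com"),
    ("wg-b", "dave@example.org"),
]

def domain_counts(emails):
    """Map each working group to a Counter of sender domains ("words")."""
    counts = {}
    for group, address in emails:
        # Take everything after the last "@" as the domain.
        domain = address.rsplit("@", 1)[-1].lower()
        counts.setdefault(group, Counter())[domain] += 1
    return counts

counts = domain_counts(emails)
```

Each `Counter` plays the role of a bag-of-words vector for one working group.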

PCA on the set of documents will produce a set of dimensions expressed as weights on each of the email domains.

In the Multi-dimensional scaling section of that notebook, each dimension is summarized by the email domains with the highest and lowest weights.
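Roughly, that kind of summary could look like the sketch below. It uses plain NumPy and invented counts; `pca_components` and `extreme_domains` are hypothetical helpers, and the notebook itself may compute things differently:

```python
import numpy as np

# Hypothetical counts: rows are working groups, columns are email domains.
domains = ["example.com", "example.org", "example.net", "example.edu"]
X = np.array([
    [10.0, 0.0, 2.0, 1.0],
    [8.0, 1.0, 3.0, 0.0],
    [0.0, 9.0, 1.0, 4.0],
    [1.0, 7.0, 0.0, 5.0],
])

def pca_components(X, n_components):
    """Return the leading principal components (as rows) via SVD."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]

def extreme_domains(component, domains, k=2):
    """Return the k lowest- and k highest-weighted domains of a component."""
    order = np.argsort(component)  # ascending by weight
    lowest = [domains[i] for i in order[:k]]
    highest = [domains[i] for i in order[::-1][:k]]
    return lowest, highest

components = pca_components(X, n_components=2)
lowest, highest = extreme_domains(components[0], domains)
```

The sign of a principal component is arbitrary, so which end counts as "highest" can flip between runs or libraries; only the grouping of domains at opposite ends is meaningful.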

This issue asks for an additional, alternative way of summarizing the principal components.

Given a principal component and a working group as a document, the dot product of the principal component weights and the working group's "word" counts gives that working group a scalar score.

So it is possible, for each component, to show the top five/bottom five working groups according to that score.
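As a rough sketch of that scoring and ranking step, assuming a count matrix and one component are already available (all names and numbers below are invented for illustration):

```python
import numpy as np

# Hypothetical inputs: counts[i, j] is the number of emails sent to working
# group i from domain j; component holds one principal component's weights.
groups = ["wg-a", "wg-b", "wg-c", "wg-d"]
counts = np.array([
    [10, 0, 2],
    [1, 7, 0],
    [4, 4, 4],
    [0, 1, 9],
])
component = np.array([0.8, -0.6, 0.0])

def rank_groups(counts, component, groups, k=5):
    """Score each group by dot product; return (top k, bottom k) names."""
    scores = counts @ component
    order = np.argsort(scores)  # ascending by score
    top = [groups[i] for i in order[::-1][:k]]
    bottom = [groups[i] for i in order[:k]]
    return top, bottom

top, bottom = rank_groups(counts, component, groups, k=2)
```

Scoring the raw counts rather than the mean-centered counts shifts every score for a given component by the same constant, so the ranking of working groups is unchanged.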

Is that a clear explanation?