TutteInstitute / vectorizers

Vectorizers for a range of different data types

[QUESTION] InformationWeightTransform #118

Open cakiki opened 1 year ago

cakiki commented 1 year ago

Hello everyone! I would be very grateful if you could elaborate a bit on what's going on with the InformationWeightTransform reweighting.

lmcinnes commented 1 year ago

It depends on whether you use the supervised mode or not. The unsupervised mode is pretty easy to explain.

We want to weight each column by how "informative" it is. "Informative" here means telling us something distinctive about the documents, so we really want to know how much information about the documents the column contains. Since a column is, in effect, a distribution of how the word occurs over the documents, we are essentially interested in the information gain of that distribution over the background expected distribution. Helpfully, that is precisely the KL divergence between the two distributions. So we compute a weight that is the KL divergence between the column's distribution and the baseline distribution (the mean distribution over all the columns, which is the distribution of document lengths). That's the basic information weight. This has some advantages over TF-IDF:

- it accounts for document lengths -- we expect long documents to contain lots of words, so a word showing up in a long document is less informative;
- it cares about how a term is distributed over the documents it occurs in, rather than just whether it appears;
- it is more theoretically grounded than IDF, which has a somewhat arbitrary formula.
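(To make that concrete, here is a minimal NumPy/SciPy sketch of the unsupervised column weighting as described above -- just an illustration of the idea, not the library's actual implementation; the function name and the `eps` smoothing are made up for the example.)

```python
import numpy as np
import scipy.sparse as sp

def unsupervised_information_weights(X, eps=1e-12):
    """Weight each column of a (documents x terms) count matrix by the KL
    divergence between its distribution over documents and the baseline
    (document-length) distribution."""
    X = sp.csc_matrix(X, dtype=np.float64)

    # Baseline: how mass is spread over documents overall, i.e. the
    # normalized document lengths.
    doc_lengths = np.asarray(X.sum(axis=1)).ravel()
    baseline = doc_lengths / doc_lengths.sum()

    weights = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        col = X.getcol(j)
        total = col.data.sum()
        if total == 0:
            continue
        p = col.data / total          # term's distribution over documents
        q = baseline[col.indices]     # baseline restricted to its support
        # KL(p || baseline); zero entries of p contribute nothing
        weights[j] = np.sum(p * np.log(p / (q + eps)))
    return weights
```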

We added some options to tweak this basic idea a little. When counts are very sparse it can be useful to include a background prior (derived from the baseline distribution). It can also be helpful to dampen or exaggerate the results, so (if I recall correctly) we can raise the resulting information weights to some power after standardising them, to bend them appropriately. The same game can be played in the supervised mode, where we have class labels -- there we care about the information in the distribution over class labels instead of over documents. In practice we found it can be useful to keep some of the weighting from the distribution over documents, so for the supervised mode we compute both and combine them under a weighting scheme.
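(Again purely as a sketch of the supervised idea described above -- the same KL weight computed against class labels instead of documents, plus a naive way to blend the two. The function name and the 50/50 mix at the end are illustrative assumptions, not the library's actual weighting scheme.)

```python
import numpy as np
import scipy.sparse as sp

def supervised_information_weights(X, y, eps=1e-12):
    """Weight each column by the KL divergence between the term's
    distribution over class labels and the overall label distribution."""
    X = sp.csr_matrix(X, dtype=np.float64)
    y = np.asarray(y)
    classes = np.unique(y)

    # class_counts[c, j] = total count of term j in documents labelled classes[c]
    class_counts = np.vstack(
        [np.asarray(X[y == c].sum(axis=0)).ravel() for c in classes]
    )

    # Baseline: overall distribution of mass over the class labels.
    baseline = class_counts.sum(axis=1)
    baseline = baseline / baseline.sum()

    weights = np.zeros(X.shape[1])
    for j in range(class_counts.shape[1]):
        col = class_counts[:, j]
        total = col.sum()
        if total == 0:
            continue
        p = col / total
        nz = p > 0
        # KL(p || baseline) over the labels the term actually occurs with
        weights[j] = np.sum(p[nz] * np.log(p[nz] / (baseline[nz] + eps)))
    return weights

# A naive combination of the two weightings (purely illustrative):
# final_weights = 0.5 * supervised_information_weights(X, y) \
#               + 0.5 * unsupervised_information_weights(X)
```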

I hope that helps somewhat.