Preprocessing: why are sequences log-transformed?

around1991 commented 5 years ago

Hey, I'm just having a look through the code, and I was wondering why preprocess.py transforms documents into their log counts before serializing them?

johnglover commented 5 years ago

@around1991: In this case it's because this preprocessing is taken from the original DocNADE paper, where they note:

we followed the same evaluation as in Salakhutdinov and Hinton [2]: word counts were replaced by log(1 + n) rounded to the closest integer

So they took it from the Replicated Softmax paper, where the authors say:

For all datasets, each word count w was replaced by log(1 + w), rounded to the nearest integer, which slightly improved retrieval performance of both models.

So the short version is "because it worked" :). I guess the intuition is that small differences in counts were found to be less significant than larger differences, but they don't elaborate anywhere as far as I'm aware (or remember), so this is just speculation. I believe that I tried it without this too and confirmed that log counts worked better, but it was a couple of years ago now so I'm not sure.
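For anyone who wants to see what the quoted transform does to actual counts, here is a minimal sketch. It is not the code from preprocess.py; the function name `log_transform_counts` and the token list are made up for illustration, and it only assumes the rule quoted above: replace each count n with log(1 + n), rounded to the nearest integer.

```python
import math
from collections import Counter

def log_transform_counts(tokens):
    """Hypothetical helper: map each word to round(log(1 + n)).

    `tokens` is a list of words from one document; n is the raw count
    of a word in that document.
    """
    raw_counts = Counter(tokens)
    # math.log1p(n) computes log(1 + n) using the natural logarithm.
    return {word: round(math.log1p(n)) for word, n in raw_counts.items()}

doc = ["topic"] * 12 + ["model"] * 4 + ["paper"]
print(log_transform_counts(doc))
# {'topic': 3, 'model': 2, 'paper': 1}
```

Counts of 1, 2 and 3 all map to 1, while a count of 100 maps to 5, so frequent words are compressed much more strongly than rare ones.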

Hope that helps.

around1991 commented 5 years ago

Hey John, thanks for the reply. I thought about it a bit too, and another thing it does is make the model a bit quicker to run, because it reduces word counts. However, I guess if I care about perplexity, then I shouldn't do the log transform?
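A rough sketch of the size reduction, assuming the per-document cost grows with the number of word occurrences the model has to process. The helper `sequence_lengths` and the example counts are hypothetical, not taken from the repository.

```python
import math

def sequence_lengths(raw_counts):
    """Return (raw length, log-transformed length) for one document.

    `raw_counts` maps each word to its raw count n; the transformed
    count is round(log(1 + n)), as in the quoted papers.
    """
    raw_len = sum(raw_counts.values())
    log_len = sum(round(math.log1p(n)) for n in raw_counts.values())
    return raw_len, log_len

example = {"the": 100, "model": 12, "log": 3}
raw_len, log_len = sequence_lengths(example)
print(raw_len, log_len)  # 115 9
```

In this made-up document the sequence shrinks from 115 word occurrences to 9, almost entirely because of the one very frequent word.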

johnglover commented 5 years ago

That's a good question. Again from the DocNADE paper, the rest of that passage suggests that they do use the log counts when computing perplexity:

we followed the same evaluation as in Salakhutdinov and Hinton [2]: word counts were replaced by log(1 + n) rounded to the closest integer and a subset of 50 test documents (2193 words for 20 Newsgroups, 4716 words for RCV1-v2) were used to estimate the test perplexity per word
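To make the dependence on word counts concrete, here is a sketch of a per-word perplexity calculation. It uses a generic definition (exponentiated negative log-likelihood divided by the total number of words); the exact averaging used in the paper may differ, and the function name and inputs are assumptions for illustration only.

```python
import math

def perplexity_per_word(doc_log_probs, doc_lengths):
    """Generic per-word perplexity over a set of test documents.

    doc_log_probs: natural-log probability the model assigns to each document.
    doc_lengths:   number of word occurrences in each document, counted
                   after whatever preprocessing was applied.
    """
    total_log_prob = sum(doc_log_probs)
    total_words = sum(doc_lengths)
    return math.exp(-total_log_prob / total_words)

# A model that spreads probability uniformly over a 4-word vocabulary
# assigns log(1/4) to every word, so a 3-word document has log p = 3 * log(1/4).
print(perplexity_per_word([3 * math.log(0.25)], [3]))
```

Because the word total in the denominator is counted after preprocessing, a perplexity measured on log-transformed counts is not directly comparable to one measured on raw counts.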

AYLIEN / docnade, issue #7