desmarais-lab / IPTM

Interaction-Partitioned Topic Models (IPTM) using a Point Process Approach

GiR issue with new version of IPTM #1

Open bomin8319 opened 6 years ago

bomin8319 commented 6 years ago

I have spent a very long time trying to track down possible bugs, but failed. I was able to re-derive exactly the same sampling equation as cluster LDA, so I do not see any mathematical error. The IP assignments and topic assignments from backward sampling always have larger variance than those from forward sampling, and the difference grows as we run more outer iterations.

To test this in the simplest setting of cluster LDA (no variables other than c_d and z), I ran the 'clusterLDA.R' code in the GiR2 folder and got the same odd results. Is there anything I am totally missing?
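For concreteness, this is the kind of forward/backward comparison GiR runs, sketched in R on a toy Dirichlet-categorical model. The model, helper names, and test statistic below are illustrative stand-ins, not the actual IPTM code:

  # Toy GiR ("getting it right") harness -- illustrative only.
  set.seed(1)
  K <- 4; n <- 20; alpha <- rep(0.5, K); n_samples <- 5000

  rdirichlet <- function(a) { g <- rgamma(length(a), a); g / sum(g) }

  # Forward: exact draws from the generative process.
  forward_draw <- function() {
    theta <- rdirichlet(alpha)
    list(theta = theta, z = sample(1:K, n, replace = TRUE, prob = theta))
  }

  # Backward (successive-conditional): one Gibbs sweep per recorded sample.
  gibbs_sweep <- function(state) {
    theta <- rdirichlet(alpha + tabulate(state$z, nbins = K))              # theta | z
    list(theta = theta, z = sample(1:K, n, replace = TRUE, prob = theta)) # z | theta
  }

  stat <- function(state) sum(state$z == 1)  # any scalar test statistic

  fwd <- replicate(n_samples, stat(forward_draw()))
  state <- forward_draw()
  bwd <- numeric(n_samples)
  for (s in 1:n_samples) { state <- gibbs_sweep(state); bwd[s] <- stat(state) }

  # A correct sampler gives matching distributions (also compare QQ plots):
  c(mean(fwd), mean(bwd)); c(var(fwd), var(bwd))

If the backward statistics have systematically larger variance than the forward ones, as described above, either the transition kernel or the bookkeeping around it is off.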

bdesmarais commented 6 years ago

Could this be caused by label-switching in the backward sampling? In the forward sampling, we generate topics from topic distribution k for documents in cluster k, and there is no ambiguity about which topic distribution belongs to cluster k, right? In inference, and thus in backward sampling, however, it seems like label-switching may introduce an additional form of variation. Is it possible to artificially introduce label-switching into the forward sampling to see whether this could produce the additional variance in these GiR measures?

aschein commented 6 years ago

Perhaps try calculating the statistics based on the topic-type counts (e.g., N_kv) not the assignments (e.g., z_i). The statistics (e.g., variance) of the counts N_kv should be immune from label-switching.
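For example, a quick sketch in R of label-invariant summaries (the function name and choice of summaries are just illustrative):

  # Summaries of the topic-type count matrix N_kv (K x V) that are
  # invariant to any permutation of the topic labels.
  invariant_stats <- function(N_kv) {
    list(
      topic_sizes_sorted = sort(rowSums(N_kv)),   # topic sizes, order-free
      count_variance     = var(as.vector(N_kv))   # variance over all counts
    )
  }
  # e.g., invariant_stats(matrix(rpois(12, 5), nrow = 3))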

bomin8319 commented 6 years ago

The topic-type counts (N_kv) look fine in terms of passing GiR. However, the current inference (and thus the backward samples) converges to one interaction pattern and one (or two) topics across the entire corpus in the long run. For example, if the topic distribution for K=4 is (0.25, 0.25, 0.25, 0.25) in forward sampling, the inferred topic distribution in backward sampling is (0.8, 0.2, 0, 0). I think this is still problematic if it ends up with very few topics remaining in the real data analysis.

I also don't quite understand how backward sampling introduces a label-switching issue when we start the inference from initial values set to the true topic distribution.

hannawallach commented 6 years ago

It doesn't sound like label switching is the issue. Bomin, can you point me to the generative process and the inference equations?


bomin8319 commented 6 years ago

You can look at paper/icml2018_style/IPTM_ICML2.pdf. The generative process is in Section 2.2, and the inference equation is Equation (15) on page 4. Although the draft is currently written for the minimal path assumption, I am currently failing for both the minimal and maximal path assumptions. (Since GiR uses the same number of words across all documents, maximal should be fine and it should pass.)

hannawallach commented 6 years ago

I will have to look tomorrow, as I only have phone access today and can't pull from the repo. Basically, I'm wondering whether you are doing the same integrations/approximations in the generative process as in inference. I'd try that.


bomin8319 commented 6 years ago

Generating from the collapsed LDA equations definitely helped, but there is a remaining issue.

1) A two-level hierarchy with a uniform base measure for the interaction-pattern-specific topic distributions (i.e., m_c ~ Dir(alpha1, u) and theta_d ~ Dir(alpha, m_{c_d})) now passes GiR under both the maximal and minimal path assumptions; it did not pass with the non-collapsed generative process.

2) However, when I directly follow cluster LDA and use a three-level hierarchy with an additional layer representing the corpus-wide topic distribution (i.e., m ~ Dir(alpha0, u), m_c ~ Dir(alpha1, m), and theta_d ~ Dir(alpha, m_{c_d})), it still fails GiR: the backward samples concentrate on a few dominant topics. I used Equation (15) in /paper/icml2018_style/IPTM_ICML2.pdf for both the generative process and inference, and nothing (neither the equation nor the code) seems to be wrong. Maybe I should not generate directly from the fully collapsed equation when I have this additional level of hierarchy?

A related question: isn't the overall corpus-wide topic distribution (m in 2) above) already controlled by the distribution of interaction-pattern assignments (or clusters) across the documents? If that is the case, it may not be necessary to use three levels instead of two; in other words, m will be the weighted average of m_c across c = 1, ..., C, so assuming a uniform base for m_c would be fine...?

hannawallach commented 6 years ago

I don't understand your related question.


hannawallach commented 6 years ago

Can you attach the PDF? I'm on my phone today and can't pull from the repo.


hannawallach commented 6 years ago

Lastly, to me this sounds like there's a subtle bug somewhere....


bomin8319 commented 6 years ago

IPTM_ICML2.pdf

Just to be clear, here is how I generate z's.

hannawallach commented 6 years ago

Just for maximal, right? This incrementing procedure isn't right for minimal.

(Bomin's procedure, quoted from the previous comment:)

  # initialize N_dk = 0, N_kc = 0, and N_k = 0
  for (d in 1:D) {
    for (n in 1:N_d) {
      z_dn ~ Equation (15)    # sample a topic assignment
      N_{d, z_dn} += 1
      N_{z_dn, c_d} += 1
      N_{z_dn} += 1
    }
  }

bomin8319 commented 6 years ago

Yes. Just for the maximal!

hannawallach commented 6 years ago

And even maximal doesn't pass?


bomin8319 commented 6 years ago

Yes, both maximal and minimal fail. After generating the z's as above, I infer them as follows:

  for (iter in 1:N_iter) {
    for (d in 1:D) {
      for (n in 1:N_d) {
        N_{d, z_dn} -= 1    # remove the current token's counts
        N_{z_dn, c_d} -= 1
        N_{z_dn} -= 1
        z_dn ~ Equation (15)    # new topic assignment
        N_{d, z_dn} += 1    # add counts back under the new assignment
        N_{z_dn, c_d} += 1
        N_{z_dn} += 1
      }
    }
  }

Then I compare N_k from forward and backward using GiR plots, which look like the attached. (The document-IP and token-word distribution plots "pass" completely when I shut down inference for the z's.) GiRplot.pdf

hannawallach commented 6 years ago

I just don't get how this can work with the two-level hierarchy but not the three-level one. I think there must be a bug. When you run the two-level hierarchy, are you using different code than when you run the three-level hierarchy?


hannawallach commented 6 years ago

One idea: set alpha0 in your three-level code so big that the model effectively bypasses the corpus-level counts. Does it pass?


bomin8319 commented 6 years ago

Yes, I use the same code for the two-level and three-level hierarchies. The only difference is that I replace the fraction (N_k + alpha0/K)/(N + alpha0) by 1/K.

I just tried setting alpha0 bigger, and it apparently gets closer to a "pass". With alpha0 = 1000, it passed.
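For concreteness, here is the update I am computing, sketched in R; this assumes Equation (15) has the standard nested pseudo-count form (as in "Rethinking LDA"), and it is a sketch rather than the repo code:

  # Collapsed three-level topic update (word-likelihood term omitted,
  # as in the z-only test). N_dk, N_kc, N_k are K-vectors of topic
  # counts at the document, cluster, and corpus levels, with the
  # current token's counts already decremented.
  topic_probs <- function(N_dk, N_kc, N_k, N_d, N_c, N_total,
                          alpha, alpha1, alpha0) {
    K <- length(N_k)
    m_hat  <- (N_k + alpha0 / K) / (N_total + alpha0)   # corpus level
    mc_hat <- (N_kc + alpha1 * m_hat) / (N_c + alpha1)  # cluster level
    (N_dk + alpha * mc_hat) / (N_d + alpha)             # document level
  }
  # Two-level version: replace m_hat by 1/K.
  # z_dn <- sample(1:K, 1, prob = topic_probs(...))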

hannawallach commented 6 years ago

Okay. Are you sampling the alphas? It sounds like you are not?


bomin8319 commented 6 years ago

No. All alphas are treated as fixed hyperparameters. Should we embed sampling steps for the alphas, then?

hannawallach commented 6 years ago

Here's what it sounds like to me: the counts at the top level are very big. (Print them to get a sense of the magnitude.) If you have small alphas at that level, you're putting a massive amount of weight on the top-level counts, and so I imagine you're getting stuck in a shitty rich-get-richer scenario and not getting out of it.


hannawallach commented 6 years ago

Probably. (What values are you setting them to?)


bomin8319 commented 6 years ago

Totally agree with you on "getting stuck in a shitty rich-get-richer scenario". The reason the two-level version worked fine is that it always lowers the richer probabilities and raises the poorer ones. Similarly, when we use a huge alpha0, it gets closer to the two-level version, so it passed.

So far I have varied the alphas from 5 to 50 in different combinations (I thought this should pass no matter what the alphas are), but now I realize that was not big enough.

hannawallach commented 6 years ago

I'd sample the alphas.


hannawallach commented 6 years ago

I'm also still not convinced that there isn't a bug somewhere.


bomin8319 commented 6 years ago

It seems like alpha1 (the contribution of the counts at the middle level) also needed to be bigger. Both maximal and minimal pass with (alpha, alpha1, alpha0) = (5, 50, 100), so we may need to sample the alphas to estimate how much weight each level should get. Before I work on that, I will ask Bruce to double-check my R code in the hope of finding any bugs.

hannawallach commented 6 years ago

(I started this before your most recent email.)

Replying quickly from my computer.

  1. I'm not convinced that there's not a bug.

  2. You can think of each alpha as a pseudocount. If the count that you're adding an alpha to is N_d -- i.e., a document length -- then this tells you something about the value of the alpha that you want. In other words, you likely want something that is not substantially larger or substantially smaller than N_d. Ditto if the count that you're adding an alpha to is N_c -- here, you want something that's roughly comparable to the number of tokens associated with a cluster. And ditto for N_. at the top level. Since N_. is the total number of tokens in the corpus, it will need to be muuuuuuuch larger than the kind of alpha values that are suitable for the N_d level (which is the level we're usually working at with LDA).
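To make those scales concrete (made-up numbers, not from your corpus):

  # An alpha acts as a pseudocount against the count at its level.
  N_d   <- 100      # typical document length
  N_tot <- 100000   # total tokens in the corpus
  50 / N_d          # alpha = 50 is half a document's worth: a strong prior
  50 / N_tot        # the same 50 against the corpus total is negligible,
                    # so the top level stays pinned to its raw counts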


hannawallach commented 6 years ago
  3. I'd definitely sample the alphas. It's even harder to figure out good alpha values for the minimal path assumption.


hannawallach commented 6 years ago
  4. If there isn't a bug, then the issue is mixing, caused by shitty alpha values that make it take waaaay too long to mix.


bomin8319 commented 6 years ago

Thanks for the suggestions! Your point 2 definitely explains why it worked out with (alpha, alpha1, alpha0) = (5, 50, 100). I will check for bugs first and then (whether or not there is a bug) work on sampling the alphas, since we are going to use the minimal path assumption anyway.

aschein commented 6 years ago

Have you implemented Schein testing? This test will fail if there is a software bug but not if a correctly implemented sampler is failing to mix. It also takes many fewer samples to detect a bug.



hannawallach commented 6 years ago

+1


bomin8319 commented 6 years ago

Yes, what I have been working on so far is actually Schein testing (to avoid mixing issues). The Schein test passes with a small number of outer iterations (thus only a few steps away from the true values), but it fails as I increase the number of outer iterations.

hannawallach commented 6 years ago

Okay. Sounds like there is a bug somewhere. When you clamp various parts of the model, which bits pass/fail?


bomin8319 commented 6 years ago

It was only the topic distribution {N_k}_{k=1}^K that failed the Schein test when I clamped the rest of the IPTM. So I wrote separate code for cluster LDA (without the other IPTM variables) that iterates the generative process and inference for the z's only, assuming the cluster assignments are known. What I have illustrated so far (the rich-get-richer issue) was based on this cluster-LDA-only version of the Schein test. Now I am trying to find a bug in this simpler version, which may also fix the same issue in the IPTM.

bomin8319 commented 6 years ago

Sorry for being late, but I have attached the derivation of cluster LDA. Section 2 results in the IPTM's current sampling equation for z_dn (I followed Hanna's paper, "Rethinking LDA: Why Priors Matter"), but I put a NOTE on one part that I am not sure about. Section 3 is an alternative approach that directly integrates out the base measures, which ended up looking similar to the hierarchical Dirichlet process, although I failed to finish the derivation... I will try to work on this once more before our meeting, and we can go over it together tomorrow. clusterLDA.pdf