Thanks for the interesting paper and great repository. I have a few clarification questions regarding the method and the code that I was wondering if you could help me with. Thanks in advance!
Section 4.2 of the paper (arXiv version) states:
"We choose a Gaussian distribution q(\eta_t | \eta_{1:t-1}, \tilde{w}_t), whose mean and covariance are given by the output of the LSTM."
However, in this repository, the LSTM takes only \tilde{w}_t as input, not \eta_{1:t-1}
(https://github.com/adjidieng/DETM/blob/master/detm.py#L130).
Rather, \eta_{t-1} is only used AFTER the LSTM (https://github.com/adjidieng/DETM/blob/master/detm.py#L146), through concatenation with the LSTM output. In this way, the LSTM can only capture the temporal dependency of \tilde{w}, not the temporal dependency of \eta. I have probably missed something, but I wonder if you could help me understand the intuition behind this. Thank you.
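To make sure I am reading the code correctly, here is a minimal, self-contained sketch of the structure I am describing. The module names and dimensions are hypothetical stand-ins, not the repository's actual code; I only want to check that the dataflow matches the linked lines.

```python
import torch
import torch.nn as nn

# Toy dimensions, for illustration only.
num_times, num_topics, vocab_size, hidden_size = 5, 10, 200, 64

# Hypothetical modules standing in for the ones in detm.py.
q_eta_map = nn.Linear(vocab_size, hidden_size)            # embeds \tilde{w}_t
q_eta_lstm = nn.LSTM(hidden_size, hidden_size)            # recurs over \tilde{w}_{1:T} only
mu_q_eta = nn.Linear(hidden_size + num_topics, num_topics)
logsigma_q_eta = nn.Linear(hidden_size + num_topics, num_topics)

rnn_inp = torch.rand(num_times, vocab_size)               # \tilde{w}_{1:T}

# (1) The LSTM sees only \tilde{w}_{1:T}; \eta never enters the recurrence.
output, _ = q_eta_lstm(q_eta_map(rnn_inp).unsqueeze(1))
output = output.squeeze(1)                                # (num_times, hidden_size)

# (2) \eta_{t-1} is concatenated with the LSTM output AFTER the recurrence,
#     just before the two linear layers that produce the Gaussian parameters.
etas = torch.zeros(num_times, num_topics)
prev_eta = torch.zeros(num_topics)                        # zeros used in place of \eta_0 at t = 1
for t in range(num_times):
    inp_t = torch.cat([output[t], prev_eta], dim=0)
    mu_t = mu_q_eta(inp_t)
    logsigma_t = logsigma_q_eta(inp_t)
    etas[t] = mu_t + torch.exp(0.5 * logsigma_t) * torch.randn(num_topics)
    prev_eta = etas[t]
```

If this reading is right, the only path by which \eta_{t-1} influences \eta_t in the variational posterior is this post-LSTM concatenation.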
In the D-LDA paper (Dynamic Topic Models, Blei & Lafferty 2006), the method is able to perform "future" prediction (Fig. 5 of that paper). With DETM, on the other hand, I wonder whether the dependency on \tilde{w}_t in q(\eta_t | \eta_{1:t-1}, \tilde{w}_t) prevents DETM from doing future prediction, since it uses words from the future time step (\tilde{w}_t). Thank you!
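To make the question concrete: since \tilde{w}_{T+1} is unavailable for a future time step, would the intended way to forecast be to drop the variational posterior and roll the prior transition \eta_t \sim N(\eta_{t-1}, \delta^2 I) forward from the last inferred \eta_T? Something along these lines (the helper below is hypothetical and not from the repository; the delta value is a placeholder):

```python
import torch

def forecast_eta(last_eta: torch.Tensor, delta: float, num_future: int, num_samples: int = 100):
    """Hypothetical helper: roll the prior transition eta_t ~ N(eta_{t-1}, delta^2 I)
    forward from the last inferred eta_T, since \tilde{w} is unavailable for future steps."""
    num_topics = last_eta.shape[-1]
    samples = last_eta.expand(num_samples, num_topics).clone()
    forecasts = []
    for _ in range(num_future):
        samples = samples + delta * torch.randn_like(samples)  # one prior transition step
        forecasts.append(samples.mean(0))                      # Monte Carlo mean of eta_{T+k}
    return torch.stack(forecasts)                              # (num_future, num_topics)

# Usage: forecast 3 future steps from the last inferred eta_T (placeholder tensor here).
eta_T = torch.zeros(10)
future_etas = forecast_eta(eta_T, delta=0.005, num_future=3)
```

Or is there another mechanism for future prediction that I am missing?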