Ardavans / sHDP

Nonparametric Topic Modeling with Word Vectors
MIT License

Unable to run using a custom dataset #6

Closed austinv11 closed 7 years ago

austinv11 commented 7 years ago

I am attempting to use this tool on a custom dataset. I created the dataset using #2 as a reference for the format, but it does not run and I'm not sure why. The traceback is:

/Users/a.varela/Downloads/sHDP-master/core/util/stats.py:152: UserWarning: Not sure about sampling vMF, use with caution!!!! 
  warn('Not sure about sampling vMF, use with caution!!!! ')
Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 132, in HDPRunner
    HDP.meanfield_sgdstep(data, np.array(data).shape[0] / np.float(training_size), rho_t)
  File "/Users/a.varela/Downloads/sHDP-master/HDP/models.py", line 138, in meanfield_sgdstep
    s.meanfieldupdate()
  File "/Users/a.varela/Downloads/sHDP-master/HDP/internals/hmm_states.py", line 213, in meanfieldupdate
    self.mf_trans_matrix[self.doc_num,:],self.mf_aBl)
  File "/Users/a.varela/Downloads/sHDP-master/HDP/internals/hmm_states.py", line 189, in mf_aBl
    aBl[:,idx] = o.expected_log_likelihood([i[0] for i in self.data]).ravel()
  File "/Users/a.varela/Downloads/sHDP-master/core/core_distributions.py", line 566, in expected_log_likelihood
    return self._Expected_log_partition + self._Expected_kappa*np.array(x).dot(self._Expected_mu)
ValueError: shapes (75,100) and (50,) not aligned: 100 (dim 1) != 50 (dim 0)

Ardavans commented 7 years ago

Hi there,

It seems that you are generating the dataset with word vectors of size 100 while num_dim is still set to 50. Can you try regenerating the dataset with word vectors of size 50?
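A quick way to confirm this kind of mismatch before running the model is to compare the vector lengths in the word-vector dictionary with the configured dimension. This is only a sketch: `embedding_dims` is a hypothetical helper, the toy `wordvec` dictionary stands in for the real pickle, and `num_dim` here is a plain variable rather than the setting inside the repository.

```python
def embedding_dims(wordvec):
    """Return the set of vector lengths found in a word -> vector mapping."""
    return {len(vec) for vec in wordvec.values()}


# Toy stand-in for the loaded word-vector dictionary (hypothetical data).
wordvec = {"cat": [0.1] * 100, "dog": [0.2] * 100}
num_dim = 50  # the dimension the model is configured with (assumed name)

dims = embedding_dims(wordvec)
if dims != {num_dim}:
    print("Dimension mismatch: vectors have lengths %s, num_dim is %d"
          % (sorted(dims), num_dim))
```

If the set contains more than one length, the dictionary itself is inconsistent and should be fixed before changing num_dim.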

austinv11 commented 7 years ago

Hello, I tried changing num_dim to 100 and I still get this error:

Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 132, in HDPRunner
    HDP.meanfield_sgdstep(data, np.array(data).shape[0] / np.float(training_size), rho_t)
  File "/Users/a.varela/Downloads/sHDP-master/HDP/models.py", line 138, in meanfield_sgdstep
    s.meanfieldupdate()
  File "/Users/a.varela/Downloads/sHDP-master/HDP/internals/hmm_states.py", line 213, in meanfieldupdate
    self.mf_trans_matrix[self.doc_num,:],self.mf_aBl)
  File "/Users/a.varela/Downloads/sHDP-master/HDP/internals/hmm_states.py", line 189, in mf_aBl
    aBl[:,idx] = o.expected_log_likelihood([i[0] for i in self.data]).ravel()
  File "/Users/a.varela/Downloads/sHDP-master/core/core_distributions.py", line 566, in expected_log_likelihood
    return self._Expected_log_partition + self._Expected_kappa*np.array(x).dot(self._Expected_mu)
ValueError: shapes (0,) and (100,) not aligned: 0 (dim 0) != 100 (dim 0)

I also tried changing the vectors to size 50 and this happened:

.
.
.
Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 132, in HDPRunner
    HDP.meanfield_sgdstep(data, np.array(data).shape[0] / np.float(training_size), rho_t)
  File "/Users/a.varela/Downloads/sHDP-master/HDP/models.py", line 138, in meanfield_sgdstep
    s.meanfieldupdate()
  File "/Users/a.varela/Downloads/sHDP-master/HDP/internals/hmm_states.py", line 213, in meanfieldupdate
    self.mf_trans_matrix[self.doc_num,:],self.mf_aBl)
  File "/Users/a.varela/Downloads/sHDP-master/HDP/internals/hmm_states.py", line 189, in mf_aBl
    aBl[:,idx] = o.expected_log_likelihood([i[0] for i in self.data]).ravel()
  File "/Users/a.varela/Downloads/sHDP-master/core/core_distributions.py", line 566, in expected_log_likelihood
    return self._Expected_log_partition + self._Expected_kappa*np.array(x).dot(self._Expected_mu)
ValueError: shapes (0,) and (50,) not aligned: 0 (dim 0) != 50 (dim 0)

Ardavans commented 7 years ago

Can you make sure that you are loading the dataset correctly? Shape 0 for x means that the data is not properly loaded or x is empty.
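The `shapes (0,)` part of the error can be reproduced in isolation with NumPy. The sketch below assumes that a document contributes a list of word vectors and that an empty document therefore becomes an empty list; `expected_mu`, `full_doc`, and `empty_doc` are illustrative names, not the variables used in the repository.

```python
import numpy as np

expected_mu = np.ones(50)  # stand-in for a 50-dimensional mean direction

# A document with two words, each a 50-dimensional vector: the product works.
full_doc = [[0.0] * 50, [1.0] * 50]
scores = np.array(full_doc).dot(expected_mu)
print(scores.shape)

# A document with no words: np.array([]) has shape (0,), so the product fails.
empty_doc = []
raised = False
try:
    np.array(empty_doc).dot(expected_mu)
except ValueError as err:
    raised = True
    print(err)
```

Because the failure depends only on the empty input, it appears for any value of num_dim, which matches the two tracebacks above.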

austinv11 commented 7 years ago

Is there a way I can verify that? The program does load and prints:

{'Nmax': 40, 'gamma': 2.0, 'tau': 0.8, 'mbsize': 10.0, 'kappa_sgd': 0.6, 'dataset': 'test', 'infSeed': 1, 'alpha': 1.0}
Loading the glove dict file....
Main runner ...
num_docs: 5000

I also formatted the data as follows. The texts pickle contains a list in which each element is a list representing one document, and each document holds tuples of (word, occurrences). The wordvec pickle contains a dictionary whose keys are words of type str and whose values are lists of str holding the vector components. The only difference is that my vectors aren't normalized. Could that be the issue?
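One way to verify data in this layout is to scan for documents that would end up with no vectors at all, either because they are empty or because none of their words appear in the word-vector dictionary. This is a sketch under the format described above; `find_problem_docs` is a hypothetical helper and the inline data is a toy stand-in for the unpickled objects.

```python
def find_problem_docs(texts, wordvec):
    """Return indices of documents that are empty or have no word in wordvec."""
    problems = []
    for idx, doc in enumerate(texts):
        # Keep only the words that actually have a vector.
        known = [word for word, _count in doc if word in wordvec]
        if not known:
            problems.append(idx)
    return problems


# Toy data in the described layout: documents are lists of (word, occurrences).
texts = [[("cat", 2), ("dog", 1)], [], [("zzz", 3)]]
wordvec = {"cat": ["0.1", "0.2"], "dog": ["0.3", "0.4"]}

print(find_problem_docs(texts, wordvec))  # [1, 2]
```

Document 1 is flagged because it is empty, and document 2 because its only word has no vector.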

Ardavans commented 7 years ago

I'd recommend debugging it with an IDE like PyCharm and verifying whether you have the correct format for your dataset. First run the code with the original dataset and check the format of the loaded data at the line where you see the error. Next, try your own dataset and see how it differs from the original dataset at the same line of code.

austinv11 commented 7 years ago

Turns out that I had empty document vectors and filtering them out fixed the issue.
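For anyone hitting the same error, the filtering step might look roughly like the following. This is a sketch and not the exact code used here: it assumes the texts object is a list of documents, each a list of (word, occurrences) tuples, and `drop_empty_docs` is a hypothetical helper name.

```python
def drop_empty_docs(texts):
    """Return a new list containing only the documents that have at least one entry."""
    return [doc for doc in texts if len(doc) > 0]


# Toy data: the second and fourth documents are empty.
texts = [[("cat", 2)], [], [("dog", 1), ("cat", 1)], []]

filtered = drop_empty_docs(texts)
print("kept %d of %d documents" % (len(filtered), len(texts)))
```

The filtering should happen before the data is pickled or passed to the runner, so that the document count the model sees matches the filtered list.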