adjidieng / ETM

Topic Modeling in Embedding Spaces
MIT License
538 stars 126 forks

a bug in test dataset splitting #33

Open nobrowning opened 3 years ago

nobrowning commented 3 years ago

I noticed that there is a bug in the preprocessing code for 20ng (scripts/data_20ng.py): https://github.com/adjidieng/ETM/blob/52b090b5b2fd6fcecc6d0b2c55d03a2d893b729d/scripts/data_20ng.py#L88

The test split is missing the idx_permute index conversion.

Littleele commented 1 year ago

In line 91, idx_permute = np.random.permutation(num_docs_tr).astype(int), so idx_permute has size num_docs_tr and covers only the training documents. The test set therefore does not need the index conversion.
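
To make this concrete, here is a minimal sketch of the indexing, not the actual script. It assumes the combined document list holds the training documents first and the test documents after them; the toy data and the names va_size, tr_size, ts_size and rng are made up for illustration. It uses a seeded generator instead of np.random.permutation only so the run is reproducible.

```python
import numpy as np

# Hypothetical toy data: 8 training documents followed by 3 test documents.
init_docs_tr = [f"train doc {i}" for i in range(8)]
init_docs_ts = [f"test doc {i}" for i in range(3)]
init_docs = init_docs_tr + init_docs_ts  # assumed layout: train first, then test

num_docs_tr = len(init_docs_tr)
va_size = 2                       # validation documents held out from train
tr_size = num_docs_tr - va_size
ts_size = len(init_docs_ts)

# The permutation has exactly num_docs_tr entries, so it can only
# address positions 0 .. num_docs_tr - 1 (the training documents).
rng = np.random.default_rng(0)
idx_permute = rng.permutation(num_docs_tr).astype(int)

# Train and validation go through the permutation.
docs_tr = [init_docs[idx_permute[i]] for i in range(tr_size)]
docs_va = [init_docs[idx_permute[i + tr_size]] for i in range(va_size)]

# Test documents sit after the training block and are reached by a plain
# offset; idx_permute[i + num_docs_tr] would be out of range.
docs_ts = [init_docs[i + num_docs_tr] for i in range(ts_size)]
```

Under these assumptions, train and validation together cover every training document exactly once, and the test split is the untouched test block, so no permutation lookup is needed there.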