adjidieng / ETM

Topic Modeling in Embedding Spaces
MIT License

Confused about the data loader function #40

Open A11en0 opened 2 years ago

A11en0 commented 2 years ago

Hi, thanks for your great work. I'm confused about the data loader function. Details below:

parser.add_argument('--data_path', type=str, default='data/20ng', help='directory containing data')
  1. I can't find any code that reads the '--data_path' parameter, so why does it need to be passed in the following command?
python main.py --mode train --dataset 20ng --data_path data/20ng --num_topics 50 --train_embeddings 1 --epochs 1000
  2. What do the two parameters doc_terms_file_name and terms_filename do? I don't understand them, and I can't find 'tf_idf_doc_terms_matrix_time_window_1' anywhere (for example, in the provided dataset directory).
vocab, training_set, valid, test_1, test_2 = data.get_data(doc_terms_file_name="tf_idf_doc_terms_matrix_time_window_1",
                                                           terms_filename="tf_idf_terms_time_window_1")
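
On the first question, here is a minimal sketch of what happens when a script defines an argparse option but never reads it. The `build_parser` and `run` functions are hypothetical stand-ins, not the actual main.py; only the `--data_path` definition is copied from above. Assuming nothing reads `args.data_path`, the flag is accepted and parsed but has no effect on the result:

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the parser in main.py; only --data_path
    # is copied from the issue, --num_topics is here for illustration.
    parser = argparse.ArgumentParser(description="illustrative parser")
    parser.add_argument('--data_path', type=str, default='data/20ng',
                        help='directory containing data')
    parser.add_argument('--num_topics', type=int, default=50,
                        help='number of topics')
    return parser


def run(argv):
    args = build_parser().parse_args(argv)
    # Assumption: the script only ever reads num_topics. data_path is
    # stored on `args` but never used, so it cannot change the outcome.
    return {"num_topics": args.num_topics}


with_flag = run(["--data_path", "data/20ng", "--num_topics", "50"])
without_flag = run(["--num_topics", "50"])
print(with_flag == without_flag)
```

If that assumption holds for main.py, passing `--data_path` on the command line would be harmless but unnecessary.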
liuh236 commented 2 years ago

same question...

lxkkk117 commented 2 years ago

Me too, I've also run into this problem...

zhaoLLL commented 2 years ago

For the second question, you can find it in the file data_espy_tweets.py:
savemat(path_save.joinpath('tf_idf_doc_terms_matrix_time_window_1'), {"doc_terms_matrix": doc_terms_matrix})
savemat(path_save.joinpath('tf_idf_terms_time_window_1'), {"terms": terms})
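
To make the connection between those `savemat` calls and the names passed to `data.get_data` easier to see, here is a rough sketch of writing two such files and reading them back with `scipy.io`. The toy matrix, the term list, and the `save_and_reload` helper are made up for illustration, and I spell out the `.mat` extension explicitly as an assumption so the example does not depend on how the suffix is handled. This is not the repo's actual loader:

```python
import tempfile
from pathlib import Path

import numpy as np
from scipy.io import loadmat, savemat


def save_and_reload(path_save: Path):
    # Toy stand-ins for the real TF-IDF outputs (values are made up).
    doc_terms_matrix = np.array([[0.0, 0.5, 0.2],
                                 [0.7, 0.0, 0.1]])
    terms = np.array(["alpha", "gamma", "delta"])

    # Same base names as in the snippet above, with an explicit extension.
    matrix_file = str(path_save / "tf_idf_doc_terms_matrix_time_window_1.mat")
    terms_file = str(path_save / "tf_idf_terms_time_window_1.mat")

    # Each file stores one variable under the dictionary key given here.
    savemat(matrix_file, {"doc_terms_matrix": doc_terms_matrix})
    savemat(terms_file, {"terms": terms})

    # loadmat returns a dict; the saved variable is looked up by its key.
    loaded_matrix = loadmat(matrix_file)["doc_terms_matrix"]
    loaded_terms = loadmat(terms_file)["terms"]
    return doc_terms_matrix, loaded_matrix, loaded_terms


with tempfile.TemporaryDirectory() as tmp:
    original, loaded_matrix, loaded_terms = save_and_reload(Path(tmp))

print(loaded_matrix.shape)
print(loaded_terms)
```

So the two file names in the `get_data` call would only exist after a preprocessing script like this has been run, which may be why they are not in the provided dataset directory.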

manueltonneau commented 2 years ago

I have the same problem.

@zhaoLLL thanks for your reply, but how do the bow_X_tokens.mat and bow_X_counts.mat files map to these two TF-IDF matrices?

manueltonneau commented 2 years ago

Since this repo no longer seems to be maintained, I suggest using another repo I just discovered, which let me use ETM very easily: https://github.com/lffloyd/embedded-topic-model

Littleele commented 1 year ago

same question!