adjidieng / ETM

Topic Modeling in Embedding Spaces
MIT License

Preprocessing #1

Open tutubalinaev opened 5 years ago

tutubalinaev commented 5 years ago

Could you please upload the preprocessing script that creates all the files in data/./? The files are: bow_tr_counts.mat, bow_tr_tokens.mat, bow_ts_counts.mat, bow_ts_h1_counts.mat, bow_ts_h1_tokens.mat, bow_ts_h2_counts.mat, bow_ts_h2_tokens.mat, bow_ts_tokens.mat, bow_va_counts.mat, bow_va_tokens.mat, and vocab.pkl.

Thank you in advance!

Aalisha commented 5 years ago

Would it be possible to explain briefly how these data files in /data/./ were created from the 20 Newsgroups dataset, and what the data in each of these files represents?

Thank you!

tutubalinaev commented 5 years ago

@adjidieng could you please comment on my request?

Mandark27 commented 4 years ago

The tokens file contains all the tokens (words) present in a document; you can use vocab.pkl to map the token ids back to words.

The counts file gives the count of each of those tokens in that document.
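
For illustration, here is a minimal sketch (my addition, not from the repo; the file path is a placeholder) of loading vocab.pkl to see the actual words:

    import pickle

    # vocab.pkl is a pickled list of vocabulary strings; index i is the word for token id i.
    # Adjust the path to wherever your data files live.
    with open('vocab.pkl', 'rb') as f:
        vocab = pickle.load(f)

    print(len(vocab))    # vocabulary size
    print(vocab[:10])    # first ten vocabulary words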

adjidieng commented 4 years ago

Hi everyone,

I am posting below an email that we wrote in reply to this question.

Thanks.

Thank you very much for your interest in the ETM model! We're glad you're looking into it.

The formatting of the data is as follows. All data files are in a bag-of-words format. Their names are bow_XX_YY.mat, where

- XX = {tr, ts, ts_h1, ts_h2, va}: training, test, test (first half of each doc), test (second half of each doc), validation
- YY = {tokens, counts}: whether the file stores tokens or counts

Each file contains a list of documents, i.e. a list of the form [doc_1, doc_2, ..., doc_N]. Each element doc_i is itself a list of integers. The integers represent either the vocabulary terms (they are 0-indexed) for the "tokens" files, or the word counts for the "counts" files. For example, if doc_1 = [0, 14, 17] in the file ending in "tokens.mat" and doc_1 = [3, 1, 2] in the file ending in "counts.mat", that means that term 0 occurs 3 times in the document, term 14 appears once, and term 17 appears twice.
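
A tiny added illustration of that pairing, using the example values above (this snippet is not part of the original email):

    # doc_1 as described in the example: term ids from the "tokens" file, counts from the "counts" file.
    doc_1_tokens = [0, 14, 17]
    doc_1_counts = [3, 1, 2]
    term_counts = dict(zip(doc_1_tokens, doc_1_counts))
    # term_counts == {0: 3, 14: 1, 17: 2}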

To be more specific, here is how we created the bow_tr_YY.mat files from bow_tr (which is a scipy sparse matrix in CSR format containing the bag-of-words representation of all documents in the training set):

    from scipy.io import savemat

    def split_bow(bow_in, n_docs):
        # For each document (row of the sparse matrix), collect the column indices
        # (vocabulary term ids) and the corresponding non-zero counts.
        indices = [[w for w in bow_in[doc, :].indices] for doc in range(n_docs)]
        counts = [[c for c in bow_in[doc, :].data] for doc in range(n_docs)]
        return indices, counts

    bow_tr_tokens, bow_tr_counts = split_bow(bow_tr, n_docs_tr)
    savemat('bow_tr_tokens', {'tokens': bow_tr_tokens}, do_compression=True)
    savemat('bow_tr_counts', {'counts': bow_tr_counts}, do_compression=True)
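
For context, bow_tr is assumed to already exist in the code above; here is a minimal sketch (my addition, not from the original email) of one way such a CSR matrix could be built from raw text, using scikit-learn's CountVectorizer with placeholder documents and preprocessing choices:

    from sklearn.feature_extraction.text import CountVectorizer

    docs_tr = ["first training document ...", "another training document ..."]  # placeholder corpus

    vectorizer = CountVectorizer(stop_words='english')   # tokenization and stop-word choices are up to you
    bow_tr = vectorizer.fit_transform(docs_tr)            # scipy.sparse CSR matrix: n_docs x vocab_size
    vocab = list(vectorizer.get_feature_names_out())      # vocabulary strings, 0-indexed like the tokens
    n_docs_tr = bow_tr.shape[0]

    bow_tr_tokens, bow_tr_counts = split_bow(bow_tr, n_docs_tr)   # then save with savemat as above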

Finally, vocab.pkl is simply a list containing the strings corresponding to the vocabulary terms. We created this file using

    import pickle
    with open('vocab.pkl', 'wb') as f:
        pickle.dump(vocab, f)

We hope that helps!
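
In case it's useful, here is a minimal sketch (my addition, based on scipy.io.loadmat and the keys used in the savemat calls above) of reading the files back and inspecting the first document:

    import numpy as np
    from scipy.io import loadmat

    # Ragged per-document lists typically come back from loadmat as object arrays, one entry per document.
    tokens = loadmat('bow_tr_tokens.mat')['tokens'].squeeze()
    counts = loadmat('bow_tr_counts.mat')['counts'].squeeze()

    # Pair each vocabulary term id in the first document with its count
    # (use vocab.pkl, as above, to map the ids back to words).
    doc_tokens = np.atleast_1d(tokens[0].squeeze())
    doc_counts = np.atleast_1d(counts[0].squeeze())
    for term_id, count in zip(doc_tokens, doc_counts):
        print(int(term_id), int(count))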

thousandoaks commented 4 years ago

Thanks for this explanation, it really helps! I am aware that creating the bag-of-words files is out of the scope of this project; however, any reference on how to transform documents into bag-of-words would really help, since this usually involves fairly subjective choices (e.g. tokenization, lemmatization, stop words, etc.).

Thanks a lot, David L.

adjidieng commented 4 years ago

Hi there,

We just added the pre-processing scripts to the repo. Please check them out and let us know if you have any other questions.

arnicas commented 4 years ago

Hi - I loved your scripts, they help a lot. The only minor bug is that the output files have to end in .mat, and at least the 20ng script didn't write them out that way. Minor point!