Open tutubalinaev opened 5 years ago
Would it be possible to briefly explain how the data files in /data/./ were created from the 20 NewsGroups dataset, and what the data in each of these files represents?
Thank you!
@adjidieng could you please comment on my request?
The tokens file contains all the tokens (words) present in a document; you can use vocab.pkl to map the token indices back to words.
The counts file gives the count of each of those tokens in that document.
Hi everyone,
I am posting below an email that we wrote that replies to this question.
Thank you very much for your interest in the ETM model! We're glad you're looking into it.
The formatting of the data is as follows. All data files are in a bag-of-words format. Their names are bow_XX_YY.mat, where
XX = {tr, ts, ts_h1, ts_h2, va}  # training, test, test (first half of each doc), test (second half of each doc), validation
YY = {tokens, counts}  # content of the file: tokens or counts
Each file contains a list of documents. That is, each list is of the form [doc_1, doc_2, ..., doc_N]. Each element doc_i is itself a list with integers. The integers represent either the vocabulary terms (they are 0-indexed) for the "tokens" files, or the word counts for the "counts" files. For example, if doc_1=[0, 14, 17] in the file ending in "tokens.mat" and doc_1=[3, 1, 2] in the file ending in "counts.mat", that means that term 0 occurs 3 times in the document, term 14 appears once, and term 17 appears twice.
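As a toy illustration of that pairing (using the same numbers as the example above), the two lists can be zipped into a per-document word-count mapping:

# Toy illustration of the tokens/counts pairing described above.
doc_1_tokens = [0, 14, 17]   # vocabulary indices present in the document
doc_1_counts = [3, 1, 2]     # how often each of those indices occurs
word_counts = dict(zip(doc_1_tokens, doc_1_counts))
print(word_counts)  # {0: 3, 14: 1, 17: 2} -> term 0 three times, term 14 once, term 17 twice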
To be more specific, here is how we created the bow_tr_YY.mat files from bow_tr (which is a scipy sparse matrix in CSR format containing the bag-of-words representation of all documents in the training set):
from scipy.io import savemat

def split_bow(bow_in, n_docs):
    # For each document (row of the CSR matrix), collect the column indices of its
    # non-zero entries (the vocabulary terms) and the corresponding counts.
    indices = [[w for w in bow_in[doc, :].indices] for doc in range(n_docs)]
    counts = [[c for c in bow_in[doc, :].data] for doc in range(n_docs)]
    return indices, counts

bow_tr_tokens, bow_tr_counts = split_bow(bow_tr, n_docs_tr)
# savemat appends the .mat extension by default, producing bow_tr_tokens.mat / bow_tr_counts.mat
savemat('bow_tr_tokens', {'tokens': bow_tr_tokens}, do_compression=True)
savemat('bow_tr_counts', {'counts': bow_tr_counts}, do_compression=True)
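For illustration, here is a minimal sketch (not part of the original email) of reading such a pair back with scipy. The 'tokens'/'counts' keys match the savemat calls above; the .squeeze() calls are defensive, since loadmat tends to return the ragged per-document lists as nested object arrays:

from scipy.io import loadmat

# Each entry of these arrays corresponds to one document.
tokens = loadmat('bow_tr_tokens.mat')['tokens'].squeeze()
counts = loadmat('bow_tr_counts.mat')['counts'].squeeze()

doc_id = 0
doc_tokens = tokens[doc_id].squeeze()  # vocabulary indices present in document 0
doc_counts = counts[doc_id].squeeze()  # how often each of those indices occurs
for term, count in zip(doc_tokens, doc_counts):
    print(term, count)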
Finally, vocab.pkl is simply a list containing the strings corresponding to the vocabulary terms. We created this file using pickle.dump(vocab, f).
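For completeness, a minimal, self-contained sketch of writing and reading vocab.pkl along these lines (the toy vocabulary here is purely illustrative):

import pickle

# Toy vocabulary: a plain Python list of strings, index-aligned with the token ids above.
vocab = ['apple', 'banana', 'cherry']

with open('vocab.pkl', 'wb') as f:
    pickle.dump(vocab, f)

with open('vocab.pkl', 'rb') as f:
    loaded = pickle.load(f)

print(loaded[0])  # -> 'apple', i.e. the string for vocabulary term 0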
We hope that helps!
Thanks for this explanation, it really helps! I am aware that creating the bag-of-words is out of the scope of this project; however, any reference on how to transform documents into bag-of-words form would really help, since this usually requires fairly subjective choices (e.g. tokenization, lemmatization, stop words, etc.).
Thanks a lot, David L.
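For reference, here is a rough sketch (not the repository's actual preprocessing script) of one way to turn raw documents into the CSR bag-of-words matrix that split_bow above expects, using scikit-learn; the min_df/max_df thresholds are illustrative choices, not the ones used for the released data:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Raw 20 NewsGroups documents (headers/footers/quotes stripped).
docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes')).data

vectorizer = CountVectorizer(
    stop_words='english',  # drop common English stop words
    min_df=30,             # ignore very rare terms (illustrative threshold)
    max_df=0.7,            # ignore terms appearing in more than 70% of documents
)
bow_tr = vectorizer.fit_transform(docs)            # scipy CSR matrix: documents x vocabulary
vocab = list(vectorizer.get_feature_names_out())   # index-aligned vocabulary list

# bow_tr and vocab can then be fed to split_bow / savemat / pickle as in the email above.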
Hi There,
We just added the scripts to pre-process a dataset in the repo. Please check that out and let us know if you have other questions.
Hi - loved your scripts. They help a lot. The only minor bug is that the output files need a .mat extension, and at least the 20ng script doesn't write them out that way. Minor point!
Could you please upload a preprocessing script that creates all of the files in data/./? The files are: bow_tr_counts.mat, bow_tr_tokens.mat, bow_ts_counts.mat, bow_ts_h1_counts.mat, bow_ts_h1_tokens.mat, bow_ts_h2_counts.mat, bow_ts_h2_tokens.mat, bow_ts_tokens.mat, bow_va_counts.mat, bow_va_tokens.mat, vocab.pkl.
Thank you in advance!