areejokaili / topic_labelling


input data for preprocess #10

Open itaim opened 3 years ago

itaim commented 3 years ago

Hi! If I understand correctly from reading the other closed issues, to run inference on my own set of topics I first need to run preprocess on my topics data file. But preprocess expects in_data_path='/Users/areej/Desktop/wiki_extract/wiki_title_topn_doc/'. Where can I get this data? Or am I missing something, and I can run inference without it?

areejokaili commented 3 years ago


Hi @itaim, Thanks for your interest in our work!

You have two options:

  1. [Option 1] Preprocess your own training data using preprocess_wiki.py. The data needs to be in a CSV file with 3 columns, as follows:

| article's title | article's top-n words | article's first couple of sentences |
| --- | --- | --- |
| window decoration | window titlebar managers bar buttons ... | In graphical user interfaces, the window decoration is ... |

Please refer to the paper for details on how we extracted this training data from Wikipedia. The resulting processed data will be stored in data/wiki_tfidf/ or data/wiki_sent/.

  2. [Option 2] Use the already-provided processed data in data/wiki_tfidf and data/wiki_sent to train and generate labels for your topics.
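
For Option 1, a three-column CSV could be written along these lines. This is only a sketch using Python's standard csv module: the helper name `write_articles_csv` and the `rows` variable are hypothetical, the single row repeats the example from this thread (with its elisions left as-is), and whether preprocess_wiki.py expects a header row or a particular delimiter is an assumption to check against the script itself.

```python
import csv
import io

# Hypothetical input: one tuple per Wikipedia article, in the column order
# described above (title, top-n words, first couple of sentences).
rows = [
    (
        "window decoration",
        "window titlebar managers bar buttons ...",
        "In graphical user interfaces, the window decoration is ...",
    ),
]


def write_articles_csv(rows, fh):
    """Write (title, top-n words, first sentences) tuples as CSV rows.

    No header row is written; whether preprocess_wiki.py expects one
    is an assumption to verify against the script.
    """
    writer = csv.writer(fh)
    for title, top_words, first_sentences in rows:
        writer.writerow([title, top_words, first_sentences])


# Write to an in-memory buffer here; for a real file, open it with
# open(path, "w", newline="", encoding="utf-8") instead.
buf = io.StringIO()
write_articles_csv(rows, buf)
csv_text = buf.getvalue()
```

The csv writer quotes any field that contains a comma, so the sentence column stays a single field when the file is read back.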

Hope this makes it clearer.

Areej