ahoho / kd-topic-models

Repo for EMNLP 2020 paper, "Improving Neural Topic Models using Knowledge Distillation"
https://www.aclweb.org/anthology/2020.emnlp-main.137/

Would you like to post a simple example notebook? #1

Open ShuangNYU opened 3 years ago

ShuangNYU commented 3 years ago

Hi,

Thanks for releasing your code in this repository! It would be exciting to try the neural topic models with knowledge distillation.

I was wondering whether you could provide a simple example as guidance, perhaps as a Jupyter notebook, so that I know which file to run first.

Best wishes

ahoho commented 3 years ago

Thanks for your interest in our work! We intend to put together a fuller how-to in the future, but for the time being, here's a rough outline of the necessary steps. Please let us know if you run into problems---I'll compile these into a readme once we're sure this all works.

1. As of now, you'll need two conda environments to run both the BERT teacher and the topic-model student (which is a modification of Scholar). The environment files are defined in `teacher/teacher.yml` and `scholar/scholar.yml` for the teacher and topic model, respectively. For example: `conda env create -f teacher/teacher.yml` (edit the first line in the yml file if you want to change the name of the resulting environment; the default is `transformers28`). See the sketch just below for creating both at once.
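For completeness, creating both environments might look like this (the default name `transformers28` is stated for `teacher/teacher.yml`; I'm assuming `scholar/scholar.yml` defaults to `scholar`, which matches the `conda activate scholar` calls in the later steps):

```sh
# create both environments; default names come from the first line of each yml
conda env create -f teacher/teacher.yml    # -> transformers28
conda env create -f scholar/scholar.yml    # -> scholar (assumed; used below)
```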
2. We don't have a general-purpose data processing pipeline together, but you can use the IMDb setup as a guide:

```sh
conda activate scholar
python data/imdb/download_imdb.py

# main preprocessing script
python preprocess_data.py data/imdb/train.jsonlist data/imdb/processed --vocab_size 5000 --test data/imdb/test.jsonlist

# create a dev split from the train data
python create_dev_split.py
```
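If you're adapting this to your own data, note that the inputs are `.jsonlist` files, i.e., one JSON object per line. A rough sketch of what a record might look like (the field names `id`, `text`, and `rating` are assumptions based on the IMDb setup; check which fields `preprocess_data.py` actually reads):

```
{"id": "train_0001", "text": "One of the best films I have ever seen...", "rating": 9}
{"id": "train_0002", "text": "Dull and predictable from start to finish.", "rating": 2}
```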

3. Run the teacher model. Below is what we used for IMDb:

```sh
conda activate transformers28

python teacher/bert_reconstruction.py \
    --input-dir ./data/imdb/processed-dev \
    --output-dir ./data/imdb/processed-dev/logits \
    --do-train \
    --evaluate-during-training \
    --truncate-dev-set-for-eval 120 \
    --logging-steps 200 \
    --save-steps 1000 \
    --num-train-epochs 6 \
    --seed 42 \
    --num-workers 4 \
    --batch-size 20 \
    --gradient-accumulation-steps 8 \
    --document-split-pooling mean-over-logits
```
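As an aside, `--document-split-pooling mean-over-logits` averages the teacher's output logits over the chunks of a document too long for BERT's input window. Conceptually it amounts to something like this (a toy sketch, not the repo's actual code):

```python
import numpy as np

# toy illustration of mean-over-logits pooling for a long document
# that was split into chunks before being fed to the teacher
chunk_logits = np.random.randn(3, 5000)  # 3 chunks, 5000-word vocabulary
doc_logits = chunk_logits.mean(axis=0)   # one logit vector for the whole document
print(doc_logits.shape)                  # (5000,)
```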

4. Collect the logits from the teacher model (the `--checkpoint-folder-pattern` argument accepts glob pattern matching in case you want to create logits for different stages of training; be sure to enclose the pattern in double quotes `"`).

```sh
conda activate transformers28

python teacher/bert_reconstruction.py \
    --output-dir ./data/imdb/processed-dev/logits \
    --seed 42 \
    --num-workers 6 \
    --get-reps \
    --checkpoint-folder-pattern "checkpoint-9000" \
    --save-doc-logits \
    --no-dev
```
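To sanity-check the result, you can peek at what was saved. The directory below comes from the `--doc-reps-dir` value used in step 5, but the file naming and `.npy` format inside it are assumptions, so adjust to whatever the script actually writes:

```python
from pathlib import Path
import numpy as np

# hypothetical sanity check of the saved teacher logits;
# the .npy extension and per-file layout are assumptions
logit_dir = Path("./data/imdb/processed-dev/logits/checkpoint-9000/doc_logits")
for fpath in sorted(logit_dir.glob("*.npy"))[:3]:
    arr = np.load(fpath)
    print(fpath.name, arr.shape)
```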

5. Run the topic model. This is the messiest part of the code and we will be cleaning it up, but in the meantime, my apologies for all the extraneous/obscure arguments. 

```sh
conda activate scholar

python scholar/run_scholar.py \
    ./data/imdb/processed-dev \
    --dev-metric npmi \
    -k 50 \
    --epochs 500 \
    --patience 500 \
    --batch-size 200 \
    --background-embeddings \
    --device 0 \
    --dev-prefix dev \
    -lr 0.002 \
    --alpha 0.5 \
    --eta-bn-anneal-step-const 0.25 \
    --doc-reps-dir ./data/imdb/processed-dev/logits/checkpoint-9000/doc_logits \
    --use-doc-layer \
    --no-bow-reconstruction-loss \
    --doc-reconstruction-weight 0.5 \
    --doc-reconstruction-temp 1.0 \
    --doc-reconstruction-logit-clipping 10.0 \
    -o ./outputs/imdb
```
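Once training finishes, the learned topics should land under `./outputs/imdb`. A quick way to skim them (assuming a `topics.txt` file, which is how Scholar-style code typically writes them; adjust the name if the output differs):

```python
# skim the first few topics; the file name topics.txt is an assumption
# about what run_scholar.py writes to the output directory
with open("./outputs/imdb/topics.txt") as infile:
    for i, line in enumerate(infile):
        if i >= 5:
            break
        print(line.strip())
```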

ShuangNYU commented 3 years ago

Hi Alexander! When using `python preprocess_data.py data/imdb/train.jsonlist data/imdb/processed --vocab_size 5000 --test data/imdb/test.jsonlist`, what should I use for the option `--label` or `--label_dict`? I got an error when using `--label '1,2,3,4'`:

```
Traceback (most recent call last):
  File "../data/imdb/preprocess_data.py", line 661, in <module>
    main(sys.argv)
  File "../data/imdb/preprocess_data.py", line 181, in main
    preprocess_data(
  File "../data/imdb/preprocess_data.py", line 297, in preprocess_data
    for ids, tokens, labels in pool.imap(partial(_process_item, **kwargs), group):
  File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
KeyError: '1'
```

ahoho commented 3 years ago

Hm, I thought we had it set up so that the labels were optional, but in the meantime I think you can run `--labels rating` to get it to create labels for the associated review score (you aren't obligated to use them later).
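Concretely, that would be the step-2 command with the label flag added (a sketch; the earlier `KeyError: '1'` suggests the flag expects the name of a field in each JSON record, like `rating`, rather than a list of label values):

```sh
python preprocess_data.py data/imdb/train.jsonlist data/imdb/processed \
    --vocab_size 5000 \
    --test data/imdb/test.jsonlist \
    --labels rating
```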

@Pranav-Goel should also be able to help with this. It's possible I'm missing something, or we just need to update the code.

Pranav-Goel commented 3 years ago

Hi, sorry I am getting to this quite late. I think the preprocessing script should be able to run without you having to provide any labels. Do you have labels that you want to provide? If not, did you try running it without labels and get an error? If so, could you share that error message (i.e., from running without `--label`)?