MurtyShikhar / Pushdown-Layers

Code for Pushdown Layers from our EMNLP 2023 paper
https://murtyshikhar.github.io/pushdown/

Question about the different vocabsize between bllip-lg model and GPT2Tokenizer #1

Open zhaoyd1 opened 1 month ago

zhaoyd1 commented 1 month ago

Hello, I've been trying to evaluate the 'bllip-lg' model with your provided code. However, I found that the dataset is preprocessed with GPT2Tokenizer, which has a vocab size of 50257, while the vocabulary in vocab.pkl has a different size. I failed to evaluate the model directly with 'eval_pushdown_model.py' because some tokens produced by GPT2Tokenizer do not exist in vocab.pkl. How should I deal with this problem?
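For reference, the mismatch can be checked with a minimal sketch like the one below (the token→id dict format of vocab.pkl is an assumption on my part; the vocabulary here is a toy stand-in, not the released file):

```python
def missing_tokens(tokens, vocab):
    """Return the tokens that are absent from the model vocabulary."""
    return [t for t in tokens if t not in vocab]

# In practice the vocabulary would be loaded from the released vocab.pkl, e.g.:
#   import pickle
#   with open("vocab.pkl", "rb") as f:
#       vocab = pickle.load(f)
# Toy stand-in vocabulary for illustration:
vocab = {"Ġthe": 0, "Ġcann": 1, "ot": 2}

print(missing_tokens(["Ġthe", "Ġcannot"], vocab))  # 'Ġcannot' is OOV here
```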

MurtyShikhar commented 1 month ago

Hi Yida, since the model is trained only on the BLLIP-LG data, many tokens from the GPT2Tokenizer are actually absent from its vocabulary. Could I ask which dataset you're trying to evaluate the model on? If it's the PTB, we manually removed all sentences that had OOV tokens (as mentioned in the paper).
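The OOV filtering described above might look something like the following sketch (a hypothetical helper for illustration, not the repo's actual preprocessing code):

```python
def drop_oov_sentences(tokenized_sentences, vocab):
    """Keep only sentences whose tokens all appear in the model vocabulary."""
    return [sent for sent in tokenized_sentences
            if all(tok in vocab for tok in sent)]

# Toy vocabulary and data for illustration:
vocab = {"Ġthe", "Ġdog", "Ġbarks"}
sentences = [["Ġthe", "Ġdog", "Ġbarks"], ["Ġthe", "Ġcat"]]  # 'Ġcat' is OOV

print(drop_oov_sentences(sentences, vocab))  # second sentence is dropped
```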

zhaoyd1 commented 1 month ago

> Hi, Yida, since the model is trained on just the BLLIP-LG data, many tokens from the GPT2tokenizer are actually absent. Could I ask what dataset you're trying to evaluate the model on? If it's the PTB, then we manually removed all sentences that had OOV tokens (also mentioned in the paper).

Thank you for your reply. I want to evaluate it on the BLLIP-LG test set, and I can give an example of the tokenization error. Using GPT2Tokenizer, the word 'cannot' is tokenized as the single token 'Ġcannot'. But in the downloaded vocab.pkl, 'Ġcannot' does not exist (only 'Ġcann' and 'ot' exist). Do such words need to be retokenized, or is there some trick to deal with this problem?

MurtyShikhar commented 1 month ago

Gotcha. That is not how "cannot" was tokenized in my copy of BLLIP-LG. Here is an example: (S (SBAR (IN If) (S (NP (DT Ġthe) (NP|<NN-NN> (NN Ġasbestos) (NN Ġremoval))) (VP (VBZ Ġis) (VP|<RB-VP> (RB Ġnot) (VP (VBN Ġcompleted) (SBAR (IN Ġbefore) (S (NN Ġschool) (VP (VBZ Ġresumes) (PP (IN Ġin) (NP (DT Ġthe) (NN Ġfall))))))))))) (S|<,-NN-VP-.> (, Ġ,) (S|<NN-VP-.> (NN Ġwork) (S|<VP-.> (VP (MD Ġmust) (VP (VB Ġbe) (VP (VBN Ġhalted) (VP|<PP-SBAR> (PP (IN Ġuntil) (NP (DT Ġthe) (NP|<VBG-NN> (VBG Ġfollowing) (NN Ġsummer)))) (SBAR (IN Ġbecause) (S (PRP Ġit) (VP (MD Ġcan) (VP|<RB-VP> (RB Ġnot) (VP (VB Ġbe) (VP (VBN Ġdone) (SBAR (IN Ġwhile) (S (NNS Ġstudents) (VP (VBP Ġare) (PP (IN Ġin) (NN Ġschool))))))))))))))) (. Ġ.)))))

You should run the process_bllip.py script inside data_utils on your copy of the BLLIP-LG test set.

zhaoyd1 commented 2 weeks ago

> Gotcha. That is not how "cannot" was tokenized in my copy of BLLIP-LG. Here is an example: (S (SBAR (IN If) (S (NP (DT Ġthe) (NP|<NN-NN> (NN Ġasbestos) (NN Ġremoval))) (VP (VBZ Ġis) (VP|<RB-VP> (RB Ġnot) (VP (VBN Ġcompleted) (SBAR (IN Ġbefore) (S (NN Ġschool) (VP (VBZ Ġresumes) (PP (IN Ġin) (NP (DT Ġthe) (NN Ġfall))))))))))) (S|<,-NN-VP-.> (, Ġ,) (S|<NN-VP-.> (NN Ġwork) (S|<VP-.> (VP (MD Ġmust) (VP (VB Ġbe) (VP (VBN Ġhalted) (VP|<PP-SBAR> (PP (IN Ġuntil) (NP (DT Ġthe) (NP|<VBG-NN> (VBG Ġfollowing) (NN Ġsummer)))) (SBAR (IN Ġbecause) (S (PRP Ġit) (VP (MD Ġcan) (VP|<RB-VP> (RB Ġnot) (VP (VB Ġbe) (VP (VBN Ġdone) (SBAR (IN Ġwhile) (S (NNS Ġstudents) (VP (VBP Ġare) (PP (IN Ġin) (NN Ġschool))))))))))))))) (. Ġ.)))))
>
> You should run the process_bllip.py script inside data_utils on your copy of the BLLIP-LG test set.

Thanks for your help; I've solved my problem. By the way, I noticed that in your config, ff_multiplier is set to 1, which is quite different from the conventional 3 or 4, and also differs from the setting used in TG. Is there any trick to setting this hyperparameter?
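For context on why ff_multiplier matters: in a standard Transformer block the feed-forward inner dimension is ff_multiplier × d_model, so this setting directly scales the FFN parameter count. A rough back-of-the-envelope sketch (d_model = 512 is an illustrative value here, not necessarily the paper's configuration):

```python
def ffn_params(d_model, ff_multiplier):
    """Parameter count of a standard two-layer Transformer FFN (weights + biases)."""
    d_ff = ff_multiplier * d_model
    # first projection: d_model -> d_ff, second projection: d_ff -> d_model
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

d_model = 512  # illustrative model dimension
print(ffn_params(d_model, 1))  # ff_multiplier = 1, as in the Pushdown Layers config
print(ffn_params(d_model, 4))  # ff_multiplier = 4, the common Transformer convention
```

With multiplier 1 the FFN holds roughly a quarter of the parameters it would with multiplier 4, so the choice trades FFN capacity for a smaller (or differently allocated) parameter budget.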