GU-DataLab / stance-detection-KE-MLM

Official resource of the paper "Knowledge Enhanced Masked Language Model for Stance Detection", NAACL 2021
https://www.aclweb.org/anthology/2021.naacl-main.376/
GNU General Public License v3.0

Request for Significant Token Masking code #4

Closed dopc closed 2 years ago

dopc commented 2 years ago

Hey,

Congratulations on the great paper, and thanks for sharing your code on GitHub!

I am working on another classification problem for my master's thesis and want to try Significant Token Masking, as you do, in my research.

Is there any chance you could share your data loading code (preprocessing, DataCollator, etc.) to guide me?

BR.

kornosk commented 2 years ago

Hi @dopc - Thank you very much for your interest in our work.

Unfortunately, we cannot share the code for the pre-training process because of some server issues. Also, our code is quite old and outdated relative to the current HuggingFace codebase. I took a look at the current HuggingFace repo (v4.21.0). Here is a workaround to reproduce our KE-MLM algorithm.

  1. Once you have the tokens you want to mask from the log-odds-ratio, you can mask them by modifying the torch_mask_tokens & mask_tokens functions here and here.
  2. You have to convert the tokens you want to mask to token_ids first.
  3. The variable masked_indices indicates which tokens will be masked in the samples of the current batch. So you can for-loop over the input matrix and check whether any input IDs match the token_ids you want to mask. If there is a match, set the value of masked_indices for that token to 1 (or True). The token will then be masked in the process with 80% probability, as you can see in the function.
  4. You can also manually set the values so that those tokens are masked with 100% probability, if you want (see the sketch after this list).

Thank you again for your interest and please do not hesitate to contact me if you have any other questions.

dopc commented 2 years ago

Thanks for the quick reply!

The problems I have faced with the workaround you suggest

I had already played with torch_mask_tokens before opening this issue and actually implemented a similar process.

[Problem 1] My problem is finding the tokens I want to mask, which you defined as Step 2 in the workaround you propose. The word I want to mask may be tokenized into two tokens, for example words like without or wishing, so the thing I need to replace with the mask token becomes a sequence of tokens. This is one problem I have faced.
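For illustration (my own examples, not necessarily the words above, since the splits depend on the tokenizer's vocabulary), a significant word can come back as several WordPiece tokens:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# An in-vocabulary word stays a single token, while an out-of-vocabulary word
# is split into subword pieces, e.g. "embeddings" -> ['em', '##bed', '##ding', '##s'].
print(tokenizer.tokenize("without"))
print(tokenizer.tokenize("embeddings"))
```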

[Problem 2] The other one is that, as far as I know, the BERT tokenizer may tokenize the same word differently depending on its context. I have not experimented with this deeply, but in a few examples I saw it happen. Because of that, I could not get the token IDs of my significant words and could not implement the workaround you proposed.

What I tried

Instead, I did the following:

  1. Replace each significant word with the text <mask> using the Datasets map function,
  2. In the torch_mask_tokens function, choose 15% of them,
  3. Apply the 80%/10%/10% rule.

But then I realized that when I replace the word with <mask>, my Step 2 is meaningless, because I have already masked 100% of my significant words, and that is not what I want.

What I ask

Sorry for the long text, but I wanted to give the whole picture.

To sum up, is there a way to solve [Problem 1] and [Problem 2]?

BR.

kornosk commented 2 years ago

@dopc You're on the right track. I had those problems before. Here is what I've done to bypass them.

Problem 1: I simply get the token IDs of the stance tokens first. For example, suppose Biden is tokenized into bid and ##en (I just made this up), which are mapped to IDs 123 and 456. When I mask, I search for the sequence of token IDs 123, 456. If 123 is not followed by 456, I do not mask; if 123 is followed by 456, I mask both. Does this make sense to you?
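Not the exact code I used, but a rough sketch of that matching step, assuming masked_indices is the boolean tensor built inside torch_mask_tokens and target_ids are the made-up IDs [123, 456] above:

```python
import torch


def force_mask_token_sequence(input_ids, masked_indices, target_ids):
    """Mark every contiguous occurrence of `target_ids` for masking.

    input_ids:      LongTensor of shape (batch_size, seq_len)
    masked_indices: BoolTensor of the same shape, as built in torch_mask_tokens
    target_ids:     subword IDs of one significant word, e.g. [123, 456]
    """
    target = torch.tensor(target_ids)
    n = len(target_ids)
    for row in range(input_ids.size(0)):
        for start in range(input_ids.size(1) - n + 1):
            # Only mask when the full subword sequence matches; a lone 123 is skipped.
            if torch.equal(input_ids[row, start:start + n], target):
                masked_indices[row, start:start + n] = True
    return masked_indices
```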

Problem 2: I just ignore this case. Basically, I only search for token IDs that exactly match what I am looking for. This problem is good potential future work to investigate further.

Hope this helps!! Feel free to ask any other questions if you have seen other problems. I might be able to help. 👍