dpfried / incoder

Generative model for code infilling and synthesis

Tokenizer does not have a padding token. #7

Open aissak21 opened 2 years ago

aissak21 commented 2 years ago

Hi.

I'm setting up to fine-tune InCoder with PyTorch, using dynamic padding, as follows:

from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B", use_fast=True, do_lower_case=False)

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

# `dataset` is a DatasetDict with "train"/"test" splits and a "text" column
tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4)

train_dataset = tokenized_datasets["train"].shuffle(seed=42)
test_dataset = tokenized_datasets["test"].shuffle(seed=42)

data_collator = DataCollatorWithPadding(tokenizer)

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8, collate_fn=data_collator)
test_dataloader = DataLoader(test_dataset, shuffle=True, batch_size=8, collate_fn=data_collator)

I keep getting this error: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

I'm hesitant to use the tokenizer.pad_token = tokenizer.eos_token fix, since in this model the EOS token (<|endoftext|>) "marks the start of a document to our model" (unless that means something different from "encoding prepends the <|endoftext|> token").

I plan to go ahead and use 0 as the padding token (tokenizer.add_special_tokens({'pad_token': '[0]'})), since that is the default in most cases, but I wanted to ask what causes the error, as I suppose it has something to do with how the model's special tokens are set up.

Thanks!

dpfried commented 2 years ago

Hi - unfortunately we didn't expose it in the tokenizer (I should fix this!), but the model was trained with a pad token, which has ID 1 and string "<pad>". I've been using code similar to the following for padded batched generation, although I haven't verified that it gives exactly the same results as unpadded generation, or tried it with DataCollatorWithPadding yet. Please let me know if it doesn't work and I will debug.

from transformers import AutoTokenizer, AutoModelForCausalLM

PAD = "<pad>"
tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
tokenizer.pad_token = PAD
tokenizer.padding_side = "left"  # pad on the left so generation continues from the real tokens

model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")
model.cuda()

doc1 = "def count_words(filename):\n"
doc2 = 'def count_words(filename):\n    """Count the words in the file"""'

# Pad the shorter document to the length of the longer one and move the batch to GPU
dct = tokenizer([doc1, doc2], padding="longest", truncation=True, return_tensors="pt").to("cuda")
hypotheses_batch = model.generate(**dct, do_sample=True, temperature=0.2, top_p=0.95, max_length=256)
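
To get strings back out of the batch, decoding should look roughly like this (just a sketch - I haven't verified this exact snippet):

# Decode each generated sequence; skip_special_tokens drops <pad> and other special tokens.
outputs = tokenizer.batch_decode(hypotheses_batch, skip_special_tokens=True)
for out in outputs:
    print(out)
    print("=" * 40)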

ID 0 would likely work as well - I suspect any ID which is not used in the inputs would be fine, so long as the attention masks are set up properly (so that pad tokens are not attended to, which e.g. model.generate does when used as above). ID 0 is a separate BOS token (<s>) defined by fairseq (which we used to train the model) but unused in our models, I believe. Sorry for not setting this up in a cleaner way!
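
One quick way to sanity-check the masking (again just a sketch of what I'd expect, assuming the "<pad>" string resolves to the ID mentioned above):

# With padding_side="left", the shorter doc1 is padded at the start and its
# attention mask should be 0 at those positions, so pads are not attended to.
print(tokenizer.pad_token_id)     # expect 1 if "<pad>" maps to the trained pad ID
print(dct["input_ids"][0])        # doc1's IDs, left-padded
print(dct["attention_mask"][0])   # 0s over the pad positions, 1s over real tokens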

aissak21 commented 2 years ago

Okay yes, will do! In the original padding setup we have the following: PreTrainedTokenizerFast(name_or_path='facebook/incoder-1B', vocab_size=50261, model_max_len=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>'})

With the changes above, I get the following: PreTrainedTokenizerFast(name_or_path='facebook/incoder-1B', vocab_size=50261, model_max_len=2048, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'pad_token': '<pad>'})

Just wondering whether overriding the padding side to 'left' will matter for anything as we move forward.

To give the whole thing a go, I also followed through with training. One would normally post-process their tokenized_dataset as follows: tokenized_datasets = tokenized_datasets.rename_column("label", "labels"). However, the only features in my dataset are ['text', 'input_ids', 'token_type_ids', 'attention_mask']. Since there is no corresponding "label" column, I didn't rename it (I assume that column exists for classification tasks, and I'm doing code generation). I did try renaming "token_type_ids" to "labels", and also tried using just ['input_ids', 'attention_mask'] (dropping "token_type_ids"), but I ran into errors in both cases, i.e. my batch did not contain any labels, so my outputs did not have a real loss.

Could you also specify which features should go into the model for training?

aissak21 commented 2 years ago
[Screenshot of the generated output]

So it works indeed. I'm not sure if the decoder is repeating outputs or if there's a better way to output this, but here is what I got!

dpfried commented 2 years ago

Great! Re: repetitions, that's expected - it does sometimes generate minor variations on previously-written functions, especially with low temperatures (like 0.2) in sampling.

I'm not totally sure how to fix the inputs in your training setup, but if it's useful here is some code that I used to finetune InCoder on the APPS dataset, based on some code from Hendrycks et al.: https://github.com/dpfried/apps/blob/main/train/tune_apps_gpt.py

I actually wasn't able to get padding to work during training (just during inference), as you can see from the commented-out DataCollatorWithPadding. I forget whether I ran into the same issue that you did.

If you're able to get it to work please let me know! I'll do the same - will hopefully have some time in a week or two to look at this more.

aissak21 commented 2 years ago

Amazing. Thank you. I will let you know for sure.

aissak21 commented 2 years ago

Hi. So I encountered another surprise before getting there. The "labels" column in my case refers to the need for a target column to train the model, so as a mitigation I copied input_ids as the target. I'm currently at this error: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. (which may partly be a padding/tokenization issue as well), but I wanted to ask whether this relates to the model being supervised or unsupervised (i.e. needing a target). Could you chime in on this for InCoder, since my data is simply code and I'm not sure what the target would be :)

dpfried commented 2 years ago

Sorry for the slow response! Yes, I believe you're doing the right thing for the target/labels (using input_ids), since model.forward uses them to calculate the loss in this way: https://github.com/huggingface/transformers/blob/d0acc9537829e7d067edbb791473bbceb2ecf056/src/transformers/models/xglm/modeling_xglm.py#L908
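
So, concretely, copying the input IDs into a labels column at tokenization time should be enough for the loss to be computed (just a sketch along the lines of your setup - the model shifts the labels internally, so no manual offsetting is needed):

def tokenize_function(examples):
    tokenized = tokenizer(examples["text"], truncation=True)
    # Use the input IDs themselves as labels; the forward pass shifts them by one.
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4)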

I'm pretty sure that this error you're seeing is the one that I ran into as well but I haven't yet had a chance to try to debug it, sorry. The HuggingFace forums may be the most helpful if you have a chance to ask there.

aissak21 commented 2 years ago

No worries. Thank you for everything. I shall do so. Will let you know if I come about anything as well.

anhnguyen7198 commented 1 year ago

@dpfried @aissak21 any updates on this issue?