chaoyi-wu / Finetune_LLAMA

A simple, easy-to-understand guide to fine-tuning LLaMA.

attention mask for different documents in dataset chunk #2

Open waterhorse1 opened 1 year ago

waterhorse1 commented 1 year ago

Hi chaoyi,

Thanks for your great work. I have a question about dataset tokenization in the following code.

https://github.com/chaoyi-wu/Finetune_LLAMA/blob/1d4280e12f584b20cbb92a9f0dfe3a12a5de9bdc/Data_sample/tokenize_dataset.py#L38-L47

From my understanding, this preprocessing means that different documents can end up in the same data chunk. For example, the first document might take 512 tokens and the second document 128 tokens of a 640-token chunk. In that case, generation for the second document should not attend to the first document, so we may need an attention mask that hides the first document while generating the second. Am I correct?
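For illustration, here is a minimal sketch of the kind of mask this would need: a causal, block-diagonal attention mask in which each token can only attend to earlier tokens of its own document. The function name and the plain-PyTorch representation are my own assumptions, not code from this repository.

```python
import torch

def block_diagonal_causal_mask(doc_lengths, chunk_len):
    """Mask where each token may only attend (causally) to tokens
    from its own document inside a packed chunk.
    True = may attend, False = masked out."""
    mask = torch.zeros(chunk_len, chunk_len, dtype=torch.bool)
    start = 0
    for length in doc_lengths:
        end = start + length
        # causal attention restricted to this document's block
        mask[start:end, start:end] = torch.tril(
            torch.ones(length, length, dtype=torch.bool)
        )
        start = end
    return mask

# Example: a 640-token chunk holding a 512-token and a 128-token document
mask = block_diagonal_causal_mask([512, 128], 640)
```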

chaoyi-wu commented 1 year ago

Thanks for your recognition.

Yes, your understanding is correct.

Since this project is a tutorial, the code here mainly aims to keep the main training code simple, avoid some messy padding operations, and make the whole training flow more readable.

In practice, this kind of preprocessing is only suitable for a large, unstructured pre-training corpus. In most cases, you need to replace the dataset Python file with your own and add the correct attention and padding masks based on the characteristics of your data.
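As a rough illustration of that second route (one document per sample plus padding instead of packing), the sketch below uses the Hugging Face tokenizer API. The tokenizer path, maximum length, and helper name are placeholders I chose, not values from this repository.

```python
from transformers import LlamaTokenizer

# Hypothetical local path to a converted LLaMA checkpoint
tokenizer = LlamaTokenizer.from_pretrained("/path/to/llama_hf")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default

def encode_document(text, max_len=512):
    enc = tokenizer(
        text,
        truncation=True,
        max_length=max_len,
        padding="max_length",
        return_tensors="pt",
    )
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore loss on padding tokens
    return {
        "input_ids": enc["input_ids"].squeeze(0),
        "attention_mask": enc["attention_mask"].squeeze(0),
        "labels": labels.squeeze(0),
    }
```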

waterhorse1 commented 1 year ago

@chaoyi-wu Thanks for your answer! I also ran into a problem when running finetune_pp_peft_trainer_lora.sh:

`ValueError: FlatParameter requires uniform requires_grad`

Any idea why this happens?

chaoyi-wu commented 1 year ago

Yes, FSDP with LoRA currently has this bug, and we are going to fix it. You may use DeepSpeed instead if you are working with LoRA.
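For anyone hitting the same error: it typically comes from FSDP flattening trainable LoRA weights and frozen base weights into a single FlatParameter. Below is a rough sketch of switching to DeepSpeed through the Hugging Face Trainer, assuming the script already uses `TrainingArguments`; the ZeRO config values and output path are illustrative only, not taken from this repository.

```python
from transformers import TrainingArguments

# Hypothetical minimal ZeRO stage-2 config; "auto" lets the Trainer fill in
# values from TrainingArguments
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    deepspeed=ds_config,  # replaces the FSDP flags in the launch script
)
```

On recent PyTorch versions, another commonly used workaround is passing `use_orig_params=True` to FSDP (or setting it in the Trainer's `fsdp_config`), which allows parameters with mixed `requires_grad` to share a FlatParameter.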