declare-lab / flan-alpaca

This repository contains code for extending the Stanford Alpaca synthetic instruction tuning to existing instruction-tuned models such as Flan-T5.
Apache License 2.0

In ShareGPT, why is the conversation from the human accumulated? #11

Open qmpham opened 1 year ago

qmpham commented 1 year ago

[screenshot] Line 329, data_loading.py

chiayewken commented 1 year ago

Hi, to train the model to generate GPT-like responses, we set the target sequence to the GPT response and the input/source sequence to the previous dialog history.
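A minimal sketch of this pairing, assuming a simple "role: text" concatenation of the dialog history (the actual formatting in data_loading.py may differ):

```python
# Flatten a ShareGPT-style conversation into (source, target) training pairs:
# the target is each GPT reply, and the source is every turn that came before it.
conversation = [
    {"from": "human", "value": "What is instruction tuning?"},
    {"from": "gpt", "value": "It fine-tunes a model on instruction-response pairs."},
    {"from": "human", "value": "Why does it help?"},
    {"from": "gpt", "value": "It teaches the model to follow natural-language instructions."},
]

pairs = []
history = []
for turn in conversation:
    if turn["from"] == "gpt":
        # Source = accumulated dialog history, target = the GPT response.
        pairs.append({"source": "\n".join(history), "target": turn["value"]})
    # Every turn (human and gpt) is appended to the running history,
    # so later targets see all earlier context.
    history.append(f'{turn["from"]}: {turn["value"]}')

for p in pairs:
    print(p["source"], "->", p["target"])
```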

qmpham commented 1 year ago

But LLaMA has a maximum input length of only 2048 tokens.

chiayewken commented 1 year ago

This can be handled by the data loader/tokenizer. For example, we truncate the input on the left side if it exceeds the max length:

https://github.com/declare-lab/flan-alpaca/blob/c90aad711df784ad3ca2336ac387dec72a9f7192/data_loading.py#L151
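For illustration, here is a minimal sketch of left-side truncation using the Hugging Face tokenizers API (the model name is only an example; the linked data_loading.py code may implement truncation differently):

```python
from transformers import AutoTokenizer

# Truncate from the left so the most recent dialog turns are kept.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
tokenizer.truncation_side = "left"

long_dialog = "human: ...\ngpt: ...\nhuman: final question?"
encoded = tokenizer(
    long_dialog,
    max_length=2048,   # the model's context limit
    truncation=True,
    return_tensors="pt",
)
print(encoded.input_ids.shape)  # at most 2048 tokens; excess is dropped from the left
```

With `truncation_side = "left"`, any overflow removes the oldest turns first, so the final question right before the target response stays in the input.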

qmpham commented 1 year ago

Yes, I understand. But why build such a long input when the model's capacity is only 2048 tokens? You risk truncating the question that the target response addresses.

chiayewken commented 1 year ago

That's true; the dialog commonly exceeds the maximum sequence length during training. However, we can mitigate this by truncating inputs on the left side, so that the most recent dialog history on the right is preserved:

https://github.com/declare-lab/flan-alpaca/blob/c90aad711df784ad3ca2336ac387dec72a9f7192/data_loading.py#L136