Sentdex / Falcon-LLM

Helper scripts and examples for exploring the Falcon LLM models
Apache License 2.0

Best Practice for Handling Variable-Length Sequences When Training an LLM on a Chatbot Dataset #4

Open HumzaSami00 opened 1 year ago

HumzaSami00 commented 1 year ago

I am currently training Falcon (an LLM) on a chatbot dataset and would appreciate some guidance on handling variable-length sequences. The dataset consists of roughly 500 examples of chat messages exchanged between user 1 and user 2. Each example contains a different number of messages, so the sequence lengths differ. Here are two representative data points:

Datapoint 1 = """user 1 : How are you ?\n user 2 : I am good. \n user 1 : What do you like ? \n user 2 : Apples"""

Datapoint 2 = """user 1 : How are you ?\nuser 2 : I am good.\n user 1 : What do you like in fruits?\n user 2 : Oranges \nuser 1 : Great me too\n user 2 : But sometimes I like mangoes \nuser 1 : seems intresting \n user 2 : Yeah"""

To prepare the dataset for training, I tokenized it with a max_length of 4 tokens for the input_ids and padded the overflowing token chunks accordingly.
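
For reference, this is roughly what my preprocessing looks like (the checkpoint name and the pad-token choice are just illustrative assumptions, not necessarily what anyone should use):

```python
# Rough sketch of my current tokenization step, assuming a Hugging Face fast tokenizer.
from transformers import AutoTokenizer

# Checkpoint name is illustrative; any Falcon checkpoint with a fast tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
tokenizer.pad_token = tokenizer.eos_token  # Falcon's tokenizer has no dedicated pad token

datapoints = [
    "user 1 : How are you ?\nuser 2 : I am good.\nuser 1 : What do you like ?\nuser 2 : Apples",
    # ... the remaining ~500 chat examples
]

encodings = tokenizer(
    datapoints,
    max_length=4,                    # the max_length mentioned above
    truncation=True,
    return_overflowing_tokens=True,  # split long chats into extra fixed-size chunks
    padding="max_length",            # pad the final, shorter chunk of each chat
)
```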

Now, my question is: when a chat example contains fewer than 4 tokens, what is considered best practice? Should I pad these shorter sequences to the maximum length, or is it more suitable to keep them as they are?
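
To make the two options concrete, here is a minimal sketch of what I mean (again assuming the Hugging Face tokenizer from above; the collator is just one possible way to pad at batch time):

```python
short_chat = "user 1 : Hi"  # a chat example with fewer than 4 tokens

# Option A: pad every sequence up front to the fixed max_length
padded = tokenizer(short_chat, max_length=4, truncation=True, padding="max_length")

# Option B: keep it at its natural length and pad dynamically per batch at training time,
# e.g. with DataCollatorForLanguageModeling(tokenizer, mlm=False)
unpadded = tokenizer(short_chat, max_length=4, truncation=True)
```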

I would appreciate any insights or suggestions on the most appropriate approach for handling variable-length sequences in this context.