Thanks for providing the script for reading the data into a dictionary. Could you please provide an extension showing how this dictionary is actually loaded as a torch dataset that can be used for fine-tuning the base LLM? Have you defined a custom dataset class for this? Do you define a single dataloader iterating over the full concatenation of datasets, or separate dataloaders per dataset? And how is data serialisation implemented?
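For concreteness, here is the kind of setup I have in mind (a minimal sketch; the `DictDataset` name, the field names `input_ids`/`labels`, and the `ConcatDataset` choice are my own assumptions, not something taken from your script):

```python
import torch
from torch.utils.data import Dataset, DataLoader, ConcatDataset


class DictDataset(Dataset):
    """Wraps a dict of parallel lists, e.g. {"input_ids": [...], "labels": [...]}."""

    def __init__(self, data: dict):
        self.data = data
        # All values are assumed to be lists of equal length.
        self.length = len(next(iter(data.values())))

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Return one example as a dict of tensors; the default collate_fn
        # stacks these into batched tensors.
        return {k: torch.tensor(v[idx]) for k, v in self.data.items()}


# Example: two small per-source datasets concatenated into a single loader.
d1 = DictDataset({"input_ids": [[1, 2], [3, 4]], "labels": [[1, 2], [3, 4]]})
d2 = DictDataset({"input_ids": [[5, 6]], "labels": [[5, 6]]})
loader = DataLoader(ConcatDataset([d1, d2]), batch_size=2, shuffle=False)
batch = next(iter(loader))
```

Is this roughly what you do, or do you keep one `DataLoader` per source dataset instead of concatenating?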