Caiyun-AI / DCFormer

MIT License
185 stars · 15 forks

Something about training on Pile #6

Closed szrrr04 closed 3 months ago

szrrr04 commented 3 months ago

Hello, I'd like to ask about training datasets. I want to make some modifications to the DCFormer model, so I need to compare the model's performance before and after the changes on the same dataset. However, the complete Pile dataset is really too large, and transferring all of it to the supercomputing platform would be very cumbersome. So I'm thinking of training on only 5 subsets of the dataset, approximately 10 GB in total, and then examining the evaluation results. Would this approach be acceptable (for example, comparing the training loss curves)?

Lisennlp commented 3 months ago

I think you need at least 100 GB of data for the experiment.

Lisennlp commented 3 months ago

That said, in our experimental observations, the difference between models is visible with 10 GB in most experiments. For the sake of rigor, of course, the more data the better.
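One practical concern when training on a subset is making sure the baseline and the modified model see exactly the same data. Below is a minimal sketch (not from the DCFormer codebase; the helper name and shard sizes are made up for illustration) of greedily selecting Pile shard files in a fixed order until a byte budget such as 10 GB or 100 GB is reached:

```python
# Sketch (hypothetical helper, not part of the DCFormer repo): pick shard
# files in a deterministic order up to a byte budget, so that both the
# baseline and the modified model train on the identical subset.

def select_shards(shards, budget_bytes):
    """shards: list of (filename, size_in_bytes) tuples, in a fixed order.
    Returns the chosen filenames and their total size in bytes."""
    chosen, total = [], 0
    for name, size in shards:
        if total + size > budget_bytes:
            break
        chosen.append(name)
        total += size
    return chosen, total

# Example with made-up shard names and sizes (4 GiB each):
shards = [("00.jsonl.zst", 4 * 2**30), ("01.jsonl.zst", 4 * 2**30),
          ("02.jsonl.zst", 4 * 2**30), ("03.jsonl.zst", 4 * 2**30)]
names, total = select_shards(shards, 10 * 2**30)  # ~10 GiB budget
```

Keeping the selection deterministic (same shard order, same budget) is what makes the loss curves of the two runs directly comparable.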

szrrr04 commented 3 months ago

Okay, thank you!