How to build the training dataset of adaptive learning?

Victorwz / LongMem

Official implementation of our NeurIPS 2023 paper "Augmenting Language Models with Long-Term Memory".

https://arxiv.org/abs/2306.07174

Apache License 2.0


How to build the training dataset of adaptive learning? #20

Closed: ZhibinDuan closed this issue 11 months ago

ZhibinDuan commented 11 months ago

In your paper, the training dataset for memory-adaptive learning contains 26B tokens, but I can't find a description of how this training dataset is built.

Also, the dataset can't be downloaded from its URL: https://the-eye.eu/public/AI/pile/. Can you help me solve these problems? Thank you.

Victorwz commented 11 months ago

Hi, Zhibin. I also noticed that EleutherAI recently took the Pile dataset offline. However, a copy is still available on Hugging Face Datasets. Please check: https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile/tree/main

ZhibinDuan commented 10 months ago

I have downloaded the dataset from https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile/tree/main. It consists of 19 files named pile_train_deduped0-19.jsonl.

However, this is a raw dataset, and I can't use your data processing code to sample a training dataset from it. Can you help me further?

Thank you very much.
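
For readers who hit the same problem, here is a minimal sketch of one way to draw a document sample from raw JSONL shards. It is not the authors' preprocessing code and does not reproduce the 26B-token training set. It assumes each line is a JSON object whose document text is stored under a `"text"` key (an assumption about the raw files, so check one line first). The function names `iter_texts` and `reservoir_sample` and the shard path in the usage comment are hypothetical. Reservoir sampling is used so the shards can be streamed rather than loaded into memory.

```python
import json
import random
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_texts(paths: Iterable[Path], text_key: str = "text") -> Iterator[str]:
    """Yield the text field of every JSON line in the given files, one at a time.

    Assumption: each non-empty line is a JSON object with the document
    stored under `text_key`. Records without a usable string are skipped.
    """
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                text = record.get(text_key)
                if isinstance(text, str) and text:
                    yield text


def reservoir_sample(items: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Uniformly sample up to k items from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Example usage (hypothetical directory; adjust to where the shards were saved):
#   shards = sorted(Path("raw_deduplicated_pile").glob("*.jsonl"))
#   docs = reservoir_sample(iter_texts(shards), k=100_000, seed=42)
```

The sample is uniform over documents, not over tokens, so reaching a specific token budget would still need a tokenizer pass on top of this.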