Victorwz / LongMem

Official implementation of our NeurIPS 2023 paper "Augmenting Language Models with Long-Term Memory".
https://arxiv.org/abs/2306.07174
Apache License 2.0

how to build valid dataset #19

Closed: dasemiao closed this issue 11 months ago

dasemiao commented 1 year ago

I made a Pile dataset, but how do I split off the validation set? With my self-made validation set, I always get the error "Is a directory: '/home/mdz/pywork/LongMem/pile_preprocessed_binary/valid'".
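
A note for readers hitting the same error: fairseq's language-modeling loader looks for valid.bin/valid.idx files inside the data directory, so an "Is a directory" error on a path ending in valid likely means a folder named valid was created instead of binarized split files. Below is a minimal sketch of one way to carve a validation split out of a raw Pile text file and binarize it, assuming one document per line and the standard fairseq-preprocess CLI; the file names are placeholders, not the repo's own scripts.

    # A sketch, not the repo's own tooling: hold out a slice of a raw Pile
    # text file as a validation split, then binarize both splits so that
    # valid.bin/valid.idx files (not a directory named "valid") end up in
    # the data directory. File names below are placeholders.
    import random

    random.seed(0)                # reproducible split
    SRC = "pile_raw.txt"          # hypothetical raw corpus, one document per line
    VALID_FRACTION = 0.0005      # hold out a small slice for validation

    with open(SRC, encoding="utf-8") as src, \
         open("train.txt", "w", encoding="utf-8") as train, \
         open("valid.txt", "w", encoding="utf-8") as valid:
        for line in src:
            # route each document to valid with probability VALID_FRACTION
            (valid if random.random() < VALID_FRACTION else train).write(line)

    # Then binarize both splits in one pass, e.g.:
    #   fairseq-preprocess --only-source \
    #       --trainpref train.txt --validpref valid.txt \
    #       --srcdict dict.txt --destdir pile_preprocessed_binary --workers 8
    # which writes train.bin/train.idx and valid.bin/valid.idx next to dict.txt.

Passing --srcdict reuses an existing dictionary so that train and valid share the same vocabulary; omit it to let fairseq-preprocess build one from the training split.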

Victorwz commented 1 year ago

Can you provide more details to let me reproduce the error?

Bui1dMySea commented 1 year ago

I believe I have solved your problem. I tried to generate a custom dataset following the format the author gives, but when I trained a LongMem model I hit the same error: "Is a directory: 'XXX/longmem/valid'". I think the reason is that when the author wrote the code, the fairseq version was older and valid and test binaries were not required to run. I am now running this code with fairseq version 0.12, so you need to find the training script under the fairseq subfolder, e.g. "xxx/longmem/fairseq/fairseq_cli/train.py", and simply comment out this code:

    # Load valid dataset (we load training data below, based on the latest checkpoint)
    # We load the valid dataset AFTER building the model
    data_utils.raise_if_valid_subsets_unintentionally_ignored(cfg)
    if cfg.dataset.combine_valid_subsets:
        task.load_dataset("valid", combine=True, epoch=1)
    else:
        for valid_sub_split in cfg.dataset.valid_subset.split(","):
            task.load_dataset(valid_sub_split, combine=False, epoch=1)

In the official code, this is lines 128 through 133.
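
For reference, the commented-out span in fairseq_cli/train.py would look like the sketch below (against fairseq 0.12; exact line numbers can differ between checkouts). Since this skips loading the validation split entirely, you may also want to pass fairseq's --disable-validation flag so nothing tries to read the missing split later.

    # Load valid dataset (we load training data below, based on the latest checkpoint)
    # We load the valid dataset AFTER building the model
    # -- commented out so training starts without valid.bin/valid.idx --
    # data_utils.raise_if_valid_subsets_unintentionally_ignored(cfg)
    # if cfg.dataset.combine_valid_subsets:
    #     task.load_dataset("valid", combine=True, epoch=1)
    # else:
    #     for valid_sub_split in cfg.dataset.valid_subset.split(","):
    #         task.load_dataset(valid_sub_split, combine=False, epoch=1)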