Victorwz / LongMem

Official implementation of our NeurIPS 2023 paper "Augmenting Language Models with Long-Term Memory".
https://arxiv.org/abs/2306.07174
Apache License 2.0

how to build valid dataset #19

Closed: dasemiao closed this issue 11 months ago

dasemiao commented 1 year ago

I made a Pile dataset, but how do I split off the validation set? With my self-made validation set, I always get the error "Is a directory: '/home/mdz/pywork/LongMem/pile_preprocessed_binary/valid'".
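
A note for readers hitting the same error: fairseq's language-modeling loader looks for valid.bin/valid.idx files inside the data directory, so an "Is a directory" error on a path ending in valid likely means a folder named valid was created instead of binarized split files. Below is a minimal sketch of one way to carve a validation split out of a raw Pile text file and binarize it, assuming one document per line and the standard fairseq-preprocess CLI; the file names are placeholders, not the repo's own scripts.

    # A sketch, not the repo's own tooling: hold out a slice of a raw Pile
    # text file as a validation split, then binarize both splits so that
    # valid.bin/valid.idx files (not a directory named "valid") end up in
    # the data directory. File names below are placeholders.
    import random

    random.seed(0)                # reproducible split
    SRC = "pile_raw.txt"          # hypothetical raw corpus, one document per line
    VALID_FRACTION = 0.0005      # hold out a small slice for validation

    with open(SRC, encoding="utf-8") as src, \
         open("train.txt", "w", encoding="utf-8") as train, \
         open("valid.txt", "w", encoding="utf-8") as valid:
        for line in src:
            # route each document to valid with probability VALID_FRACTION
            (valid if random.random() < VALID_FRACTION else train).write(line)

    # Then binarize both splits in one pass, e.g.:
    #   fairseq-preprocess --only-source \
    #       --trainpref train.txt --validpref valid.txt \
    #       --srcdict dict.txt --destdir pile_preprocessed_binary --workers 8
    # which writes train.bin/train.idx and valid.bin/valid.idx next to dict.txt.

Passing --srcdict reuses an existing dictionary so that train and valid share the same vocabulary; omit it to let fairseq-preprocess build one from the training split.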

Victorwz commented 1 year ago

Can you provide more details to let me reproduce the error?

Bui1dMySea commented 1 year ago

I believe I have solved your problem. I tried to generate a custom dataset following the format the author gives, but when I trained a LongMem model I hit the same error: "Is a directory: 'XXX/longmem/valid'". I think the reason is that when the author wrote the code, the fairseq version was older and valid and test binaries were not required to run. I am now running this code with fairseq version 0.12, so you need to find the training script under the fairseq subfolder, e.g. "xxx/longmem/fairseq/fairseq_cli/train.py", and simply comment out this code:

    # Load valid dataset (we load training data below, based on the latest checkpoint)
    # We load the valid dataset AFTER building the model
    data_utils.raise_if_valid_subsets_unintentionally_ignored(cfg)
    if cfg.dataset.combine_valid_subsets:
        task.load_dataset("valid", combine=True, epoch=1)
    else:
        for valid_sub_split in cfg.dataset.valid_subset.split(","):
            task.load_dataset(valid_sub_split, combine=False, epoch=1)

In the official code, this is lines 128 through 133.
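
For reference, the commented-out span in fairseq_cli/train.py would look like the sketch below (against fairseq 0.12; exact line numbers can differ between checkouts). Since this skips loading the validation split entirely, you may also want to pass fairseq's --disable-validation flag so nothing tries to read the missing split later.

    # Load valid dataset (we load training data below, based on the latest checkpoint)
    # We load the valid dataset AFTER building the model
    # -- commented out so training starts without valid.bin/valid.idx --
    # data_utils.raise_if_valid_subsets_unintentionally_ignored(cfg)
    # if cfg.dataset.combine_valid_subsets:
    #     task.load_dataset("valid", combine=True, epoch=1)
    # else:
    #     for valid_sub_split in cfg.dataset.valid_subset.split(","):
    #         task.load_dataset(valid_sub_split, combine=False, epoch=1)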