Question about the train & val splits

facebookresearch / DomainBed

DomainBed is a suite to test domain generalization algorithms

MIT License

1.42k stars 298 forks source link

Hi, in the code for splitting the dataset into train and val sets, it seems that the samples in a domain are randomly splitted without consideration for their categories, which may cause a problem: chances are that the samples in the train or val sets may not contain all categories (e.g., the val sets may only contain 100 categories, while there are 345 categories in total for the domain). I wonder whether this will impact the effectiveness for hyperparameter searching and result for leave-one-domain-out evaluation?

def split_dataset(dataset, n, seed=0):
    """
    Return a pair of datasets corresponding to a random split of the given
    dataset, with n datapoints in the first dataset and the rest in the last,
    using the given random seed
    """
    assert(n <= len(dataset))
    keys = list(range(len(dataset)))
    np.random.RandomState(seed).shuffle(keys)
    keys_1 = keys[:n]
    keys_2 = keys[n:]
    return _SplitDataset(dataset, keys_1), _SplitDataset(dataset, keys_2)

facebookresearch / DomainBed

Question about the train & val splits #133