facebookresearch / DomainBed

DomainBed is a suite to test domain generalization algorithms
MIT License

Question about the train & val splits #133

Closed: X-funbean closed this 5 months ago

X-funbean commented 1 year ago

Hi, in the code for splitting the dataset into train and val sets, it seems that the samples in a domain are randomly split without regard to their categories. This may cause a problem: the train or val split may not contain all categories (e.g., the val set may contain only 100 categories, while the domain has 345 categories in total). I wonder whether this affects the effectiveness of hyperparameter search and the results of leave-one-domain-out evaluation.

import numpy as np

def split_dataset(dataset, n, seed=0):
    """
    Return a pair of datasets corresponding to a random split of the given
    dataset, with n datapoints in the first dataset and the rest in the
    second, using the given random seed.
    """
    assert n <= len(dataset)
    # Shuffle all example indices with a fixed seed, then cut at n;
    # class labels play no role in the split.
    keys = list(range(len(dataset)))
    np.random.RandomState(seed).shuffle(keys)
    keys_1 = keys[:n]
    keys_2 = keys[n:]
    return _SplitDataset(dataset, keys_1), _SplitDataset(dataset, keys_2)
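
For illustration, a class-stratified variant might look like the sketch below. This is my own illustration, not DomainBed's code: stratified_split_dataset is a hypothetical helper, and it assumes each dataset item is an (x, y) pair so labels can be read as dataset[i][1].

import collections
import numpy as np

def stratified_split_dataset(dataset, n, seed=0):
    """
    Like split_dataset, but draws keys_1 proportionally from each class,
    so every class present in the dataset appears in both splits whenever
    the split sizes allow it.
    """
    assert n <= len(dataset)
    rng = np.random.RandomState(seed)
    # Group example indices by class label.
    by_class = collections.defaultdict(list)
    for i in range(len(dataset)):
        by_class[dataset[i][1]].append(i)
    keys_1 = []
    frac = n / len(dataset)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take a proportional share of each class, at least one example.
        keys_1.extend(indices[:max(1, round(frac * len(indices)))])
    # Rounding can leave slightly more or fewer than n keys; trim or top up.
    rng.shuffle(keys_1)
    if len(keys_1) > n:
        keys_1 = keys_1[:n]
    else:
        leftover = [i for i in range(len(dataset)) if i not in set(keys_1)]
        rng.shuffle(leftover)
        keys_1 += leftover[:n - len(keys_1)]
    keys_2 = [i for i in range(len(dataset)) if i not in set(keys_1)]
    return _SplitDataset(dataset, keys_1), _SplitDataset(dataset, keys_2)

The "at least one example per class" rule guarantees coverage when n is at least the number of classes, at the cost of slightly distorting the proportions for very rare classes.
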
piotr-teterwak commented 5 months ago

Hi @X-funbean,

You're right that the code does not consider categories when splitting randomly, but I'm not sure how much this affects accuracy: if a class is so infrequent that it might be missing from the validation split entirely, it likely will not change overall accuracy much either.

An interesting follow-up question is whether the class distribution of the source data is similar to the class distribution of the target data. If not, then resampling the validation data by class might be useful; one possible sketch is below.
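
For instance, here is a minimal sketch of such a resampling. It is my own illustration, not part of DomainBed: resample_val_by_class is a hypothetical helper, target_dist is an assumed mapping from class label to target probability (estimated elsewhere, e.g. from target-domain metadata), and labels are assumed readable as dataset[i][1].

import numpy as np

def resample_val_by_class(dataset, target_dist, seed=0):
    """
    Bootstrap-resample a validation set so that its class proportions
    approximate target_dist (class label -> probability).
    """
    rng = np.random.RandomState(seed)
    labels = np.array([dataset[i][1] for i in range(len(dataset))])
    # Importance-weight each example: target probability of its class
    # divided by the empirical probability of that class in the split.
    probs = np.zeros(len(dataset))
    for c, p in target_dist.items():
        mask = labels == c
        if mask.any():
            probs[mask] = p / mask.mean()
    probs /= probs.sum()
    # Draw a same-size bootstrap sample under the reweighted distribution.
    keys = rng.choice(len(dataset), size=len(dataset), replace=True, p=probs)
    return _SplitDataset(dataset, keys.tolist())

Classes absent from the validation split simply cannot be recovered this way, which is another argument for stratifying the split in the first place.
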

Closing for now, feel free to reopen if you have more questions.