Closed X-funbean closed 5 months ago
Hi @X-funbean ,
You're right that the code does not take categories into account when splitting randomly, but I'm not sure how much this affects accuracy. If a class is so infrequent that it might not be represented in the validation split, it likely will not change overall accuracy much either.
An interesting follow-up question is whether the class distribution of the source data is similar to the class distribution of the target data. If not, then resampling the validation data by class might be useful.
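To make the resampling idea concrete, here is a minimal sketch, not code from this repository. It assumes the target class frequencies are known or can be estimated, which is often not the case in a leave-one-domain-out setting. The function names `class_frequencies` and `resampling_weights` are hypothetical.

```python
from collections import Counter


def class_frequencies(labels, num_classes):
    """Return the empirical frequency of each class id in 0..num_classes-1."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def resampling_weights(val_labels, target_freqs):
    """Per-sample weights that make the validation class mix match target_freqs.

    Assumption: target_freqs is supplied by the caller (known or estimated).
    Classes absent from val_labels cannot be recovered by reweighting.
    """
    val_freqs = class_frequencies(val_labels, len(target_freqs))
    # Every label y in val_labels occurs at least once, so val_freqs[y] > 0.
    weights = [target_freqs[y] / val_freqs[y] for y in val_labels]
    total = sum(weights)
    return [w / total for w in weights]
```

The returned weights could then be passed to `random.choices(range(len(val_labels)), weights=weights, k=n)` to draw a resampled validation set, or used directly to weight per-sample accuracy.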
Closing for now, feel free to reopen if you have more questions.
Hi, in the code for splitting the dataset into train and val sets, it seems that the samples in a domain are split randomly without regard to their categories. This may cause a problem: the train or val set may not contain all categories (e.g., the val set may contain only 100 categories, while the domain has 345 categories in total). I wonder whether this will impact the effectiveness of the hyperparameter search and the results of the leave-one-domain-out evaluation?
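For illustration, a class-aware (stratified) split could look roughly like the sketch below. This is not the repository's splitting code. The function name `stratified_split` and the `holdout_fraction` and `seed` parameters are hypothetical, and labels are assumed to be available as a plain list indexed by sample.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction=0.2, seed=0):
    """Split sample indices into (train, val) class by class.

    Every class with at least two samples appears in both splits;
    a class with a single sample is kept in train only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        if len(indices) < 2:
            train_idx.extend(indices)
            continue
        # At least one validation sample, but always leave one for training.
        n_val = max(1, round(len(indices) * holdout_fraction))
        n_val = min(n_val, len(indices) - 1)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)
```

Compared with a purely random split, this keeps the per-class proportions of the two sets close to each other, at the cost of needing the labels at split time.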