bigscience-workshop / multilingual-modeling

BLOOM+1: Adapting BLOOM model to support a new unseen language
https://arxiv.org/abs/2212.09535
Apache License 2.0

Adding language-specific validation sets to deepspeed #1

Open hadyelsahar opened 2 years ago

hadyelsahar commented 2 years ago

The idea of this issue is to modify the Megatron-DeepSpeed repository code that we use for training all models, in order to track the progress of the validation loss on several validation sets separately. This would allow us to track training progress independently for each language.

Currently, the validation loss is calculated on a single validation set that contains the same language mix as the training data (see here: the 13B-param model training on TensorBoard).

[Screenshot: TensorBoard validation-loss curve for the 13B model]

Useful pointers

Progress

sbmaruf commented 2 years ago

I can review/implement this part.

lintangsutawika commented 2 years ago

My current understanding is that in training.py, the train, validation, and test datasets are loaded by the function build_train_valid_test_data_iterators.

https://github.com/hadyelsahar/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/megatron/training.py#L123-L136

Evaluation is then done here, both for valid_data_iterator and test_data_iterator.

https://github.com/hadyelsahar/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/megatron/training.py#L152-L166

We could make valid_data_iterator a list of per-language data iterators and call evaluate_and_print_results iteratively for each language:

# assuming valid_data_iterator has been changed to a list of per-language iterators
for each_language_data_loader in valid_data_iterator:
    evaluate_and_print_results(
        prefix, forward_step_func,
        each_language_data_loader,
        model,
        eval_metric
    )

Some modification to evaluate_and_print_results will be required so that we save the validation metric for each language separately.
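
As a minimal, self-contained sketch of the per-language metric naming this implies (using a plain torch SummaryWriter rather than Megatron-DeepSpeed's logging internals; the function name, log dir, tags, and loss values are all hypothetical):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/multilingual-valid")  # hypothetical log dir

def report_valid_losses(losses_by_language, iteration):
    # one scalar tag per language keeps the loss curves separable in TensorBoard
    for lang, loss in losses_by_language.items():
        writer.add_scalar(f"lm-loss-validation/{lang}", loss, iteration)

# e.g., after one evaluation pass over each language's validation set:
report_valid_losses({"en": 2.31, "fr": 2.58, "sw": 4.02}, iteration=1000)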

hadyelsahar commented 2 years ago

Currently the code base yields a single validation/test set. There is no support for arguments that specify multiple validation datasets.

My ad-hoc solution is to add an extra argument:

  --extra-valid-data-path [EXTRA_VALID_DATA_PATH ...]
Path to extra validation dataset(s) to be monitored during training. Accepted formats:
1) a single data path,
2) multiple datasets blended into a single validation set: data1-weight data1-path data2-weight data2-path,
3) multiple validation sets, each in the form of (2), separated by commas: data1-weight data1-path data2-weight data2-path, data3-weight data3-path data4-weight data4-path, ...

The idea here is to allow mixing different validation sets on the fly

python pretrain_gpt2.py … --extra-valid-data-path 0.5 en_data, 0.5 fr_data, 0.33 rare1_data 0.33 rare2_data 0.33 rare3_data
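
For illustration, here is a hedged sketch of how this format could be parsed. The argument name matches the proposal above; split_valid_groups and its tokenization details are my assumptions, not the PR's actual implementation:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--extra-valid-data-path", nargs="*", default=None,
                    help="weight/path pairs; ','-separated groups define "
                         "multiple validation sets")

def split_valid_groups(tokens):
    # pair up (weight, path) tokens; a trailing ',' on a path closes a group
    groups, current = [], []
    for weight, path in zip(tokens[0::2], tokens[1::2]):
        current.append((float(weight), path.rstrip(",")))
        if path.endswith(","):
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

args = parser.parse_args(
    "--extra-valid-data-path 0.5 en_data, 0.5 fr_data, "
    "0.33 rare1_data 0.33 rare2_data 0.33 rare3_data".split())
print(split_valid_groups(args.extra_valid_data_path))
# -> [[(0.5, 'en_data')], [(0.5, 'fr_data')],
#     [(0.33, 'rare1_data'), (0.33, 'rare2_data'), (0.33, 'rare3_data')]]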

Any thoughts about a better design?

hadyelsahar commented 2 years ago

Work-in-progress PR sent here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/97