Closed xcluo closed 2 years ago
@rlebras
Sequential training did not hurt performance in any of our experiments. However, while our tasks covered a broad range of domains, they were all multiple choice.
In the extreme, sequential training for long enough on totally random data seems like it would hurt performance. Based on the Unicorn experiments, I expect that sequential training will not hurt performance in most practical situations; but as an empirical finding, it's always possible there are important scenarios our experiments didn't cover.
Phenomenon: In the cost equivalent curves of Figure 2, sequential training uniformly outperforms the single-task baseline (e.g., when the target task is winogrande). In my reproduction, however, the single-task baseline always outperforms all sequential-training runs (with sequential-training update steps varying from 5k to 50k in intervals of 5k), which contradicts the conclusion of Table 1 in the UNICORN paper.

Experiment Setting:
Question:
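For anyone comparing against the Figure 2 curves, here is a minimal sketch of how a cost equivalent curve can be computed: invert the baseline's learning curve to find, for each training budget of the method, the baseline budget that reaches the same score. The accuracy numbers below are illustrative placeholders, not values from the paper, and the function name is mine, not from the UNICORN code.

```python
import numpy as np

# Hypothetical learning curves: accuracy vs. number of target-task
# training examples. Values are made up for illustration.
budgets = np.array([100.0, 1_000.0, 10_000.0, 100_000.0])
baseline_acc = np.array([0.55, 0.62, 0.70, 0.76])    # single-task fine-tuning
sequential_acc = np.array([0.60, 0.67, 0.73, 0.78])  # sequential training

def cost_equivalent(budgets, baseline_acc, method_acc):
    """For each method budget, return the baseline budget that reaches the
    same accuracy, by inverting the baseline learning curve with
    piecewise-linear interpolation in log-budget space."""
    log_b = np.log10(budgets)
    # np.interp needs increasing x, which holds if accuracy is monotone.
    # Accuracies above the baseline's maximum clamp to the largest budget.
    equiv_log_b = np.interp(method_acc, baseline_acc, log_b)
    return 10.0 ** equiv_log_b

equiv = cost_equivalent(budgets, baseline_acc, sequential_acc)
# If sequential training helps, the baseline needs at least as many
# examples to match it at every budget: equiv[i] >= budgets[i].
print(equiv)
```

If your reproduction shows the opposite (the single-task baseline winning everywhere), the equivalent-budget array would fall below the diagonal instead of above it, which is a quick sanity check on which direction your curves disagree with the paper's.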