ManifoldRG / NEKO

In-progress implementation of a GATO-style generalist multimodal model capable of image, text, RL, and robotics tasks
https://discord.gg/brsPnzNd8h
GNU General Public License v3.0

Better concatenation and individual metrics when using multiple text datasets #22

Open · bhavul opened this issue 9 months ago

bhavul commented 9 months ago

For the text task, when we have multiple datasets, the concatenation strategy could be moved to more sophisticated logic by using Hugging Face's built-in concatenation.
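
As a rough sketch of what that could look like (the specific datasets here are illustrative placeholders, not necessarily the ones NEKO uses), Hugging Face's `concatenate_datasets` merges datasets that share a schema:

```python
# Minimal sketch of concatenating multiple text datasets with Hugging Face.
# The dataset choices are placeholders; any two with identical columns work.
from datasets import load_dataset, concatenate_datasets

wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
bookcorpus = load_dataset("bookcorpus", split="train")

# Both expose a single "text" column, so the schemas already match.
combined = concatenate_datasets([wikitext, bookcorpus])
print(len(wikitext), len(bookcorpus), len(combined))
```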

Further, we may wish to change the evaluation loop to also report per-dataset metrics in addition to the average.

The text task looks good so far; I am curious about the choice / what you think is the best way to handle multiple datasets. Are there speed benefits to the process here of concatenating the datasets? If we had separate tasks, we would also want to calculate the total tokens for each task and proportionally decide how much of each batch comes from each task based on token counts, but we don't have to worry about this with your approach. There does seem to be an edge case where concatenation will fail if the columns are not named the same: https://huggingface.co/docs/datasets/process#concatenate
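
One way to guard against that edge case, as a sketch assuming each dataset keeps its raw text in a single string column (the column names are illustrative), is to normalize schemas before concatenating:

```python
# Sketch: align column names and drop extras so concatenation succeeds.
from datasets import concatenate_datasets

def align_and_concat(datasets_list, text_column="text"):
    aligned = []
    for ds in datasets_list:
        # Rename whichever column holds the raw text to a common name.
        if text_column not in ds.column_names:
            # Assumes the first column is the text; adjust per dataset.
            ds = ds.rename_column(ds.column_names[0], text_column)
        # Drop any extra columns so the schemas match exactly.
        extra = [c for c in ds.column_names if c != text_column]
        ds = ds.remove_columns(extra)
        aligned.append(ds)
    return concatenate_datasets(aligned)
```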

One thing that may be useful, if we have multiple concatenated datasets, is to compute metrics specific to each separate dataset during evaluation. E.g., we would want a separate perplexity score for wikitext vs. the Pile, not just the average of both. Potentially, after concatenating, we could maintain start and end indices for each dataset, e.g. the Pile spans 0 to 200M and the other dataset spans (200M + 1) to 400M, so we can attribute each sample to its source dataset and aggregate their metrics separately.
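
A rough sketch of that index-range bookkeeping, assuming the eval loop can produce a per-sample loss in concatenation order (all names here are hypothetical):

```python
# Sketch: record where each dataset starts and ends in the concatenated
# whole, then bucket per-sample losses by range to get per-dataset metrics.
import math

def build_boundaries(datasets_list, names):
    """Return [(name, start, end)] offsets into the concatenated dataset."""
    boundaries, offset = [], 0
    for name, ds in zip(names, datasets_list):
        boundaries.append((name, offset, offset + len(ds)))
        offset += len(ds)
    return boundaries

def per_dataset_perplexity(per_sample_loss, boundaries):
    """Aggregate mean loss into a perplexity score for each source dataset."""
    results = {}
    for name, start, end in boundaries:
        losses = per_sample_loss[start:end]
        mean_loss = sum(losses) / len(losses)
        results[name] = math.exp(mean_loss)
    return results
```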

Another strategy: during training we just track the average, and after training finishes we load the model (e.g. via eval.py) and run it over each of the tasks separately, with text_datasets={the specific dataset you want your eval metrics over}. This may be inconvenient, though.
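
That alternative might look like the loop below; `load_checkpoint` and `evaluate_model` are hypothetical stand-ins, since eval.py's actual interface isn't shown in this thread:

```python
# Sketch of the post-training alternative: load the checkpoint once, then run
# the existing eval loop once per dataset and report each result separately.
def load_checkpoint(path):
    ...  # restore the trained model (placeholder)

def evaluate_model(model, text_datasets):
    ...  # run the eval loop over only these datasets, return a metrics dict

model = load_checkpoint("checkpoints/latest.pt")
for name in ["wikitext", "the_pile"]:
    print(name, evaluate_model(model, text_datasets=[name]))
```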

_Originally posted by @daniellawson9999 in https://github.com/ManifoldRG/NEKO/pull/1#discussion_r1299509872_

Shravya-Kasturi commented 6 months ago

Hi @bhavul, I would like to work on this issue. Is there anything I should know before I start?

pritam5756 commented 4 weeks ago

Hi @bhavul, is this issue fixed?

harshsikka commented 3 weeks ago

@Pritam-hakingmaster this is a good issue to get started with. The basic concatenation is implemented using Hugging Face datasets (see line 29 in gato/tasks/text_task.py).

Individual metrics per dataset are an open challenge that no one has taken on yet; worth doing.

pritam5756 commented 3 weeks ago

I will try my best.