bigscience-workshop / evaluation

Code and Data for Evaluation WG

Refactor overall directory structure #52

Closed · jaketae closed this 3 years ago

jaketae commented 3 years ago

This PR refactors the overall codebase to streamline the evaluation pipeline. The proposed interface is now:

python evaluation/eval.py --model_name_or_path gpt2 --eval_tasks lambada tydiqa_secondary wmt

eval.py is the main driver script and the entry point into the repository. Each evaluation task inherits from the AutoTask class. AutoTask.evaluate() will run the evaluation with the specified model and produce metrics.
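A minimal sketch of how the pieces could fit together (only eval.py, AutoTask, and evaluate() come from the description above; the class body and the example subclass are illustrative assumptions):

```python
# Illustrative sketch only -- the actual AutoTask implementation may differ.
from abc import ABC, abstractmethod


class AutoTask(ABC):
    """Base class that every evaluation task inherits from."""

    def __init__(self, model_name_or_path: str):
        self.model_name_or_path = model_name_or_path
        self.metrics: dict = {}

    @abstractmethod
    def evaluate(self) -> dict:
        """Run the evaluation for this task and return its metrics."""


class Lambada(AutoTask):
    """Hypothetical task subclass; dataset and model plumbing is elided."""

    def evaluate(self) -> dict:
        # ... load the LAMBADA data, run the model, score the predictions ...
        self.metrics = {"accuracy": None}  # placeholder
        return self.metrics
```

Presumably eval.py then loops over the requested --eval_tasks, instantiates each task with the given model, and calls evaluate().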

jaketae commented 3 years ago

If someone runs evaluation with multiple tasks, should we dump a single JSON file with aggregated results, or dump one JSON file per task? I'm personally more inclined towards the former, but curious to hear what other people think. @wilsonyhlee @tttyuntian

wilsonyhlee commented 3 years ago

> If someone runs evaluation with multiple tasks, should we dump a single JSON file with aggregated results, or dump one JSON file per task? I'm personally more inclined towards the former, but curious to hear what other people think. @wilsonyhlee @tttyuntian

I tend to prefer a "save as you go" pattern when it comes to large-scale model evals, primarily because the compute is costly and one can run into unexpected uncaught exceptions on cloud-based GPUs (network, memory, etc.). Outputting individual JSONs allows us to resume an eval job without losing progress should an exception occur. And it's trivial for the user to aggregate all the individual JSONs into a single JSON; alternatively, we can pretty easily add that aggregation at the end of the script and have an all_tasks.json next to the individual JSONs.
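For concreteness, the per-task "save as you go" pattern could look something like the sketch below (the function names and output directory are made up for illustration):

```python
import json
import os


def save_task_result(task_name: str, metrics: dict, output_dir: str = "outputs") -> None:
    """Write one JSON file per task so completed results survive a crash."""
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, f"{task_name}.json"), "w") as f:
        json.dump(metrics, f, indent=2)


def already_done(task_name: str, output_dir: str = "outputs") -> bool:
    """A resumed job can skip any task whose result file already exists."""
    return os.path.exists(os.path.join(output_dir, f"{task_name}.json"))
```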

What do you think?

jaketae commented 3 years ago

Ah, the possibility of exceptions or odd interruptions is a good point. What do you think about "append as we go"? Namely, we can have a single file that aggregates all results, but implement the saving logic in such a way that the result of each task evaluation is appended to that file upon completion. That way, even if some error occurs, we will still have the aggregated results up to that point.
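A rough sketch of "append as we go" with a single aggregated file (the file name and structure are assumptions):

```python
import json
import os


def append_result(task_name: str, metrics: dict, path: str = "all_results.json") -> None:
    """Re-read the aggregate file, add this task's metrics, and write it back."""
    results = {}
    if os.path.exists(path):
        with open(path) as f:
            results = json.load(f)
    results[task_name] = metrics
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
```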

wilsonyhlee commented 3 years ago

> Ah, the possibility of exceptions or odd interruptions is a good point. What do you think about "append as we go"? Namely, we can have a single file that aggregates all results, but implement the saving logic in such a way that the result of each task evaluation is appended to that file upon completion. That way, even if some error occurs, we will still have the aggregated results up to that point.

We certainly could. My only concern is that if we ever want to distribute these tasks across multiple machines / GPUs, we could run into an exception when two processes try to open and write to the same file simultaneously (drawing from past bad experiences...).

Saving multiple files is definitely not very elegant, but I do think it's the option less likely to break from a random exception. What do you think?

jaketae commented 3 years ago

That's a good point. I was operating under the assumption that we would be running this on a single GPU with 48 GB of VRAM. I don't have much experience with distributed systems, but I can see how they could make things more complicated.

Do you think it would be helpful to ask for input from people who would be running this, i.e. the modeling group?

tttyuntian commented 3 years ago

> If someone runs evaluation with multiple tasks, should we dump a single JSON file with aggregated results, or dump one JSON file per task? I'm personally more inclined towards the former, but curious to hear what other people think. @wilsonyhlee @tttyuntian

> I tend to prefer a "save as you go" pattern when it comes to large-scale model evals, primarily because the compute is costly and one can run into unexpected uncaught exceptions on cloud-based GPUs (network, memory, etc.). Outputting individual JSONs allows us to resume an eval job without losing progress should an exception occur. And it's trivial for the user to aggregate all the individual JSONs into a single JSON; alternatively, we can pretty easily add that aggregation at the end of the script and have an all_tasks.json next to the individual JSONs.
>
> What do you think?

I would vote for this "save as you go + aggregation at the end" approach. If we want to run the experiments in parallel on multiple GPUs, this approach is one of the easiest and safest solutions.
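The final aggregation step could be as simple as the sketch below (all_tasks.json is the name floated earlier in the thread; the output directory is an assumption):

```python
import glob
import json
import os


def aggregate_results(output_dir: str = "outputs") -> None:
    """Merge the individual per-task JSONs into a single all_tasks.json."""
    aggregated = {}
    for path in glob.glob(os.path.join(output_dir, "*.json")):
        task_name = os.path.splitext(os.path.basename(path))[0]
        if task_name == "all_tasks":
            continue  # skip an aggregate left over from a previous run
        with open(path) as f:
            aggregated[task_name] = json.load(f)
    with open(os.path.join(output_dir, "all_tasks.json"), "w") as f:
        json.dump(aggregated, f, indent=2)
```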

But yes, we should double-check with, e.g., the modeling group whether the final evaluation script will run on multiple GPUs. If the answer is yes, we should think about whether to run each task on multiple GPUs one at a time, or to run multiple tasks in parallel.

What do you think?