There are many good benchmarks in lm-eval-harness [0] that have been verified to work well on GPT-J, so this is a good data source for visualization purposes.
One consideration is that some prompts in the harness can be long, so we might have to save each prompt's values as a sparse matrix, pruning entries below a threshold (many values are near zero anyway).
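As a rough sketch of what that pruning could look like (the function name, threshold, matrix shape, and output filename here are all placeholder assumptions, not anything specified above), one option is to zero out small entries and store the rest in a SciPy CSR matrix:

```python
# Minimal sketch of threshold-pruned sparse storage, assuming the per-prompt
# values arrive as a dense NumPy array. Names and sizes are illustrative only.
import numpy as np
import scipy.sparse as sp

def to_pruned_sparse(values: np.ndarray, threshold: float = 1e-3) -> sp.csr_matrix:
    """Zero out entries with magnitude below `threshold`, keep the rest in CSR format."""
    pruned = np.where(np.abs(values) >= threshold, values, 0.0)
    return sp.csr_matrix(pruned)

# Example: a long prompt yields a large, mostly-near-zero matrix (synthetic here).
dense = np.random.rand(2048, 4096) * (np.random.rand(2048, 4096) < 0.05)
sparse = to_pruned_sparse(dense, threshold=1e-3)
sp.save_npz("prompt_0.npz", sparse)          # compact on-disk representation
print(f"kept {sparse.nnz} of {dense.size} entries")
```

Since most values are near zero, the CSR representation (plus `save_npz`) keeps storage proportional to the number of surviving entries rather than the full prompt size.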
[0] https://github.com/EleutherAI/lm-evaluation-harness