-
Error log: /mnt/tet/OpenChatKit-main/training
--model-name /mnt/tet/OpenChatKit-main/training/../pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/ --tokenizer-name /mnt/tet/OpenChatKit-…
-
Hi,
Thanks for the great work on evaluating LMs.
I am trying to evaluate GPT2 on the squad_v2 dataset, but I got the following error.
```
Traceback (most recent call last):
File "/nfs/projects…
-
I have evaluated LLaMA (7B, 13B, and 30B) on most of the tasks available in this library, and the results are bad for some tasks. I will give some examples with the 7B model. I haven't checked all the r…
-
We've observed that some tasks show no learning at all over the first 100B tokens of pretraining. One suspicion we have is that some of these tasks are bugged.
The current task list we run:
```python
arc_c…
-
- [x] arc_challenge | acc
- [x] arc_challenge | acc_norm
- [x] arc_easy | acc
- [x] arc_easy | acc_norm
- [x] boolq | acc
- [x] copa | acc
- [x] headqa | acc
- [x] headqa | acc_norm
- [x] hell…
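Since the checklist tracks both `acc` and `acc_norm` for several tasks, it may help to recall how the two differ: `acc` picks the multiple-choice option with the highest total log-likelihood, while `acc_norm` normalizes each option's log-likelihood by its byte length, so longer completions are not penalized just for being longer. The sketch below illustrates the idea with made-up log-likelihood values; the function name and numbers are illustrative, not the library's actual implementation.

```python
def pick_answer(loglikelihoods, completions, normalize=False):
    """Select the index of the best-scoring answer option.

    acc     : argmax of the raw total log-likelihood.
    acc_norm: argmax of log-likelihood divided by completion byte length.
    """
    scores = []
    for ll, text in zip(loglikelihoods, completions):
        if normalize:
            scores.append(ll / len(text.encode("utf-8")))
        else:
            scores.append(ll)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example (hypothetical log-likelihoods): the longer completion has a
# lower *total* log-likelihood but a better *per-byte* score, so the two
# metrics disagree on which option "wins".
lls = [-12.0, -20.0]
opts = ["dog", "a very large golden retriever"]
print(pick_answer(lls, opts, normalize=False))  # raw acc picks index 0
print(pick_answer(lls, opts, normalize=True))   # acc_norm picks index 1
```

A task where the correct answers are systematically longer than the distractors can therefore look much worse under `acc` than under `acc_norm`, which is one thing to check before concluding a task is bugged.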
-
It seems that the released checkpoints for MathQA and SVAMP have significantly worse performance than reported in the paper.
-
- [x] arc_challenge
- [x] arc_easy
- [x] boolq
- [x] copa
- [x] headqa (This is Spanish. WTF?)
- [x] hellaswag
- [ ] lambada
- [x] logiqa
- [x] mathqa
- [x] mc_taco
- [x] mrpc
-…
-
Consider removing eval dataset contamination.
We would like to make sure that no downstream eval data can be memorized by the model because it appears in the sampled OSCAR files.
We would want to remov…
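One common way to do this is n-gram overlap filtering: build an index of every n-gram that occurs in any eval document, then drop (or flag) training documents that share an n-gram with it. The sketch below is a minimal version of that idea; the n-gram size, whitespace tokenization, and function names are assumptions, not a prescribed pipeline.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_eval_index(eval_docs, n=13):
    """Collect every n-gram that appears in any eval document."""
    index = set()
    for doc in eval_docs:
        index |= ngrams(doc.lower().split(), n)
    return index

def is_contaminated(train_doc, eval_index, n=13):
    """Flag a training document sharing any n-gram with the eval set."""
    return bool(ngrams(train_doc.lower().split(), n) & eval_index)

# Toy usage with n=5 so the example stays short.
eval_index = build_eval_index(
    ["the quick brown fox jumps over the lazy dog"], n=5)
print(is_contaminated("he said the quick brown fox jumps high",
                      eval_index, n=5))  # True: shares a 5-gram
print(is_contaminated("completely unrelated training text here",
                      eval_index, n=5))  # False
```

For a corpus the size of OSCAR one would hash the n-grams (or use a Bloom filter) rather than keep raw tuples in memory, but the filtering logic is the same.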
-
We would like to find or create an eval dataset that tests mathematical knowledge. The GRE exams may be a good source of material.
- [ ] Data processing code implemented
- [ ] Evaluation implemen…
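To make the two checklist items concrete, here is one possible shape for the processed data and the evaluation step: a JSONL file with question/answer pairs, scored by normalized exact match. The record schema, field names, and scoring rule are assumptions for illustration, not a decided format.

```python
import json

# Hypothetical record format: one JSON object per line.
SAMPLE_JSONL = """\
{"question": "If 3x + 2 = 11, what is x?", "answer": "3"}
{"question": "What is 15% of 200?", "answer": "30"}
"""

def load_examples(jsonl_text):
    """Parse a JSONL string into a list of question/answer dicts."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def exact_match(prediction, answer):
    """Compare answers after light normalization (strip whitespace, lowercase)."""
    return prediction.strip().lower() == answer.strip().lower()

def evaluate(examples, predict_fn):
    """Fraction of examples where the model's answer exactly matches."""
    correct = sum(exact_match(predict_fn(ex["question"]), ex["answer"])
                  for ex in examples)
    return correct / len(examples)

# Toy "model" that always answers "3": scores 1/2 on the sample set.
examples = load_examples(SAMPLE_JSONL)
print(evaluate(examples, lambda q: "3"))  # 0.5
```

Exact match is a reasonable first metric for short numeric answers; free-form GRE-style responses would need more tolerant normalization (e.g. stripping units or equivalent fractions).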
-
It seems there are a bunch of errors when running MathQA on my side.
Can you share your environment details?
I think the allennlp version matters in particular; I use the latest allennlp version as mentio…