Open Willmish opened 1 month ago
Sent to Eduardo on 04-06-2024!
Updates on 2024-07-01:
Logical reasoning: Replaced sail/symbolic-instruction-tuning with mnli because performance on the former was at random-guess level.
GSM8K: Evaluation takes a very long time because of its "generative" nature and large dataset size, so it may be dropped.
Retain set: Changed to squad, while evaluation is performed on squadv2 (which is essentially squad plus some "tricky" questions without answers).
@TheRootOf3 Reconsider adding toxigen as well as real toxicity prompts.
Updated 2024-07-01.
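For reference, here is a minimal sketch of how "random-guess-level performance" on a classification-style benchmark could be checked. It is not the evaluation code used in this project. The function name `is_near_chance`, the 1.96 threshold, and the example counts are all hypothetical. The only fixed fact is that mnli has three labels (entailment, neutral, contradiction).

```python
import math


def is_near_chance(correct: int, total: int, num_classes: int, z: float = 1.96) -> bool:
    """Return True if accuracy is statistically indistinguishable from uniform guessing.

    Uses a normal approximation to the binomial distribution; the z threshold
    (1.96, roughly a 95% interval) is an assumed choice, not a project setting.
    """
    chance = 1.0 / num_classes                      # expected accuracy of a uniform random guesser
    accuracy = correct / total                      # observed accuracy
    std_err = math.sqrt(chance * (1.0 - chance) / total)
    return abs(accuracy - chance) <= z * std_err


# Hypothetical counts for illustration only (not measured results).
print(is_near_chance(correct=340, total=1000, num_classes=3))  # 3-way task such as mnli
print(is_near_chance(correct=800, total=1000, num_classes=3))
```

A benchmark where the model stays inside this band gives no usable signal, which is one reasonable criterion for swapping it out.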
Datasets:
Logical reasoning: sail/symbolic-instruction-tuning
google-research-datasets/nq_open or truthfulqa/truthful_qa (repeat-padded)?