Adamliu1 / SNLP_GCW

Collect a list of all datasets used #90

Open Willmish opened 1 month ago

Willmish commented 1 month ago

Updated 2024-07-01.

Datasets:

Used for evaluation:
- MMLU: https://huggingface.co/datasets/hails/mmlu_no_train
- ARC-Challenge: https://huggingface.co/datasets/allenai/ai2_arc
- HellaSwag: https://huggingface.co/datasets/Rowan/hellaswag
- GSM8K (? look below): https://huggingface.co/datasets/openai/gsm8k
- Winogrande: https://huggingface.co/datasets/allenai/winogrande
- TruthfulQA MC2: https://huggingface.co/datasets/truthfulqa/truthful_qa
- PIQA: https://huggingface.co/datasets/ybisk/piqa
- French:
  - FrenchBench:
- Harmfulness:
  - PKU-Alignment/BeaverTails
  - (?) toxigen
  - (?) realtoxicityprompts
- Logical reasoning:
  - lucasmccabe/logiqa
  - ...

Used for unlearning:
- Unlearn Sets:
  - French: AgentPublic/piaf
  - Harmfulness: PKU-Alignment/PKU-SafeRLHF
  - ~~Logical reasoning: sail/symbolic-instruction-tuning~~
  - mnli instead of symbolic IT.
- Retain Sets:
  - squad (with eval on squadv2 because of availability in lm-eval-harness)
  - ~~google-research-datasets/nq_open or truthfulqa/truthful_qa (repeat-padded)?~~
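
For scripting against this list, the entries with a full Hub ID could be kept in a small registry. This is only a sketch: the dictionary names and the `hub_url` helper are hypothetical, the evaluation URLs are the ones listed above, and the URLs built for the unlearn sets assume the same `huggingface.co/datasets/<id>` pattern. Entries given only by a bare name (mnli, squad) are left out because the list does not state their full Hub IDs.

```python
# Hypothetical registry built from the list above; not part of the repo.
HUB_PREFIX = "https://huggingface.co/datasets/"

# Evaluation datasets whose Hub IDs appear as URLs in the list.
EVAL_DATASETS = {
    "MMLU": "hails/mmlu_no_train",
    "ARC-Challenge": "allenai/ai2_arc",
    "HellaSwag": "Rowan/hellaswag",
    "GSM8K": "openai/gsm8k",
    "Winogrande": "allenai/winogrande",
    "TruthfulQA MC2": "truthfulqa/truthful_qa",
    "PIQA": "ybisk/piqa",
}

# Unlearn sets listed with a namespaced ID (URL pattern is an assumption).
UNLEARN_DATASETS = {
    "French": "AgentPublic/piaf",
    "Harmfulness": "PKU-Alignment/PKU-SafeRLHF",
}


def hub_url(dataset_id: str) -> str:
    """Build a Hub dataset URL from a namespaced dataset ID."""
    return HUB_PREFIX + dataset_id


def describe(registry: dict) -> list:
    """Return 'name: url' lines for every entry in a registry."""
    return [f"{name}: {hub_url(ds_id)}" for name, ds_id in registry.items()]


if __name__ == "__main__":
    for line in describe(EVAL_DATASETS) + describe(UNLEARN_DATASETS):
        print(line)
```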

TheRootOf3 commented 4 weeks ago

Sent to Eduardo on 2024-06-04!

TheRootOf3 commented 2 days ago

Updates on 2024-07-01:

- Logical reasoning: Replaced sail/symbolic-instruction-tuning with mnli because performance on the former was at random-guess level.
- GSM8K: Evaluation takes a very long time because of its "generative" nature and large dataset size, so it may be dropped.
- Retain set: Changed to squad, while evaluation is performed on squadv2 (which is essentially squad plus some "tricky" questions without answers).
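
To make the squad vs. squadv2 difference concrete, here is a minimal sketch that separates answerable from unanswerable questions. It assumes the record layout of the Hugging Face `squad_v2` dataset, where an unanswerable question has an empty `answers["text"]` list; the sample records and the `is_answerable` helper are hypothetical and only illustrate the shape.

```python
def is_answerable(example: dict) -> bool:
    """True if the record has at least one gold answer (squad-style)."""
    return len(example["answers"]["text"]) > 0


# Hypothetical records in the assumed squad_v2 layout.
samples = [
    {
        "id": "a1",
        "question": "Where is the Eiffel Tower?",
        "answers": {"text": ["Paris"], "answer_start": [21]},
    },
    {
        "id": "u1",
        "question": "Who designed the tower's moat?",
        "answers": {"text": [], "answer_start": []},  # no answer in context
    },
]

# The squad-like subset: questions that do have an answer.
answerable = [ex for ex in samples if is_answerable(ex)]
# The extra "tricky" questions that squadv2 adds.
unanswerable = [ex for ex in samples if not is_answerable(ex)]

print(len(answerable), "answerable,", len(unanswerable), "unanswerable")
```

A model trained on squad alone never sees the second kind of record, which may be worth keeping in mind when reading squadv2 scores for the retain set.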

@TheRootOf3 Reconsider adding toxigen as well as real toxicity prompts.