Raw extraction

Extracts hidden states without applying templates or making contrast tuples
Can be used with eval by specifying the magic dataset name “raw” and passing --data_dir
- Doesn't support few-shot examples; balancing is on by default (though it's now optional for everything); no streaming (enforced in PromptConfig's __post_init__)
Adds support for inference without contrast tuples in Reporter
- Renames score to score_contrast_tuple
- I'm not sure whether I should instead make them a single function that does different things depending on the shape of the input
The dataset provided via --data_dir must have a string column “text” and a binary column “label”, and it must not have any splits
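The column requirement above can be checked before running extraction. This is a minimal sketch, not code from this PR: `validate_raw_rows` is a hypothetical helper, and it assumes the dataset has already been loaded as a sequence of plain Python dicts with labels stored as ints.

```python
from typing import Any, Mapping, Sequence


def validate_raw_rows(rows: Sequence[Mapping[str, Any]]) -> None:
    """Raise ValueError unless every row has a string "text" and a 0/1 "label".

    Hypothetical helper for illustration; not part of elk.
    """
    for i, row in enumerate(rows):
        text = row.get("text")
        label = row.get("label")
        if not isinstance(text, str):
            raise ValueError(f"row {i}: 'text' must be a string")
        # Require a plain int so that True/False and 1.0 are rejected too.
        if type(label) is not int or label not in (0, 1):
            raise ValueError(f"row {i}: 'label' must be 0 or 1")


rows = [
    {"text": "The sky is blue.", "label": 1},
    {"text": "Two plus two is five.", "label": 0},
]
validate_raw_rows(rows)  # passes silently for well-formed rows
```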
In this mode, the total log-probability that the LM assigns to the text is also computed
- That way you can perform ~whatever analyses you want by defining the input dataset and reading the output CSV
- I prepend tokenizer.bos_token to the input so that I can compute this. Will this always work and be in distribution?
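To make the role of the BOS token concrete: a causal LM only scores a token given the tokens before it, so without a leading BOS the first real token would get no score. Below is a rough sketch of the computation, assuming the per-position logits are already available as nested lists. `total_logprob` is a hypothetical name and this is not the code in this PR.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def total_logprob(
    logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> float:
    """Sum of log P(token_ids[t + 1] | token_ids[: t + 1]) over the sequence.

    Assumes token_ids[0] is BOS and logits[t] is the model output at position
    t, i.e. the prediction for the token at position t + 1.
    """
    if len(logits) != len(token_ids):
        raise ValueError("need one logit vector per input token")
    total = 0.0
    # The last position predicts a token beyond the text, so it is skipped.
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


# Toy case: vocabulary of 3, uniform logits, BOS id 0 followed by two tokens.
uniform = [[0.0, 0.0, 0.0]] * 3
example_total = total_logprob(uniform, [0, 1, 2])  # two tokens, each log(1/3)
```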
Adds a base_fingerprint argument to the builder, which reads the fingerprint of the raw dataset so that caching keeps working correctly as the raw datasets are modified
Adds support for saving the predictions to an output directory with --preds_out_dir
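For reference, here is a sketch of what writing and reading back a predictions CSV could look like. The file name, the column names (`text`, `label`, `pred`, `lm_logprob`) and the `save_preds` helper are all assumptions for illustration; the actual output layout is whatever the PR's code writes.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Union


def save_preds(
    rows: List[Dict[str, Union[str, int, float]]],
    out_dir: Union[str, Path],
    name: str = "preds.csv",
) -> Path:
    """Write one CSV row per example into out_dir and return the file path.

    Hypothetical helper; the column names come from the first row's keys.
    """
    if not rows:
        raise ValueError("no predictions to save")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


preds = [
    {"text": "The sky is blue.", "label": 1, "pred": 0.5, "lm_logprob": -12.25},
    {"text": "Two plus two is five.", "label": 0, "pred": 0.25, "lm_logprob": -7.5},
]

with tempfile.TemporaryDirectory() as tmp:
    path = save_preds(preds, tmp)
    # Read the file back the way a downstream analysis script might.
    with path.open(newline="") as f:
        loaded = list(csv.DictReader(f))
```

Values come back from `csv.DictReader` as strings, so an analysis script would need to convert the numeric columns itself.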

EleutherAI / elk