jordiclive opened 3 months ago
[Shipped] I'll be working on these; hopefully I can update them tonight:
[Shipped] An initial version of this.
Messy code is here: https://github.com/bigscience-workshop/ShadesofBias/blob/master/create_eval_dataset.py. Apologies for the bit of spaghetti code; virtually everything I'm doing at the moment is "stream of consciousness" so I can move as quickly as possible between the different tasks at hand. =)
@meg-huggingface nice, I used your logic here: https://github.com/bigscience-workshop/ShadesofBias/blob/master/example_logprob_evaluate.py. It iterates through the model_list and waits for each endpoint to wake up, so hopefully it can be run in one go.
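For reference, the wake-up-and-retry idea can be sketched like this (a minimal, hypothetical version; `call_until_awake` and the fake endpoint are illustrative names, not the actual functions in example_logprob_evaluate.py):

```python
import time

def call_until_awake(query_fn, max_retries=5, wait_s=1):
    """Call query_fn until it stops raising. HF Inference Endpoints
    scaled to zero return 503 "model loading" while waking up, which
    the client surfaces as an error; we just sleep and retry."""
    for _ in range(max_retries):
        try:
            return query_fn()
        except RuntimeError:  # stand-in for the endpoint's "still loading" error
            time.sleep(wait_s)
    raise TimeoutError("endpoint never woke up")

# Usage with a fake endpoint that only succeeds on the third call.
calls = {"n": 0}
def fake_endpoint():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503: model loading")
    return {"generated_text": "Y"}

print(call_until_awake(fake_endpoint, wait_s=0))  # → {'generated_text': 'Y'}
```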
Evaluation
[x] Code to clean up Dataset/Map for HF release
[x] Add code to constrain generation to just a few tokens, e.g. Y/N, and retrieve the probability ([constraint](https://github.com/bigscience-workshop/ShadesofBias/commit/2aa441ef65c411520da5798edd64a34ee10195c6)). Models may be biased to answer Y, so may want free generation.
[x] Finalize Base Model List (Bloom, llama3, Qwen, mt5, PolyLM)
[x] Finalize Aligned Model List (BloomZ, llama3-instruct, Qwen-chat, mt0, Cohere CMD, Cohere Aya) + (GPT-4/4o, Claude Opus, Gemini)
[ ] Run biased variations for Base Models + apply Min-Max scaling
[x] Output the predictions/log probs to prediction files, so that the Inference and Evaluation are separate.
[x] Aligned Model Evaluation Design (ensure necessary information can be extracted with Endpoint API, also consideration of closed APIs)
[ ] Conduct any other insightful evaluations for the paper leveraging the full dataset
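On the constrained Y/N item above: the core trick is to restrict the comparison to the two candidate tokens' logits and renormalize, instead of softmaxing over the whole vocabulary. A minimal sketch (the function name and toy logits are illustrative, not the repo's actual code):

```python
import math

def yn_probability(next_token_logits, yes_id, no_id):
    """P(Y) when generation is constrained to exactly {Y, N}:
    softmax over just those two tokens' logits."""
    y, n = next_token_logits[yes_id], next_token_logits[no_id]
    m = max(y, n)                       # subtract max for numerical stability
    ey, en = math.exp(y - m), math.exp(n - m)
    return ey / (ey + en)

# Toy 3-token vocab: token 1 = "Y", token 2 = "N".
logits = [0.1, 2.0, 1.0]
p_yes = yn_probability(logits, yes_id=1, no_id=2)  # ≈ 0.731
```

With a real model you would take `next_token_logits` from the last position of the model's output logits for the prompt.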
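The Min-Max scaling step for the base-model runs is standard; a quick sketch in case it's useful (hypothetical helper, not the repo's code):

```python
def min_max_scale(scores):
    """Rescale a list of scores to [0, 1] so that bias scores are
    comparable across base models with different logit ranges."""
    lo, hi = min(scores), max(scores)
    if hi == lo:                 # constant scores: avoid division by zero
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

print(min_max_scale([2, 4, 6]))  # → [0.0, 0.5, 1.0]
```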
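On separating inference from evaluation: writing predictions/log probs as JSONL is one simple way to do it, so evaluation scripts only re-read the file and never re-hit the endpoints. A hedged sketch (field names like `logprob_yes` are made up for illustration):

```python
import json

def write_predictions(path, records):
    """One JSON object per line; evaluation later just re-reads this
    file instead of re-running inference."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

write_predictions("preds.jsonl", [
    {"model": "bloom", "example_id": 0, "logprob_yes": -0.31},
    {"model": "bloom", "example_id": 1, "logprob_yes": -1.27},
])
```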