UKGovernmentBEIS / inspect_ai

Inspect: A framework for large language model evaluations

https://inspect.ai-safety-institute.org.uk/

MIT License

567 stars, 98 forks

issues

Add `interactive` option to `web_browser()` for disabling interactive tools (clicking, typing, and submitting forms)

#642 jjallaire-aisi closed 15 minutes ago
0
For `basic_agent()`, defer to task `max_messages` if None is specified for the agent

#641 jjallaire-aisi closed 2 hours ago
0
Improve prompting for Python tool to emphasise need to print output

#640 jjallaire-aisi closed 3 hours ago
0
Provide token usage and raw model API calls for OpenAI o1-preview

#639 jjallaire closed 6 hours ago
0
Improved implementation of disabling parallel tool calling

#638 jjallaire closed 16 hours ago
0
Add dataset support for `.tsv` files

#637 manifoldhiker closed 6 hours ago
0
Web Browser tool cannot see checkbox state

#636 sdtblckgov opened 18 hours ago
0
improve quality of error messages when a model API key environment variable is missing

#635 jjallaire-aisi closed 19 hours ago
0
Display Target in ScoreEventView

#634 dragonstyle closed 20 hours ago
0
Provide setter for `max_messages` on `TaskState`

#633 jjallaire-aisi closed 20 hours ago
0
Rename `web_browser_tools()` to `web_browser()`, and don't export individual web browsing tools

#632 jjallaire-aisi closed 23 hours ago
0
improve prompting/descriptions for web_browser tools

#631 jjallaire-aisi closed 1 day ago
0
fix issue with failure to execute sample setup scripts

#630 jjallaire-aisi closed 1 day ago
0
allow chat evaluations using LLM-as-a-judge

#629 ProtD opened 1 day ago
0
Improve prompting for `</tool_call>` end sequence for Llama models

#628 jjallaire-aisi closed 1 day ago
0
move evals into inspect_evals package

#627 jjallaire-aisi closed 1 day ago
0
`auto_id` option for dataset readers to assign an auto-incrementing ID to records

#626 jjallaire-aisi closed 2 days ago
0
Error with `setup` field in Sample

#625 XkunW opened 2 days ago
9
task args: don't attempt to serialise registry objects that don't have captured parameters

#624 jjallaire-aisi closed 2 days ago
0
remove api_key from model_args

#623 jjallaire-aisi closed 2 days ago
0
gaia tweaks

#622 jjallaire-aisi closed 2 days ago
0
Migrate MMLU_Pro to Inspect Evals

#621 dragonstyle closed 2 days ago
0
Migrate MMLU to Inspect Evals

#620 dragonstyle closed 2 days ago
0
Adding the official GAIA scorer && Changing default arguments on gaia_dataset

#619 max-kaufmann closed 2 days ago
0
Inconsistent Sample IDs produced when shuffle=True

#618 evanmiller-anthropic closed 2 days ago
3
Migrate AGIEval to Inspect Evals

#617 dragonstyle closed 2 days ago
0
Dynamically allocating solver to a task gives ValueError in inspect ≥0.3.29 but not in ≤0.3.28

#616 sohaibimran7 closed 2 days ago
2
gdm_capabilities to inspect_evals

#615 jjallaire closed 2 days ago
0
Update correct aisi.gov.uk link in the top level README

#614 max-kaufmann closed 2 days ago
0
Update plan -> solver in GAIA README

#613 max-kaufmann closed 2 days ago
0
Migrate MBPP to Inspect Evals

#612 dragonstyle closed 2 days ago
0
Migrate PIQA to Inspect Evals

#611 dragonstyle closed 2 days ago
0
Migrate PubmedQA to Inspect Evals

#610 dragonstyle closed 2 days ago
0
Migrate Squad benchmark to Inspect Evals

#609 dragonstyle closed 2 days ago
0
Migrate Truthful QA to Inspect Evals

#608 dragonstyle closed 2 days ago
0
Migrate Winogrande benchmark to Inspect Evals

#607 dragonstyle closed 2 days ago
0
Migrate xstest benchmark to Inspect Evals

#606 dragonstyle closed 2 days ago
0
Remove already migrated Mathvista

#605 dragonstyle closed 2 days ago
0
Migrate MATH benchmark to Inspect Evals

#604 dragonstyle closed 2 days ago
1
Migrate ifeval benchmark to Inspect Evals

#603 dragonstyle closed 2 days ago
0
Migrate human eval to Inspect Evals

#602 dragonstyle closed 2 days ago
0
Migrate hellaswag benchmark to Inspect Evals

#601 dragonstyle closed 2 days ago
3
Migrate GSM8K to Inspect evals

#600 dragonstyle closed 2 days ago
0
Migrate commonsenseqa benchmark to evals

#599 dragonstyle closed 2 days ago
0
Migrate boolq to inspect_evals

#598 dragonstyle closed 2 days ago
0
don't use version tag for inspect_web_browser

#597 jjallaire-aisi closed 2 days ago
0
Fix HuggingFace dataset kwargs type

#596 MSchmatzAISI closed 2 days ago
0
Feature/web browser

#595 MariaIzobava closed 2 days ago
0
Always preserve first metadata value when reducing scores

#594 dragonstyle closed 3 days ago
2
support service prefixes for anthropic models

#593 jjallaire-aisi closed 3 days ago
0