issues
UKGovernmentBEIS/inspect_ai
Inspect: A framework for large language model evaluations
https://inspect.ai-safety-institute.org.uk/
MIT License
567 stars
98 forks
Add `interactive` option to `web_browser()` for disabling interactive tools (clicking, typing, and submitting forms)
#642
jjallaire-aisi
closed
15 minutes ago
0
For `basic_agent()`, defer to task `max_messages` if None is specified for the agent
#641
jjallaire-aisi
closed
2 hours ago
0
Improve prompting for Python tool to emphasise need to print output
#640
jjallaire-aisi
closed
3 hours ago
0
Provide token usage and raw model API calls for OpenAI o1-preview
#639
jjallaire
closed
6 hours ago
0
Improved implementation of disabling parallel tool calling
#638
jjallaire
closed
16 hours ago
0
Add dataset support for `.tsv` files
#637
manifoldhiker
closed
6 hours ago
0
Web Browser tool cannot see checkbox state
#636
sdtblckgov
opened
18 hours ago
0
improve quality of error messages when a model API key environment variable is missing
#635
jjallaire-aisi
closed
19 hours ago
0
Display Target in ScoreEventView
#634
dragonstyle
closed
20 hours ago
0
Provide setter for `max_messages` on `TaskState`
#633
jjallaire-aisi
closed
20 hours ago
0
Rename `web_browser_tools()` to `web_browser()`, and don't export individual web browsing tools
#632
jjallaire-aisi
closed
23 hours ago
0
improve prompting/descriptions for web_browser tools
#631
jjallaire-aisi
closed
1 day ago
0
fix issue with failure to execute sample setup scripts
#630
jjallaire-aisi
closed
1 day ago
0
allow chat evaluations using LLM-as-a-judge
#629
ProtD
opened
1 day ago
0
Improve prompting for `</tool_call>` end sequence for Llama models
#628
jjallaire-aisi
closed
1 day ago
0
move evals into inspect_evals package
#627
jjallaire-aisi
closed
1 day ago
0
`auto_id` option for dataset readers to assign an auto-incrementing ID to records
#626
jjallaire-aisi
closed
2 days ago
0
Error with `setup` field in Sample
#625
XkunW
opened
2 days ago
9
task args: don't attempt to serialise registry objects that don't have captured parameters
#624
jjallaire-aisi
closed
2 days ago
0
remove api_key from model_args
#623
jjallaire-aisi
closed
2 days ago
0
gaia tweaks
#622
jjallaire-aisi
closed
2 days ago
0
Migrate MMLU_Pro to Inspect Evals
#621
dragonstyle
closed
2 days ago
0
Migrate MMLU to Inspect Evals
#620
dragonstyle
closed
2 days ago
0
Adding the official GAIA scorer && Changing default arguments on gaia_dataset
#619
max-kaufmann
closed
2 days ago
0
Inconsistent Sample IDs produced when shuffle=True
#618
evanmiller-anthropic
closed
2 days ago
3
Migrate AGIEval to Inspect Evals
#617
dragonstyle
closed
2 days ago
0
Dynamically allocating solver to a task gives ValueError in inspect ≥0.3.29 but not in ≤0.3.28
#616
sohaibimran7
closed
2 days ago
2
gdm_capabilities to inspect_evals
#615
jjallaire
closed
2 days ago
0
Update correct aisi.gov.uk link in the top level README
#614
max-kaufmann
closed
2 days ago
0
Update plan -> solver in GAIA README
#613
max-kaufmann
closed
2 days ago
0
Migrate MBPP to Inspect Evals
#612
dragonstyle
closed
2 days ago
0
Migrate PIQA to Inspect Evals
#611
dragonstyle
closed
2 days ago
0
Migrate PubmedQA to Inspect Evals
#610
dragonstyle
closed
2 days ago
0
Migrate Squad benchmark to Inspect Evals
#609
dragonstyle
closed
2 days ago
0
Migrate Truthful QA to Inspect Evals
#608
dragonstyle
closed
2 days ago
0
Migrate Winogrande benchmark to Inspect Evals
#607
dragonstyle
closed
2 days ago
0
Migrate xstest benchmark to Inspect Evals
#606
dragonstyle
closed
2 days ago
0
Remove already migrated Mathvista
#605
dragonstyle
closed
2 days ago
0
Migrate MATH benchmark to Inspect Evals
#604
dragonstyle
closed
2 days ago
1
Migrate ifeval benchmark to Inspect Evals
#603
dragonstyle
closed
2 days ago
0
Migrate human eval to Inspect Evals
#602
dragonstyle
closed
2 days ago
0
Migrate hellaswag benchmark to Inspect Evals
#601
dragonstyle
closed
2 days ago
3
Migrate GSM8K to Inspect evals
#600
dragonstyle
closed
2 days ago
0
Migrate commonsenseqa benchmark to evals
#599
dragonstyle
closed
2 days ago
0
Migrate boolq to inspect_evals
#598
dragonstyle
closed
2 days ago
0
don't use version tag for inspect_web_browser
#597
jjallaire-aisi
closed
2 days ago
0
Fix HuggingFace dataset kwargs type
#596
MSchmatzAISI
closed
2 days ago
0
Feature/web browser
#595
MariaIzobava
closed
2 days ago
0
Always preserve first metadata value when reducing scores
#594
dragonstyle
closed
3 days ago
2
support service prefixes for anthropic models
#593
jjallaire-aisi
closed
3 days ago
0