FastEval

Fast & more realistic evaluation of chat language models. Includes leaderboard.
https://fasteval.github.io/FastEval/
Apache License 2.0

Tool usage #36

Open tju01 opened 1 year ago

tju01 commented 1 year ago

Something that can measure how well an LLM can deal with tools. CoT already kind of goes in that direction, but not really, since it's limited to a bit of mathematical reasoning rather than actual tool usage.

tju01 commented 1 year ago
  1. https://github.com/OpenBMB/ToolBench works like the Vicuna benchmark and just asks an OpenAI model to evaluate the output. See https://github.com/OpenBMB/ToolBench/tree/master/toolbench/evaluation.
  2. https://github.com/ShishirPatil/gorilla compares the model output to the ground truth using an AST-matching method (see the sketch after this list). See https://github.com/ShishirPatil/gorilla/tree/main/eval and the paper https://arxiv.org/pdf/2305.15334.pdf.
  3. https://github.com/sambanova/toolbench actually executes the used tools, so it works like EvalPlus. See https://github.com/sambanova/toolbench/tree/main/evaluator.
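To illustrate the AST-matching idea from option 2 (this is just my own minimal sketch, not Gorilla's actual implementation), assuming the predicted and ground-truth calls are single valid Python call expressions:

```python
# Minimal sketch of AST-based call matching (not Gorilla's actual code).
import ast

def parse_call(source: str) -> ast.Call:
    tree = ast.parse(source, mode="eval")
    assert isinstance(tree.body, ast.Call), "expected a single function call"
    return tree.body

def calls_match(predicted: str, ground_truth: str) -> bool:
    pred, gt = parse_call(predicted), parse_call(ground_truth)
    # The called function (possibly a dotted name) must be identical.
    if ast.unparse(pred.func) != ast.unparse(gt.func):
        return False
    # Positional arguments must match in order, keyword arguments as a set.
    if [ast.unparse(a) for a in pred.args] != [ast.unparse(a) for a in gt.args]:
        return False
    pred_kwargs = {(k.arg, ast.unparse(k.value)) for k in pred.keywords}
    gt_kwargs = {(k.arg, ast.unparse(k.value)) for k in gt.keywords}
    return pred_kwargs == gt_kwargs

# Keyword order does not matter, so this prints True.
print(calls_match('pipeline(task="translation", model="t5-base")',
                  'pipeline(model="t5-base", task="translation")'))
```
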
tju01 commented 1 year ago

https://github.com/sambanova/toolbench also already has a leaderboard here: https://huggingface.co/spaces/qiantong-xu/toolbench-leaderboard

tju01 commented 1 year ago

Also consider https://github.com/princeton-nlp/intercode

tju01 commented 1 year ago

Also https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks. And see https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/issues/8 for some other papers.

tju01 commented 1 year ago

Current state of my research:

Gorilla seems quite limited to evaluating knowledge about how to call a bunch of ML models on some input data. Could be part of a benchmark for tool usage, but it's just very limited alone. Would prefer to use something that evaluates more general tool-using capabilities.

https://github.com/OpenBMB/ToolBench is also not that useful since it uses Vicuna-benchmark-style evaluation, which can't really deal well with complex reasoning, but we need that.

https://github.com/sambanova/toolbench seems promising.

Intercode also seems promising.

Need to look more at Auto-GPT-Benchmarks and the referenced papers.

tju01 commented 1 year ago

Actually https://github.com/OpenBMB/ToolBench also does some other evaluation. See https://github.com/OpenBMB/ToolBench#model-experiment. It also does LLM grading, but it's not the only thing.

tju01 commented 1 year ago

https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks doesn't really look like what I want here. There isn't much information in their GitHub repository, so I looked at their Discord channel. It seems like https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/tree/master/agbenchmark/challenges is also related. There is only a small number of challenges and it seems very tightly integrated with AutoGPT. It's also very much WIP. Might be useful later, but not for now.

The list of referenced papers here https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/issues/8 might still be useful though.

tju01 commented 1 year ago

https://osu-nlp-group.github.io/Mind2Web/ might be useful.

tju01 commented 1 year ago

I had a look at https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/issues/8. Two papers might be relevant:

In addition, to summarize the other papers from above that might be relevant:

tju01 commented 1 year ago

https://arxiv.org/abs/2302.07842 seems like a good survey, though it is already a bit old.

tju01 commented 1 year ago

https://github.com/night-chen/ToolQA

tju01 commented 1 year ago

https://lilianweng.github.io/posts/2023-06-23-agent/

tju01 commented 1 year ago

API-Bank seems nice but it doesn't seem to be open source...

tju01 commented 1 year ago

https://github.com/thunlp/ToolLearningPapers

tju01 commented 1 year ago

Current things that seem interesting for evaluation:

Some of the papers are more focused on a model than on a benchmark, but they also include information about which benchmarks they evaluate on. Also, sometimes the focus is more on planning than on tool use, but I would like to evaluate both eventually.

Need to filter that list further to find only a few things to implement for now. Haven't even read through the papers above, just had a quick look and filtered other papers based on that.

tju01 commented 1 year ago

It's kind of ok, but the evaluation seems quite limited. They only evaluated a few fine-tuned LLaMA-7B models on some very limited test data using just a comparison against a ground truth (with ROUGE-L or exact match). It's not useless, but I believe the evaluation results are of very limited usefulness.

Though I'm not quite sure yet how exactly evaluation is performed. Apparently for "Machine Evaluation" (the part I'm interested in), the intermediate & final steps are compared to some ground truth. But where does this GT come from (from ChatGPT?) and how do I generate model predictions for some other model?
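
For reference, the two string-comparison metrics mentioned above roughly amount to the following (a generic sketch of my own, not the benchmark's code):

```python
# Exact match and ROUGE-L (LCS-based F1) between a prediction and a reference.
def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())

def rouge_l(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    # Dynamic programming table for the longest common subsequence length.
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(pred)][len(ref)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("search('weather in Paris')", "search('weather in Paris')"))  # 1.0
print(rouge_l("call the weather api first", "call the api"))                    # 0.75
```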

tju01 commented 1 year ago

It's a pain to set up. It requires Java and a bunch of API keys for various services (Google Sheets API, Google Drive API, OpenWeather, The Cat API). And honestly, that's already enough to exclude that option. It would be ok if I were the only one who had to set this stuff up, but I also want other people to be able to evaluate their models themselves, and this kind of setup is just too much.

tju01 commented 1 year ago

From reading the beginning of the paper, it seems quite nice. But the code is not yet fully open source. It seems like it will be open sourced (there was a commit two days ago), but it's not there yet.

tju01 commented 1 year ago

Seems pretty cool. They apparently also have some V2 version here https://github.com/princeton-nlp/attribute-tagging. Going to look more at this later.

tju01 commented 1 year ago

The WebShop V2 paper was apparently presented at the Language and Reinforcement Learning workshop at NeurIPS 2022. So it might be good to have a look at the other papers there to find some other methods.

tju01 commented 1 year ago

I'm not really convinced that "controlling how to use a bunch of models from Hugging Face" is a good benchmark. It's not really what users want to do, it's quite specific instead of being more general, and good automatic evaluation is hard here, so their automatic evaluation ends up quite limited compared to other methods that treat the model like an agent and evaluate it in a simulated environment.

tju01 commented 1 year ago

Not quite sure about this specific environment. It outputs both pixels and the DOM tree. Pixels are no good because we deal with models that can only handle natural language. The DOM tree would be ok-ish, but I'm not sure whether one should expect existing models to be able to deal with it without further processing of the DOM. In https://github.com/posgnu/rci-agent they come up with a new model (DOMNET) "specifically designed to perform flexible relational reasoning over the tree-structured HTML representation of websites", and they also train in the environment, so they can deal with it. But I'm interested in evaluating existing, more general language models that were not specifically trained on this environment, so an environment that is more focused on that would be good.
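
For context, "further processing of the DOM" could be as simple as flattening the page into a textual list of interactive elements before prompting the model. A hypothetical sketch (not taken from any of the repos above):

```python
# Hypothetical example of "further processing of the DOM": flatten an HTML
# page into a short textual list of interactive elements that a plain
# language model could be prompted with.
from html.parser import HTMLParser

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

class InteractiveElementLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            attr_dict = dict(attrs)
            label = attr_dict.get("aria-label") or attr_dict.get("id") or ""
            self.elements.append(f"[{len(self.elements)}] <{tag}> {label}".strip())

page_html = '<form><input id="query"><button aria-label="Search">Go</button></form>'
lister = InteractiveElementLister()
lister.feed(page_html)
print("\n".join(lister.elements))
# [0] <input> query
# [1] <button> Search
```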

Anyway. This specific environment doesn't seem to be good.

tju01 commented 1 year ago

Also, Gym contains some text environments by default, but I don't think they are useful. Environments that other people have built, like the WebShop environment above, might absolutely be useful; just not the built-in ones.

tju01 commented 1 year ago

This paper presents a model (or rather a method to obtain a model), not a benchmark. So the more interesting part is what they evaluate on:

  1. GSM8K-XL with a calculator
  2. Knowledge-based Question Answering with access to database APIs
  3. Actions for the VirtualHome environment.

The evaluation is pretty ok, though their code looks kind of hard to use. It could be done, but I would prefer a paper that presents a method I could use more easily, if one exists.
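
As an illustration of the first setting (my own sketch with a made-up tool-call syntax, not the paper's code): the harness could let the model emit calls like `<calc>...</calc>`, execute them, and check the final number against the reference answer.

```python
# Minimal sketch of a calculator-tool evaluation loop (hypothetical tool-call
# syntax, not the paper's actual format).
import re

CALC_PATTERN = re.compile(r"<calc>(.*?)</calc>")

def resolve_calculator_calls(model_output: str) -> str:
    """Replace every <calc>...</calc> span with its evaluated result."""
    def evaluate(match):
        expression = match.group(1)
        # Only allow basic arithmetic characters; a real harness should use a
        # proper expression parser instead of eval.
        if not re.fullmatch(r"[\d+\-*/(). ]+", expression):
            return "<error>"
        return str(eval(expression))
    return CALC_PATTERN.sub(evaluate, model_output)

model_output = "She has <calc>12 * (3 + 4)</calc> apples in total."
reference_answer = "84"
resolved = resolve_calculator_calls(model_output)  # "She has 84 apples in total."
print(resolved, reference_answer in resolved)      # ... True
```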

tju01 commented 1 year ago

While the tools are somewhat diverse, they only evaluated on ScienceQA and TabMWP. Those don't seem easy, though, and the tool usage does seem somewhat sophisticated. It also seems to be well explained and well documented, so this might be good.

tju01 commented 1 year ago

At least in the original paper, they evaluate it on a bunch of classical benchmarks. That's ok I guess, and they do select the benchmarks where the tooling actually helps.

I guess in the end that will also be part of my solution. Though I think it would be nice to have some more benchmarks that are specifically targeted at evaluating tool-use abilities.

tju01 commented 1 year ago

Seems to be quite specific to evaluating models that have been specifically trained using their method. Doesn't really evaluate more general models. I guess one could still make it work by rewriting the test data in some way and prompting the models in some way, but that's more generally true and I would prefer something simpler.

tju01 commented 1 year ago

It would certainly be pretty cool to evaluate a language model based on how well it can act as a planner for Minecraft. But I think other things have priority, even though this would be pretty nice to have long term.

tju01 commented 1 year ago

Seems pretty cool from a quick look. Need to look closer at it later. The wiki env might be interesting since I think retrieval might be one of the most important tools. I guess it would also relate to memory if we learn to retrieve not just from Wikipedia but also from our own past or so. Not sure.
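
Something like the following toy version is what I mean by retrieval as a tool (an in-memory stand-in for the Wikipedia backend; all names here are made up):

```python
# Toy retrieval tool in the spirit of the wiki env: the agent can search for
# article titles and look up their contents. Everything here (class name,
# articles, scoring) is a made-up stand-in, not the actual environment.
from dataclasses import dataclass

@dataclass
class ToyWikiEnv:
    articles: dict  # title -> text

    def search(self, query: str, top_k: int = 1) -> list:
        """Rank articles by naive word overlap with the query."""
        query_words = set(query.lower().split())
        scored = sorted(
            self.articles.items(),
            key=lambda item: len(query_words & set(item[1].lower().split())),
            reverse=True,
        )
        return [title for title, _ in scored[:top_k]]

    def lookup(self, title: str) -> str:
        return self.articles.get(title, "No such article.")

env = ToyWikiEnv({
    "Python (programming language)": "Python is a high level programming language",
    "Monty Python": "Monty Python was a British comedy group",
})
# The agent would emit actions like search[...] / lookup[...] and get the
# observation appended to its context before deciding on the next step.
titles = env.search("high level programming language")
print(titles, "->", env.lookup(titles[0]))
```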

tju01 commented 1 year ago

Yes, kind of. But I think other things are better than that VirtualHome environment.

tju01 commented 1 year ago

Ok, so to summarize the options that I still consider for now:

tju01 commented 1 year ago

Had a quick look at the papers here https://larel-workshop.github.io/papers/ and either there is no code for a paper or it's not really relevant for what I'm trying to do. So that option is out.

tju01 commented 1 year ago

AgentBench: https://arxiv.org/pdf/2308.03688.pdf