Correctness of API Simulator

Thank you for your question.

We believe that an API simulator does not need to replicate real API outputs exactly; rather, it should provide reasonable responses. For instance, when asked for today's weather, a simulator need not fetch the actual temperature. Instead, it should produce a plausible temperature value. The term "correctness" may not be entirely appropriate here, since any reasonable temperature can be deemed correct.
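As a rough illustration of this point (not the simulator from the paper, which is LLM-based), here is a minimal Python sketch of a simulated weather endpoint judged by plausibility rather than by agreement with a real reading. The function names, the response fields, and the temperature range are all hypothetical choices made for this example.

```python
import random
from typing import Optional

# Assumed plausible range for surface air temperature in Celsius.
# This range is an illustrative choice, not a value from the paper.
PLAUSIBLE_RANGE_C = (-40.0, 50.0)


def simulate_weather(city: str, seed: Optional[int] = None) -> dict:
    """Return a plausible (not real) weather response for `city`."""
    rng = random.Random(seed)
    low, high = PLAUSIBLE_RANGE_C
    return {"city": city, "temperature_c": round(rng.uniform(low, high), 1)}


def is_plausible(response: dict) -> bool:
    """Accept any temperature inside the assumed range, not one 'true' value."""
    low, high = PLAUSIBLE_RANGE_C
    return low <= response["temperature_c"] <= high


response = simulate_weather("Beijing", seed=0)
print(response, is_plausible(response))
```

The check accepts a whole range of answers, which is the sense in which any reasonable temperature counts as correct.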

In our paper, we conduct a "Turing test" to illustrate that outputs from LLM-based simulations are virtually indistinguishable from real API responses, and that the diversity of these simulations mirrors that of actual APIs.

We agree that evaluating the tool learning capabilities of LLMs is crucial for a benchmark. Nevertheless, it is also important that these simulations operate within a reliable framework that is realistic, or nearly so. Hence, we carefully test the verisimilitude of our simulations.

THUNLP-MT / StableToolBench