ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0
11.53k stars · 1.01k forks

[Apibench] Wrong number of APIs for TF and HF #130

Closed EscVM closed 1 year ago

EscVM commented 1 year ago

Hi! Thank you so much for the code and data. I think it is a very inspiring and forward-looking work.

I'm opening this issue because I have a concern/doubt about the real number of APIs for the TF and HF datasets. In the paper, you give the following statistics:

Indeed, counting the number of lines in the three published datasets, I obtain similar numbers:

Nevertheless, when I inspected the datasets I discovered that most of the lines do not differ in domain or api_call, but only in meaningless details, such as the formatting of the example_code or different python_environment_requirements. So, my question is: why are you counting those as different APIs? Below I leave the statistics computed considering only lines with distinct domain and api_call, plus an example.

Computed statistics

Example with two lines different for small details

First:

{'domain': 'Text embedding',
 'framework': 'TensorFlow Hub',
 'functionality': 'Embed text data',
 'api_name': 'universal-sentence-encoder',
 'api_call': "hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')",
 'api_arguments': ['input_text'],
 'python_environment_requirements': ['tensorflow', 'tensorflow_hub'],
 'example_code': "import tensorflow_hub as hub; embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4'); embeddings = embed(['Hello world!']); print(embeddings)",
 'performance': {'dataset': 'STS benchmark', 'accuracy': '0.78'},
 'description': 'The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.'}

Second:

{'domain': 'Text embedding',
 'framework': 'TensorFlow Hub',
 'functionality': 'Embedding text into high-dimensional vectors',
 'api_name': 'universal-sentence-encoder',
 'api_call': "hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')",
 'api_arguments': ['input_text'],
 'python_environment_requirements': ['tensorflow', 'tensorflow_hub'],
 'example_code': "import tensorflow as tf\nimport tensorflow_hub as hub\nembed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')\nembeddings = embed(['Hello, world!', 'How are you?'])",
 'performance': {'dataset': 'STS benchmark', 'accuracy': '0.803'},
 'description': 'The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.'}
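The duplication pattern above can be checked with a short script. This is a minimal sketch that dedupes on the `(domain, api_call)` key used in the analysis, using abbreviated copies of the two records quoted above (in practice you would `json.loads` each line of the published dataset file):

```python
# Two records that share domain and api_call but differ only in
# example_code / performance (abbreviated from the pair quoted above).
records = [
    {"domain": "Text embedding",
     "api_call": "hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')",
     "performance": {"dataset": "STS benchmark", "accuracy": "0.78"}},
    {"domain": "Text embedding",
     "api_call": "hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')",
     "performance": {"dataset": "STS benchmark", "accuracy": "0.803"}},
]

# Count distinct APIs by the (domain, api_call) pair.
unique_apis = {(r["domain"], r["api_call"]) for r in records}
print(len(records), len(unique_apis))  # 2 raw lines, but only 1 distinct API
```

Run over the full TF split, the same one-liner is what produces the much smaller "distinct API" counts reported above.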
maximrepidgey commented 1 year ago

Hey, I am also looking into the API dataset, and I noticed the same issues. You did a great statistical analysis. I have also noticed that the authors declare that the HF dataset contains 36 domains in total, distributed as follows: 7 domains in multimodal data, 8 in CV, 12 in NLP, 5 in audio, 2 in tabular data, and 2 in reinforcement learning. Given that every domain should have 20 models, we would therefore expect 140, 160, 240, 100, 40, and 40 models respectively (720 in total). Instead, we get 132, 229, 375, 129, 48, and 23 (936 in total). The declared and actual per-domain model counts do not match at all. Moreover, the total number of models the authors declare is 925, which matches neither figure: the actual total is 936, not 925.
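The per-domain check behind those numbers can be sketched like this. It is a toy stand-in, assuming each dataset line has a `domain` field (as in the records quoted earlier); the domain names and counts here are hypothetical, not the real HF split:

```python
from collections import Counter

# Toy stand-in for the HF split: in the real check each line of the
# dataset file is parsed with json.loads and its 'domain' field read.
domains = (["Image Classification"] * 25   # hypothetical over-represented domain
           + ["Text Classification"] * 20  # matches the declared 20 models/domain
           + ["Tabular Regression"] * 12)  # hypothetical under-represented domain

counts = Counter(domains)

# Flag every domain whose model count deviates from the declared 20.
mismatched = {d, } if False else {d: n for d, n in counts.items() if n != 20}
print(mismatched)  # only the domains that break the "20 models each" claim
```

Summing `counts.values()` over the real file is what yields 936 rather than the declared 925.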

EscVM commented 1 year ago

Thank you, @maximrepidgey, for the additional data. I have further insights too: many APIs come with example instructions that cannot tell one API apart from another. In cases like that, a shortlister has no hope of working well and providing useful examples 😊

This is an example from the TH (Torch Hub) split:

API 18: torch.hub.load(repo_or_dir='facebookresearch/WSL-Images', model='resnext101_32x32d_wsl', pretrained=True)
API 19: torch.hub.load(repo_or_dir='facebookresearch/WSL-Images', model='resnext101_32x48d_wsl', pretrained=True)

These two APIs differ only in the capacity of the ResNeXt models (32x32d vs. 32x48d). However, these are the example instructions that are supposed to correspond to the two XD:

API 18:

'I want to determine what objects are in an image file. Find me a model API that can classify the objects in the image.',
 'Suggest an API that can identify the species of a bird from an image taken in Yosemite National Park.',
 'A researcher needs to categorize real world objects in images using machine learning. Provide a suitable API for image classification.',
 'Find me an pretrained model for image classification that has scored 85% or more Top-1 accuracy on ImageNet.',
 'Help a startup to develop a recommendation system to identify plants in their garden that visitors can take pictures of, suggest an API that can be used for plant identification from a picture.',
 'I need an API that can classify a given image into one of the 1000 classes like cars, dogs or flowers.',
 'I am working on a computer vision task for my company, and I would like a well-performing machine learning API that can provide correct accuracy in classifying images of objects.',
 'I run an online marketplace where users want to categorize their items based on images. What API should they use for this?',
 "Help me discover an API suitable for classifying images in a custom dataset using Instagram users' preferences."

API 19:

'I am a researcher working on a computer vision project and need a cutting-edge pre-trained image classification API. What do you suggest?',
 'We are building an image classifier for our meme sharing platform. Suggest an API for high-accuracy image classification.',
 'Tell me an API that can accurately classify a wide range of images into different categories.',
 'I want an API to classify birds into different species using images. Can you provide me a code snippet to load a pre-trained model that can perform this task?',
 'I am a developer at an online store, trying to classify the objects in the images provided for each product. Suggest an AI API which can be utilized for this purpose.',
 'Can you suggest an ML API capable of predicting whether a given image is a cat, dog or other animal with high accuracy?',
 'Design an API that can classify images into different categories using a pretrained model.',
 'Propose an API that can recognize objects in an image with high accuracy.',
 'A marketing company needs to clasify images of different catagories for a client. Can you provide an API to achive this?'
ShishirPatil commented 1 year ago

Hey @EscVM and @maximrepidgey, thank you for your interest! Yes, so we collect these APIs and then "clean" them up. Cleaning here means removing APIs that have no documentation, decoupling a family of APIs into individual APIs, etc. Let us know if you have any other questions. Also, thank you so much for the detailed analysis! In case you have suggestions for the README in APIBench, we would welcome PRs :)

EscVM commented 1 year ago

Hi @ShishirPatil! Thank you for taking the time to reply to this thread, and sorry for not getting back to you earlier.

I am reopening the ticket because I think none of the problems highlighted in the previous posts have been addressed. In particular: