HowieHwong / MetaTool

[ICLR 2024] MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
MIT License

Question for "Construct prompt data" #7

Open jianguoz opened 7 months ago

jianguoz commented 7 months ago

Hi @HowieHwong , thanks for sharing the excellent work.

We are working on evaluating our models on your benchmark. However, we are having difficulty serving Milvus and Docker on Google Cloud (GCP) pods. Would it be possible for you to share both your constructed prompt data and the database with us?

Thank you, and we look forward to hearing from you!

HowieHwong commented 7 months ago

Hi,

I have uploaded the datasets to dataset/tmp_dataset. You can check them there.


jianguoz commented 7 months ago

@HowieHwong Thanks for sharing the datasets!

Could you tell us how to obtain the experimental results in Table 3 and Table 4 after we finish running `sh src/generation/run.sh`? There seem to be no instructions in the README.

Thanks:)

jianguoz commented 7 months ago

@HowieHwong Another issue: in https://github.com/HowieHwong/MetaTool/issues/4#issuecomment-1888824317 and also in the paper, the number of examples for Task 1 is given as **1030**. However, that differs from the counts in tmp_dataset; the statistics are below.

Task1.json: 1040
Task2-Subtask1.json: 995
Task2-Subtask2.json: 1800
Task2-Subtask3.json: 995
Task2-Subtask4.json: 497
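
(These counts came from a quick script like the following; the assumption that each file is a top-level JSON array of examples is ours.)

```python
import json
import os

# Path and file names as reported in this thread.
DATA_DIR = "dataset/tmp_dataset"
FILES = [
    "Task1.json",
    "Task2-Subtask1.json",
    "Task2-Subtask2.json",
    "Task2-Subtask3.json",
    "Task2-Subtask4.json",
]

for name in FILES:
    with open(os.path.join(DATA_DIR, name), encoding="utf-8") as f:
        data = json.load(f)  # assumed to be a JSON list of examples
    print(f"{name}: {len(data)}")
```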

Could you take a further look at whether there are any potential issues in the generation for *all* tasks? Thanks!

jianguoz commented 7 months ago

Hi @HowieHwong, good afternoon! Just checking in: is there any update on the raised issues?

HowieHwong commented 7 months ago

> Hi @HowieHwong, good afternoon! Just checking in: is there any update on the raised issues?

Hi, sorry for the late reply! I will check the number of examples in Task 1 and get back to you as soon as possible. As for the results in Tables 3 & 4, we haven't uploaded the metric calculation code yet; if it's needed, I will upload it when I'm free. Another way to get the results is to use ChatGPT to automatically extract the answers; see the description in the appendix of our paper, and the sketch below.
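
Roughly, the idea is something like the following. This is a minimal sketch, not our exact script; the prompt wording is illustrative, and it assumes the `openai` Python client (v1+) with `OPENAI_API_KEY` set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt wording -- not the exact prompt from the paper's appendix.
EXTRACTION_PROMPT = (
    "Below is a model's response in a tool-selection benchmark. "
    "Reply with only the name of the tool the response chose, "
    "or 'none' if it decided no tool was needed.\n\n"
    "Response:\n{generation}"
)

def extract_answer(generation: str) -> str:
    """Use ChatGPT to pull the chosen tool name out of a free-form generation."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(generation=generation)}],
        temperature=0,  # deterministic extraction
    )
    return completion.choices[0].message.content.strip()
```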

jianguoz commented 7 months ago

@HowieHwong Thanks for your response :) Looking forward to your update on the dataset counts. We also find that generation takes a couple of days to finish, even with a Llama-7b-chat model on a GPU.

Besides, it seems very hard to reproduce the metrics from the description in Appendix C.2 (ANSWER MATCHING) alone. Therefore, to compare fairly with the baselines, we would appreciate it if you could upload the calculation code.
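
For example, our current best guess at the C.2 procedure is a naive substring match like the one below; this is purely our assumption, which is exactly why the official code would help.

```python
def match_answer(generation: str, gold_tool: str, candidate_tools: list[str]) -> bool:
    """Count a prediction as correct when the gold tool name appears in the
    generation and no other candidate tool name does (case-insensitive).
    Our guess at the spirit of Appendix C.2, not the official metric."""
    text = generation.lower()
    mentioned = {t.lower() for t in candidate_tools if t.lower() in text}
    return mentioned == {gold_tool.lower()}
```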

HowieHwong commented 7 months ago

> We also find that generation takes a couple of days to finish, even with a Llama-7b-chat model on a GPU.

We generated the Llama-7b-chat results on GPU within several hours. The hardware we used was 2× A800 80GB.