TianjinYellow / EdgeDeviceLLMCompetition-Starting-Kit


error: 'no predictions found' #7

Open sriyachakravarthy opened 1 month ago

sriyachakravarthy commented 1 month ago

Hi! We tried evaluating the base models using the starting kit evaluation pipeline. Here are some points/issues:

  1. For the phi2 and llama models, we are getting a 'no predictions found' error.
  2. Could you give us an idea of the inference time with respect to GPU memory and model size?
TianjinYellow commented 1 month ago

Hi,

I tested the repository just now on one A100 GPU and it works well. I could not reproduce the errors you reported.

(Screenshot, 2024-07-31 07:38: successful evaluation run)

  1. Please check that the data is downloaded and placed in the folder that contains the configs and human_eval folders.
  2. One of the quickest fixes is to uninstall the previously installed opencompass, then git clone the repository and re-install it (a command sketch follows below).
  3. For the commonsense_qa task with phi2, evaluation takes about 27 GB of GPU memory and around 15 minutes.
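For anyone following along, here is a minimal command sketch for step 2, assuming opencompass was installed with pip and that the repository to re-clone is this starting kit. The URL is inferred from the repository name in this thread, and the install layout may differ in the actual kit:

```bash
# Remove the previously installed opencompass so it cannot shadow the re-cloned copy
pip uninstall -y opencompass

# Re-clone the starting kit (URL inferred from the repository name in this thread)
git clone https://github.com/TianjinYellow/EdgeDeviceLLMCompetition-Starting-Kit.git
cd EdgeDeviceLLMCompetition-Starting-Kit

# Re-install opencompass in editable mode from the fresh checkout
# (adjust the path if opencompass lives in a subfolder of the kit)
pip install -e .
```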

Best,

sriyachakravarthy commented 1 month ago

Hi! Is this the correct directory structure?

(Screenshot, 2024-07-31 12:30: directory layout)
TianjinYellow commented 1 month ago

Yes. Additionally, when I run the evaluation, I set this directory as the workspace as well.
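Since the screenshot of the layout is not preserved here, a rough sketch of the structure being confirmed follows; `<workspace>` is a placeholder for wherever the starting kit was cloned, and only configs/ and human_eval/ are named explicitly in this thread:

```
<workspace>/          # used as the working directory when launching the evaluation
├── configs/          # evaluation configs shipped with the starting kit
├── human_eval/       # HumanEval task files shipped with the starting kit
└── data/             # downloaded evaluation data, placed alongside configs/ and human_eval/
```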

sriyachakravarthy commented 1 month ago

By "this directory", do you mean '/EdgeDeviceLLMCompetition-Starting-Kit/opencompass'?

sriyachakravarthy commented 1 month ago

Thank you for the help! It worked for phi2. Were the other models tested as well? We evaluated Qwen2-7B and Llama3-8B on the commonsense_qa dataset and got 0% accuracy. Could you please confirm?