grayJiaaoLi closed this issue 1 month ago
Here is a comparison of some prominent base LLM models (models that are already fine-tuned are marked with *). The following points were considered when selecting the models:
Considering the above, the proposed models to start with would be:
Model | Model size | HuggingFace Avg | ARC | HellaSwag | MMLU | HumanEval | AGIEval (chat) | License |
---|---|---|---|---|---|---|---|---|
Gemma | 7b | 64.3 | 61 | 82.5 | 66 | 32.3 | 41.7 | Gemma (allows redistribution provided the required notice is included) |
Llama3 | 70b | 77.8 | 71.42 | 85.7 | 80 | 81.7 | 63 | Llama (allows redistribution provided the required notice is included) |
Llama3-instruct | 8b | 66.8 | 60.7 | 78.5 | 67.07 | 62.2 | NA | Llama (allows redistribution provided the required notice is included) |
Mistral | 7b | 61 | 60 | 83 | 64 | NA | NA | Apache 2.0 |
Calme-7B-Instruct-v0.9* | 7b | 76 | 73 | 89 | 64 | NA | NA | Apache 2.0 |
Mixtral-8x22b-Instruct* | 141b | 79.1 | 72.7 | 89 | 77.7 | NA | NA | Apache 2.0 |
Zephyr-orpo-141b-A35b* | 141b | NA | NA | NA | NA | NA | 44.16 | Apache 2.0 |
Usually such datasets are in a .json format with the following form:

```json
{
  "prompt": "What are the three most important things to consider when deciding what technology to use to build an assist device to help an elderly person with basic needs?",
  "response": "To build an assistive device to help an elderly person with basic needs, one must consider three crucial things:..."
}
```
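A minimal sketch for reading such a file, assuming it contains a list of objects with the `prompt` and `response` keys shown above (the function name is only illustrative):

```python
import json

def load_qa_pairs(path):
    """Load (prompt, response) pairs from a JSON file containing
    a list of objects with "prompt" and "response" keys."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [(r["prompt"], r["response"]) for r in records]
```

The returned list of tuples can then be fed into whatever fine-tuning pipeline is chosen.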
So we need to extract question and answer pairs from our data files to be able to train the LLM. The challenge is whether the generated Q&As will cover all of our text data.
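One rough way to sanity-check that coverage question is a word-overlap heuristic: what fraction of the source text's vocabulary shows up in the generated Q&A pairs. This is an illustrative assumption, not a rigorous coverage metric:

```python
def coverage_ratio(source_text, qa_pairs):
    """Rough coverage heuristic: fraction of unique source words that
    appear in at least one generated question or answer."""
    source_words = set(source_text.lower().split())
    if not source_words:
        return 0.0
    qa_words = set()
    for question, answer in qa_pairs:
        qa_words.update(question.lower().split())
        qa_words.update(answer.lower().split())
    return len(source_words & qa_words) / len(source_words)
```

A low ratio would flag sections of the data that no generated Q&A touches; a proper check would use sentence- or chunk-level alignment rather than bare word overlap.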
User story
Acceptance criteria
The evaluation process fulfills the following requirements:
The evaluation process is aligned with the project requirements and focuses on the following abilities:
Question Answering
Text Generation
The evaluation process covers at least one of the following perspectives:
General criteria (Evaluation Matrix, Scoring System, ...)
Quantitative benchmark test (ARC, HellaSwag, MMLU, ...)
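The "general criteria" perspective could be sketched as a simple weighted scoring matrix. The criteria names and weights below are illustrative assumptions, not project-defined values:

```python
def weighted_score(scores, weights):
    """Combine per-criterion scores (each in [0, 1]) into a single
    weighted average. Both arguments are dicts keyed by criterion name."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

# Hypothetical criteria for question answering and text generation:
weights = {"answer_accuracy": 0.5, "fluency": 0.3, "relevance": 0.2}
scores = {"answer_accuracy": 0.8, "fluency": 0.9, "relevance": 0.7}
```

Keeping the weights explicit makes it easy to rebalance the matrix once the project requirements fix the actual criteria.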
The process can refer to the Eleuther AI Language Model Evaluation Harness and the HuggingFace LLM Leaderboard.
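For the quantitative benchmarks, one way to run them locally is via the harness's CLI. The model name here is only an example, and exact task names vary between harness versions, so check the harness documentation (e.g. `lm_eval --tasks list`) first:

```shell
# Evaluate a HuggingFace model on ARC-Challenge and HellaSwag with the
# Eleuther AI lm-evaluation-harness (pip install lm-eval).
lm_eval --model hf \
  --model_args pretrained=mistralai/Mistral-7B-v0.1 \
  --tasks arc_challenge,hellaswag \
  --batch_size 8
```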
DoD general criteria