grayJiaaoLi closed this issue 1 month ago
Here is a comparison of some prominent base LLM models (models that are already fine-tuned are marked with *). The following points were considered when selecting the models:
Considering the above, the proposed models to start with would be:
Model | Model size | HuggingFace Avg | ARC | HellaSwag | MMLU | HumanEval | AGIEval (chat) | License |
---|---|---|---|---|---|---|---|---|
Gemma | 7b | 64.3 | 61 | 82.5 | 66 | 32.3 | 41.7 | Gemma (allows redistribution provided the required notice is included) |
Llama3 | 70b | 77.8 | 71.42 | 85.7 | 80 | 81.7 | 63 | Llama (allows redistribution provided the required notice is included) |
Llama3-instruct | 8b | 66.8 | 60.7 | 78.5 | 67.07 | 62.2 | NA | Llama (allows redistribution provided the required notice is included) |
Mistral | 7b | 61 | 60 | 83 | 64 | NA | NA | Apache 2.0 |
Calme-7B-Instruct-v0.9* | 7b | 76 | 73 | 89 | 64 | NA | NA | Apache 2.0 |
Mixtral-8x22b-Instruct* | 141b | 79.1 | 72.7 | 89 | 77.7 | NA | NA | Apache 2.0 |
Zephyr-orpo-141b-A35b* | 141b | NA | NA | NA | NA | NA | 44.16 | Apache 2.0 |
Usually such datasets are in a .json format with the following form:

```json
{
  "prompt": "What are the three most important things to consider when deciding what technology to use to build an assist device to help an elderly person with basic needs?",
  "response": "To build an assistive device to help an elderly person with basic needs, one must consider three crucial things:..."
}
```
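A minimal sketch for reading such a file, assuming it contains a list of objects with the `prompt` and `response` keys shown above (the function name is only illustrative):

```python
import json

def load_qa_pairs(path):
    """Load (prompt, response) pairs from a JSON file containing
    a list of objects with "prompt" and "response" keys."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [(r["prompt"], r["response"]) for r in records]
```

The returned list of tuples can then be fed into whatever fine-tuning pipeline is chosen.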
So we need to extract question and answer pairs from our data files to be able to train the LLM. The challenge is whether the generated Q&As will cover all of our text data.
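One rough way to sanity-check that coverage question is a word-overlap heuristic: what fraction of the source text's vocabulary shows up in the generated Q&A pairs. This is an illustrative assumption, not a rigorous coverage metric:

```python
def coverage_ratio(source_text, qa_pairs):
    """Rough coverage heuristic: fraction of unique source words that
    appear in at least one generated question or answer."""
    source_words = set(source_text.lower().split())
    if not source_words:
        return 0.0
    qa_words = set()
    for question, answer in qa_pairs:
        qa_words.update(question.lower().split())
        qa_words.update(answer.lower().split())
    return len(source_words & qa_words) / len(source_words)
```

A low ratio would flag sections of the data that no generated Q&A touches; a proper check would use sentence- or chunk-level alignment rather than bare word overlap.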
User story
Acceptance criteria
The evaluation process fulfills the following requirements:
The evaluation process is aligned with the project requirements and focuses on the following abilities:
Question Answering
Text Generation
The evaluation process covers at least one of the following perspectives:
General criteria (Evaluation Matrix, Scoring System, ...)
Quantitative benchmark test (ARC, HellaSwag, MMLU, ...)
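The "general criteria" perspective could be sketched as a simple weighted scoring matrix. The criteria names and weights below are illustrative assumptions, not project-defined values:

```python
def weighted_score(scores, weights):
    """Combine per-criterion scores (each in [0, 1]) into a single
    weighted average. Both arguments are dicts keyed by criterion name."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

# Hypothetical criteria for question answering and text generation:
weights = {"answer_accuracy": 0.5, "fluency": 0.3, "relevance": 0.2}
scores = {"answer_accuracy": 0.8, "fluency": 0.9, "relevance": 0.7}
```

Keeping the weights explicit makes it easy to rebalance the matrix once the project requirements fix the actual criteria.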
The process can refer to the Eleuther AI Language Model Evaluation Harness and the HuggingFace LLM Leaderboard.
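For the quantitative benchmarks, one way to run them locally is via the harness's CLI. The model name here is only an example, and exact task names vary between harness versions, so check the harness documentation (e.g. `lm_eval --tasks list`) first:

```shell
# Evaluate a HuggingFace model on ARC-Challenge and HellaSwag with the
# Eleuther AI lm-evaluation-harness (pip install lm-eval).
lm_eval --model hf \
  --model_args pretrained=mistralai/Mistral-7B-v0.1 \
  --tasks arc_challenge,hellaswag \
  --batch_size 8
```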
DoD general criteria