EQ-bench / EQ-Bench

A benchmark for emotional intelligence in large language models
MIT License

model test request #20

Closed dnhkng closed 6 months ago

dnhkng commented 8 months ago

Now that the llama.cpp server is running correctly, would it be possible to have this model tested?

https://huggingface.co/Infinimol/miiqu-gguf using ChatML format and context length >= 1024, please :)

It is a model I have been working on for some time, and I think it's interesting. It is not a fine-tune, but a merge, and I find it consistently scores higher than the base model (miqu), which I think is a first for a pure merge model. EQ-Bench runs in about 15 mins on an A100.

The model is GGUF, split to fit under the 50GB limit on Huggingface, and the model card gives the one-liner to reassemble the file.

sam-paech commented 8 months ago

Very cool! I'll try to put it through its paces today.

Can I ask what the merge params were?

sam-paech commented 8 months ago

Btw are you able to upload the full weights? Lm-eval (what I use for the MAGI metric) is super slow through llama.cpp

dnhkng commented 8 months ago

No full weights. I figured out how to model merge directly from EXL2 to avoid requantization; this can also be done dynamically (this is my reddit post on it), but it needs a modified version of ooba. I am also working together with a llama.cpp developer on a PR for the same for GGUF.

I also don't have the full weights, because I tried over 6000 permutations, and without dynamic merges it would be months of compute 😅

This particular merge gives me an average score of almost 84, using dynamic merging of EXL2. The static merges of EXL2 and GGUF give ~83.4. Not sure why it's slightly lower, I need to look into it. But when I started building the GGUF merging pipeline, I discovered the backend issue #16

If full weights are needed, I can try and make some, but I think it will take a few hours to download, merge and upload to Huggingface. Before I start that, would using the EXL2 version be faster? https://huggingface.co/Infinimol/miiqu-exl2

sam-paech commented 8 months ago

Fair enough. Oh, that's cool that you can do dynamic merges!

I don't think eleuther eval harness supports EXL2. With gguf it said it was going to take 73 hours (on an A100) so that's not gonna work :(

If you are able to make the full weights, I'm sure others would make use of them as well if your model gets popular.

dnhkng commented 8 months ago

OK, if it takes 73 hours, then I'll make them for you :)

Did you get the eq-eval test done?

sam-paech commented 8 months ago

Yeah it scored 83.17 with chatml via ooba. Do you think it would score higher if I used EXL2?

dnhkng commented 8 months ago

I'm still trying to figure out why I get differences.

I built my own evaluation system, and maybe it's something useful for eq-bench v3.

It works like this: when you have a rating from 1-10, the scores are a) discrete and b) easy to mess up by the LLM.

So I use a descriptive rating, e.g. for your system:

0: does not experience the emotion
1: subconsciously experiences the emotion, but does not recognise it
2: starts to feel the emotion, but not enough to change behaviour
...

Then, I run the LLM and only select the logits for the symbols 0-9. These 10 logit values can then be converted to probabilities that sum to 1 for just these values. Lastly, you can use these probabilities as weights to get a continuous score. For example, say the model gave zero probability to every score except 50% for 5 and 50% for 6, then we get a score of 5.5. This has the effect of reducing the standard deviation, as with sampling the model would otherwise pick 5 half the time and 6 half the time.
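A minimal Python sketch of the aggregation step described above, assuming the ten logits for the tokens "0" to "9" have already been pulled out of the backend (how to get them is backend-specific and not shown). The function name `expected_score` is made up for illustration and is not part of EQ-Bench.

```python
import math


def expected_score(digit_logits):
    """Probability-weighted score from the logits of the tokens '0'..'9'."""
    if len(digit_logits) != 10:
        raise ValueError("expected exactly 10 logits, one per digit 0-9")
    # Softmax restricted to the ten digit tokens (shifted by the max for stability).
    peak = max(digit_logits)
    exps = [math.exp(x - peak) for x in digit_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Use the probabilities as weights over the scores 0..9.
    return sum(score * p for score, p in enumerate(probs))


# The example from the comment: all mass split evenly between 5 and 6.
NEG_INF = float("-inf")
logits = [NEG_INF] * 10
logits[5] = 0.0
logits[6] = 0.0
print(expected_score(logits))  # 5.5
```

Because the score is an expectation rather than a single sampled digit, repeated runs on the same prompt give the same value, which is where the lower variance comes from.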

dnhkng commented 8 months ago

> Yeah it scored 83.17 with chatml via ooba. Do you think it would score higher if I used EXL2?

It might do, but it's still lower than my dynamic merges. I will put my fork of exllamaV2 up, and add in the integration for EQ-Bench. Then you can test on that; it's all done, but I figured it's too custom a case to make a PR.

The huge benefit is that it reuses GPU weights! i.e. if you can run a 70B model, you can also run the 120B self-merge, as the weights are reused :)

sam-paech commented 8 months ago

Oh I like that approach. Is that something you added to the eq-bench questions, or another test?

I did try converting the test to work with eleuther harness using logprobs evaluation, which is more or less what you are describing except the targets were just the raw numbers like [emotion]: 0, etc. It didn't work very well. Maybe the text explanation plus your score aggregation method would produce better results.

> The huge benefit is that it reuses GPU weights! i.e. if you can run a 70B model, you can also run the 120B self-merge, as the weights are reused :)

That's pretty amazing. How far do you think this could be pushed, in terms of scaling up the layer reuse? If it only increases computation time of inference I would be curious to just crank up the layer count and see what happens.

dnhkng commented 8 months ago

> Oh I like that approach. Is that something you added to the eq-bench questions, or another test?

This was for my own test, but it would be straightforward to add to eq-bench on backends that support logits.

> I did try converting the test to work with eleuther harness using logprobs evaluation, which is more or less what you are describing except the targets were just the raw numbers like [emotion]: 0, etc. It didn't work very well. Maybe the text explanation plus your score aggregation method would produce better results.

Yes, this only works if you add a description of the values. Otherwise, the model doesn't know the ranges, and the score ranges get compressed.

> The huge benefit is that it reuses GPU weights! i.e. if you can run a 70B model, you can also run the 120B self-merge, as the weights are reused :)

> That's pretty amazing. How far do you think this could be pushed, in terms of scaling up the layer reuse? If it only increases computation time of inference I would be curious to just crank up the layer count and see what happens.

I've been trying this a lot. You can generally change creativity, but only a few merge variants (<0.1%) have increased eq-bench scores so far.

dnhkng commented 8 months ago

@sqrkl OK, f16 is up now, at: https://huggingface.co/Infinimol/miiqu-f16

Very curious to see how it does at MAGI... Can we really make models smarter without fine tuning?

sam-paech commented 7 months ago

|Tasks|Version|Filter|n-shot| Metric |Value |   |Stderr|
|-----|------:|------|-----:|--------|-----:|---|-----:|
|magi |      1|none  |     0|acc     |0.6328|±  |0.0085|
|     |       |none  |     0|acc_norm|0.6322|±  |0.0085|

It seems at the very least you didn't make it less smart, which seems to be a difficult thing with these frankenmerges. It scored pretty much exactly the same as Miqu.

I ran it through the creative writing benchmark as well, it scored 65.5 compared to 69.5 for mistral-medium. It was a bit prone to hallucinations with some of the prompts which is probably why the lower score there.

dnhkng commented 7 months ago

I noticed you need to lower the temperature a lot to chat with it, but then it seems much better than miqu.

My 'hands on' testing is starting with a character card in ooba, and getting the model to write both sides of the conversation.

With the model merges, I found it's necessary to raise minP and lower the temp to about 0.3.
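To make those two settings concrete, here is a rough Python sketch of what lowering the temperature and raising min-p do to a token distribution. It assumes the common definition of min-p (drop tokens whose probability is below `min_p` times the top token's probability) and applies temperature first; real backends differ in the order of these steps, and the `min_p=0.1` default is only an illustrative value, not the setting used here.

```python
import math


def min_p_filter(logits, temperature=0.3, min_p=0.1):
    """Renormalised probabilities after temperature scaling and min-p filtering."""
    # Lower temperature sharpens the distribution before filtering.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Min-p: keep only tokens at least min_p times as likely as the top token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


toy_logits = [2.0, 1.0, 0.0]
print(min_p_filter(toy_logits, temperature=1.0))  # all three tokens survive
print(min_p_filter(toy_logits, temperature=0.3))  # only the top token survives
```

With the toy logits, temperature 0.3 leaves only the top token above the min-p cut, which is the kind of tightening that keeps low-probability oddities out of the output.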

Most model merges degenerate into giggling fits. One character makes a weird mistake (repeats itself or leaves out spaces between words), and the other character finds that so hilarious, they both end up laughing hysterically.

In some other merges, the characters start alliterating or using huge streams of adjectives in every sentence (e.g. the gorgeous, greenish, graceful goose leapt loosely, lively, lovingly like the splendid, soulful shining sun...)

But with lower temperature, this model is much more creative than base Miqu, while staying in character. I will give you the best settings later today.

sam-paech commented 7 months ago

That's hilarious. Sounds like merging makes for an intoxicated miqu.

People seem to like these big merges for writing more than anything else (after all they tend to do worse on benchmarks). I'm hoping the creative writing benchmark can capture this thing that other benchmarks seem to miss.

dnhkng commented 7 months ago

> That's hilarious. Sounds like merging makes for an intoxicated miqu.

Yes, it has a very 'stoned' vibe sometimes! Where the repeats come from in the model changes how the model behaves. The Infinimol/miiqu model has the least 'personality' change but improved the EQ-Bench score. Other merges increased writing creativity. I plan on mixing the merges once I get some more compute.

> I ran it through the creative writing benchmark as well, it scored 65.5 compared to 69.5 for mistral-medium. It was a bit prone to hallucinations with some of the prompts which is probably why the lower score there.

Yep, it needs a lower temperature to stay on track.

Will you do a write-up for MAGI? I don't plan on fine-tuning this model but only work with self-merges for the time being. So I would like to evaluate using MAGI before I submit another model.

dnhkng commented 7 months ago

Anyway, I'm still pretty happy: highest-scoring open-source model without fine-tuning! I might write a short paper up on the topic, referring to this result.

The EQ-Bench score was lower than in my tests. Using dynamic merges in ooba with 4 repeats, I scored: [83.36, 84.42, 83.92, 84.3]. I'll investigate why the scores are lower. Maybe my merging code messed something up for the static merge.
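As a quick sanity check on the size of that gap, here is a small Python sketch that compares the four dynamic-merge runs quoted above with the 83.17 reported earlier in the thread for the static GGUF. Four runs is a tiny sample, so this is only a rough illustration, not a significance test.

```python
from statistics import mean, stdev

# Scores quoted in this thread.
dynamic_runs = [83.36, 84.42, 83.92, 84.3]  # dynamic EXL2 merges, 4 repeats
static_run = 83.17                          # static merge, chatml via ooba

run_mean = mean(dynamic_runs)
run_sd = stdev(dynamic_runs)                # sample standard deviation
gap_in_sd = (run_mean - static_run) / run_sd

print(f"mean={run_mean:.2f} sd={run_sd:.2f} gap={gap_in_sd:.2f} sd")
```

The static result sits below every one of the four dynamic runs, so the difference looks real, but more repeats of the static run would be needed to say by how much.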

sam-paech commented 7 months ago

That's definitely a success! It's evidently not easy to create merges without being destructive to reasoning (and benchmarks).

I'm happy to redo the eq-bench score; I just want to make sure it's reproducible to people who are downloading weights, i.e. not having to use non-standardised inferencing methods (like dynamic merges).

I do actually have a write-up of the MAGI subset in the works. If you are inclined I would value any feedback on it (or otherwise just read at your leisure): https://docs.google.com/document/d/1A2KTDHXX7Qyuwd5HBKiZl0Pg7QpIOMRqKAZg94l3FbI/edit

dnhkng commented 7 months ago

> That's definitely a success! It's evidently not easy to create merges without being destructive to reasoning (and benchmarks).

Yes, I am sure about this. I have lots to write about the topic!

> I'm happy to redo the eq-bench score; I just want to make sure it's reproducible to people who are downloading weights, i.e. not having to use non-standardised inferencing methods (like dynamic merges).

I've made a PR with ExllamaV2 to have this added. But it's actually a very small set of changes needed to ooba to make this work, where you have frankenmerge parameters in the UI: [image]

This is my local fork, and I can make dynamic merges to any EXL2 model, setting the repeat range and the number of repeats. You can dynamically build frankenmerges during a single chat, and it takes just a few seconds to do so.

If this gets merged, using a self-frankenmerge will mean just playing with these parameters, no more downloading more models.

sam-paech commented 7 months ago

That is super cool. Is this fork public? I know some people who do a lot of merges who would be interested in playing with this.

dnhkng commented 7 months ago

> That is super cool. Is this fork public? I know some people who do a lot of merges who would be interested in playing with this.

Not yet, I'll tidy up and push it this week, if you want to play with it.

> I do actually have a write-up of the MAGI subset in the works. If you are inclined I would value any feedback on it (or otherwise just read at your leisure): https://docs.google.com/document/d/1A2KTDHXX7Qyuwd5HBKiZl0Pg7QpIOMRqKAZg94l3FbI/edit

Had a look, seems very relevant! Any chance you could go further, and generate an absolute minimal number of tests for a first pass? i.e. 10 tests where, if you pass more than five, you are not totally useless? That would be very useful for sorting through thousands of models that get 'dumber' with merging.

sam-paech commented 7 months ago

Mm maybe. Tinybenchmarks is doing this:

https://arxiv.org/abs/2402.14992

It may be possible to push it even further towards an ultra small test set which is maximally discriminative. But there are big tradeoffs for reducing the test set down to this level. But yeah, if you are just wanting a quick indicator, tinybenchmarks is a good place to start.

dnhkng commented 7 months ago

@sqrkl Just wrote you by email about this thread!

For dynamic merges, pull my repo: https://github.com/dnhkng/text-generation-webui

Load the model with Exllamav2, not Exllamav2_HF, and once loaded, you can set the section of layers you want repeated.

You set 'start' and 'stop' positions and the number of 'repeats', and the model will generate a merged model of the form:

list(range(stop)) + int(repeats)*(list(range(start,stop))) + list(range(stop,num_layers))
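To see what that expression produces, here is a small Python sketch that wraps it in a function and runs it on a toy 10-layer model. The function name `merged_layer_order` and the toy numbers are made up for illustration; the fork's actual code is not shown in this thread.

```python
def merged_layer_order(num_layers, start, stop, repeats):
    """Layer indices of the self-merge, using the expression quoted above."""
    if not 0 <= start < stop <= num_layers:
        raise ValueError("need 0 <= start < stop <= num_layers")
    if repeats < 0:
        raise ValueError("repeats must be non-negative")
    return (
        list(range(stop))                          # layers 0..stop-1, once
        + int(repeats) * list(range(start, stop))  # the repeated section
        + list(range(stop, num_layers))            # the remaining layers
    )


order = merged_layer_order(num_layers=10, start=3, stop=6, repeats=2)
print(order)            # [0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9]
print(len(order))       # 16 layers are executed...
print(len(set(order)))  # ...but only the original 10 are distinct
```

Since the list only holds indices into the existing layers, the merged model runs more layers per token without storing any extra weights, which matches the weight reuse mentioned earlier in the thread.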

sam-paech commented 7 months ago

So cool! I'm going to try this out.

sam-paech commented 7 months ago

Oh and do you mind if I share this?

dnhkng commented 7 months ago

Not sure if you are familiar with this: https://github.com/huggingface/hf_transfer

Very handy for 10x download speeds.

Can you give some tips on using LM_eval_harness? I'm not sure how you used it for MAGI, e.g. did you use bitsandbytes?

Could you share the parameters to start the tests?

sam-paech commented 7 months ago

Yep love using hf_transfer to max out the bandwidth of those runpods. :)

lm-eval can be a bit fiddly to get working. Here's the list of stuff I paste into a runpod to get it working from scratch:

git clone https://github.com/sqrkl/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
pip install gekko sentencepiece hf_transfer einops optimum accelerate bitsandbytes tiktoken flash_attn transformers_stream_generator
export HF_HUB_ENABLE_HF_TRANSFER=1
export NUMEXPR_MAX_THREADS=64

And the lm-eval command:

lm_eval --model hf --model_args pretrained=Infinimol/miiqu-f16,load_in_4bit=True,max_length=4096,trust_remote_code=True,bnb_4bit_compute_dtype=bfloat16 --tasks magi,eq_bench --device cuda:0 --batch_size 8 --verbosity DEBUG --log_samples --output_path output/Infinimol__miiqu-f16 --use_cache sqlite_cache_Infinimol__miiqu-f16

You can load in 8bit or full weights if you have the vram for it.

The --log_samples saves the full output of the test including all the model's answers. Unnecessary for most uses but I find it's helpful for debugging sometimes.

The sqlite cache allows the test to retry if it fails.

Batch size: I usually start by setting it to auto:9, which means it will recalculate the optimal batch size 9 times as it goes along. Since lm-eval orders the test set by size (largest first), the max batch size will start out small and get bigger, so if it's going to be a long eval time it pays to have the batch size automatically resize.

The downside is that it often gets it wrong and you end up with out of memory errors. In that case you have to set it manually and use trial & error.

I've read that batch sizes > 1 can affect score negatively, but at least in my limited comparisons it's been negligible. Running lm-eval with batch size 1 is suuuuper slow.
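If you end up doing the trial and error by hand, a small Python helper can at least keep the command consistent between attempts. This is only a sketch: it uses the same lm_eval flags as the command above, while the helper names (`build_lm_eval_cmd`, `batch_size_fallbacks`) and the halving schedule are made up for illustration.

```python
import shlex


def build_lm_eval_cmd(pretrained, tasks, batch_size="auto:9", load_in_4bit=True):
    """Assemble the lm_eval argument list used above for one batch size."""
    safe_name = pretrained.replace("/", "__")
    model_args = [f"pretrained={pretrained}", "max_length=4096", "trust_remote_code=True"]
    if load_in_4bit:
        model_args += ["load_in_4bit=True", "bnb_4bit_compute_dtype=bfloat16"]
    return [
        "lm_eval", "--model", "hf",
        "--model_args", ",".join(model_args),
        "--tasks", ",".join(tasks),
        "--device", "cuda:0",
        "--batch_size", str(batch_size),
        "--log_samples",
        "--output_path", f"output/{safe_name}",
        "--use_cache", f"sqlite_cache_{safe_name}",
    ]


def batch_size_fallbacks(first="auto:9", manual_start=8):
    """Yield batch sizes to try in order: auto first, then 8, 4, 2, 1."""
    yield first
    size = manual_start
    while size >= 1:
        yield str(size)
        size //= 2


for batch_size in batch_size_fallbacks():
    cmd = build_lm_eval_cmd("Infinimol/miiqu-f16", ["magi", "eq_bench"], batch_size)
    print(shlex.join(cmd))  # paste into the shell; move to the next line on OOM
```

Because the sqlite cache path stays the same across attempts, a rerun with a smaller batch size can pick up where the failed one left off.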

You can run it with llama.cpp server, but it's slow because it forces batch size 1. But that might be fine if it's running locally and you can just leave it overnight. The command looks something like this:

lm_eval --model gguf --model_args base_url=http://localhost:8000 --tasks winogrande --batch_size auto:9 --verbosity DEBUG --log_samples --output_path output/miqu --use_cache sqlite_cache_miqu