arcee-ai / mergekit

Tools for merging pretrained large language models.
GNU Lesser General Public License v3.0

Sign consensus doesn't return an `all True` mask #231

Closed. NeonBohdan closed this issue 3 months ago

NeonBohdan commented 3 months ago

I decided to debug the merge YAML below.

Clearly it skips sparsification (density: 1.0) and just scales the task vector down by the 0.75 weight.

But it also runs sign consensus. With only one task vector, the mask should simply be all True.

Instead, some percentage of the entries are False: bfloat16: 7%, float32: 1e-6%.

Even float32 can't manage an all-True mask.

Can the code responsible for this be updated to have higher accuracy? https://github.com/arcee-ai/mergekit/blob/9a541798231dc4c1e088caf271b04474685e4dcb/mergekit/merge_methods/generalized_task_arithmetic.py#L196

models:
  - model: mlabonne/NeuralHermes-2.5-Mistral-7B
    parameters:
      density: 1.0
      weight: 0.75
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
dtype: float32
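
For reference, a minimal sketch of the effect being described. This is not mergekit's actual get_mask code, just a per-element sign election in the spirit of TIES/dare_ties; the tensors are synthetic stand-ins for real checkpoints, so the exact percentages will not match the 7% / 1e-6% above, but the bfloat16 vs float32 gap shows up the same way:

```python
import torch

torch.manual_seed(0)

# Synthetic stand-ins for base and fine-tuned weights; a real merge would use
# tensors from mistralai/Mistral-7B-v0.1 and the fine-tuned checkpoint.
base = torch.randn(1_000_000)
finetuned = base + torch.randn_like(base) * 1e-2  # small fine-tuning deltas

for dtype in (torch.bfloat16, torch.float32):
    # Task vector computed in the merge dtype (the `dtype:` field in the YAML).
    delta = (finetuned.to(dtype) - base.to(dtype)) * 0.75   # weight: 0.75

    # Per-element sign election; with a single task vector the "sum over
    # models" is just delta itself.
    signs = delta.sign()                         # 0 wherever delta is exactly 0
    elected = (delta >= 0).to(dtype) * 2 - 1     # always +1 or -1, never 0
    mask = signs == elected                      # False exactly where delta == 0

    frac_false = (~mask).float().mean().item()
    frac_zero = (delta == 0).float().mean().item()
    print(f"{dtype}: {frac_false:.4%} False in mask, {frac_zero:.4%} exact zeros")
```

In this sketch the False entries line up exactly with the positions where the task vector is exactly zero, which is the point cg123 makes further down in the thread.
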
NeonBohdan commented 3 months ago

Already fixed with https://github.com/arcee-ai/mergekit/pull/186

NeonBohdan commented 3 months ago

With this PR applied (https://github.com/arcee-ai/mergekit/pull/186): bfloat16: 7% -> 7%, float32: 1e-6% -> 1e-7%.

That improved the situation for float32, but not a bit for bfloat16. It's still very high: for single-model "merges", 7% of the values will just be dropped.

Can this be optimized? https://github.com/arcee-ai/mergekit/blob/9a541798231dc4c1e088caf271b04474685e4dcb/mergekit/merge_methods/generalized_task_arithmetic.py#L1

It's always possible to use dare_linear or task_arithmetic instead, but can the numerical accuracy here still be improved?
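
For completeness, an illustrative variant of the config above showing that workaround: the same merge with merge_method switched to dare_linear, which (as I understand it) applies the DARE dropout/rescale without the sign-election step. This is just a sketch of the alternative, not a recommendation from the thread:

```yaml
models:
  - model: mlabonne/NeuralHermes-2.5-Mistral-7B
    parameters:
      density: 1.0
      weight: 0.75
merge_method: dare_linear  # no sign-consensus step; task_arithmetic is the other option mentioned
base_model: mistralai/Mistral-7B-v0.1
dtype: float32
```

With density 1.0 the DARE dropout is also a no-op, so this should reduce to a plain weighted task-vector merge.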

cg123 commented 3 months ago

I suspect what's happening here is that some of the values in the task vector are exactly zero. The sign function returns 0 for an input of 0. This shouldn't be a problem: the end result is the same either way, as we'd just be multiplying 0 by ±1 instead of 0 by 0.

This would explain the difference between bfloat16 and float32 as well: the lower precision makes it much more likely that a task-vector entry ends up exactly zero.
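
A quick way to sanity-check that explanation (a toy sketch, not code from mergekit): positions where the task vector is exactly zero get sign 0, fail the comparison against the elected ±1 sign, and end up False in the mask, yet applying or not applying the mask leaves the merged delta unchanged.

```python
import torch

# Toy task vector with some exactly-zero entries, in bfloat16.
delta = torch.tensor([0.0, -0.3, 0.0, 0.5], dtype=torch.bfloat16)

signs = delta.sign()                             # 0 at the zero entries
elected = (delta >= 0).to(delta.dtype) * 2 - 1   # always +1 or -1, never 0
mask = signs == elected                          # False exactly where delta == 0

# Masking those entries out changes nothing: they contribute 0 either way.
assert torch.equal(delta * mask, delta)
print(mask)  # tensor([False,  True, False,  True])
```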