ggerganov / llama.cpp

LLM inference in C/C++
MIT License
67.38k stars 9.67k forks

Finetuning not using Metal/Apple Silicon GPU #3799

Closed. fakerybakery closed this issue 7 months ago.

fakerybakery commented 1 year ago

Hi, I'm running the latest version of llama.cpp (cloned yesterday from the Git repo) on macOS Sonoma 14.0 on an M1 MacBook Pro. I tried to finetune a Llama model, and the training worked, but it was extremely slow, and Activity Monitor did not show any GPU usage for the finetune process even though it was using most of my CPU. Here is my command:

$ ./finetune --model-base models/tinyllama/tinyllama-1.1b-intermediate-step-480k-1t.Q4_0.gguf --train-data shakespeare.txt --lora-out shakespeare.gguf --save-every 0 --threads 18 --ctx 32 --batch 32 --use-checkpointing --use-flash --sample-start "\n" --escape

Here is the console output:

main: seed: 1698285516
main: model base = 'models/tinyllama/tinyllama-1.1b-intermediate-step-480k-1t.Q4_0.gguf'
llama_model_loader: loaded meta data with 19 key-value pairs and 201 tensors from models/tinyllama/tinyllama-1.1b-intermediate-step-480k-1t.Q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                    output.weight q6_K     [  2048, 32000,     1,     1 ]
llama_model_loader: - tensor    1:                token_embd.weight q4_0     [  2048, 32000,     1,     1 ]
llama_model_loader: - tensor    2:           blk.0.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor    6:         blk.0.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor    8:            blk.0.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor    9:              blk.0.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   10:            blk.0.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor   11:           blk.1.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   14:              blk.1.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   15:         blk.1.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   16:            blk.1.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   17:            blk.1.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   18:              blk.1.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   19:            blk.1.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor   20:           blk.2.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   22:              blk.2.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   23:              blk.2.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   24:         blk.2.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   25:            blk.2.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   26:            blk.2.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   27:              blk.2.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   28:            blk.2.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor   29:           blk.3.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   31:              blk.3.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   32:              blk.3.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   33:         blk.3.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   34:            blk.3.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   35:            blk.3.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   36:              blk.3.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   37:            blk.3.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor   38:           blk.4.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   40:              blk.4.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   41:              blk.4.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   42:         blk.4.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   43:            blk.4.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   44:            blk.4.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   45:              blk.4.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   46:            blk.4.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor   47:           blk.5.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   49:              blk.5.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   50:              blk.5.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   51:         blk.5.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   52:            blk.5.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   53:            blk.5.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   54:              blk.5.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   55:            blk.5.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor   56:           blk.6.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   58:              blk.6.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   59:              blk.6.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   60:         blk.6.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   61:            blk.6.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   62:            blk.6.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   63:              blk.6.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   64:            blk.6.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor   65:           blk.7.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   67:              blk.7.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   68:              blk.7.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   69:         blk.7.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   70:            blk.7.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   71:            blk.7.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   72:              blk.7.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   73:            blk.7.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor   74:           blk.8.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   76:              blk.8.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   77:              blk.8.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   78:         blk.8.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   79:            blk.8.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   80:            blk.8.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   81:              blk.8.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   82:            blk.8.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor   83:           blk.9.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   85:              blk.9.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   86:              blk.9.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   87:         blk.9.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   88:            blk.9.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   89:            blk.9.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   90:              blk.9.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   91:            blk.9.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor   92:          blk.10.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   94:             blk.10.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   95:             blk.10.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor   96:        blk.10.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor   97:           blk.10.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor   98:           blk.10.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor   99:             blk.10.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  100:           blk.10.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  101:          blk.11.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  103:             blk.11.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  104:             blk.11.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  105:        blk.11.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  106:           blk.11.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  107:           blk.11.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  108:             blk.11.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  109:           blk.11.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  110:          blk.12.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  112:             blk.12.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  113:             blk.12.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  114:        blk.12.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  115:           blk.12.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  116:           blk.12.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  117:             blk.12.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  118:           blk.12.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  119:          blk.13.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  121:             blk.13.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  122:             blk.13.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  123:        blk.13.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  124:           blk.13.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  125:           blk.13.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  126:             blk.13.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  127:           blk.13.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  128:          blk.14.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  130:             blk.14.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  131:             blk.14.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  132:        blk.14.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  133:           blk.14.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  134:           blk.14.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  135:             blk.14.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  136:           blk.14.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  137:          blk.15.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  139:             blk.15.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  140:             blk.15.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  141:        blk.15.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  143:           blk.15.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  144:             blk.15.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  145:           blk.15.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  146:          blk.16.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  148:             blk.16.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  149:             blk.16.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  150:        blk.16.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  151:           blk.16.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  152:           blk.16.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  153:             blk.16.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  154:           blk.16.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  155:          blk.17.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  157:             blk.17.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  158:             blk.17.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  159:        blk.17.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  160:           blk.17.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  161:           blk.17.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  162:             blk.17.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  163:           blk.17.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  164:          blk.18.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  166:             blk.18.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  167:             blk.18.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  168:        blk.18.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  169:           blk.18.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  170:           blk.18.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  171:             blk.18.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  172:           blk.18.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  173:          blk.19.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  175:             blk.19.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  176:             blk.19.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  177:        blk.19.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  178:           blk.19.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  179:           blk.19.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  180:             blk.19.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  181:           blk.19.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  182:          blk.20.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  184:             blk.20.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  185:             blk.20.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  186:        blk.20.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  187:           blk.20.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  188:           blk.20.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  189:             blk.20.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  190:           blk.20.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  191:          blk.21.attn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_q.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  193:             blk.21.attn_k.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  194:             blk.21.attn_v.weight q4_0     [  2048,   256,     1,     1 ]
llama_model_loader: - tensor  195:        blk.21.attn_output.weight q4_0     [  2048,  2048,     1,     1 ]
llama_model_loader: - tensor  196:           blk.21.ffn_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - tensor  197:           blk.21.ffn_gate.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  198:             blk.21.ffn_up.weight q4_0     [  2048,  5632,     1,     1 ]
llama_model_loader: - tensor  199:           blk.21.ffn_down.weight q4_0     [  5632,  2048,     1,     1 ]
llama_model_loader: - tensor  200:               output_norm.weight f32      [  2048,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                          general.file_type u32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv  18:               general.quantization_version u32
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type q4_0:  155 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 1.10 B
llm_load_print_meta: model size       = 606.53 MiB (4.63 BPW)
llm_load_print_meta: general.name   = py007_tinyllama-1.1b-intermediate-step-480k-1t
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.07 MB
llm_load_tensors: mem required  =  606.59 MB
......................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =   11.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '<REDACTED>llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x13fe07f20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_add_row                        0x13fe088b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul                            0x13fe090d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_row                        0x13fe09980 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_scale                          0x13fe0a1c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_scale_4                        0x13fe0aa10 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_silu                           0x13fe0b1c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_relu                           0x13fe0b970 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_gelu                           0x13fe0c120 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max                       0x13fe0c650 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max_4                     0x13fe0cb80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_diag_mask_inf                  0x13fe0d220 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_diag_mask_inf_8                0x13fe0d7b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_f32                   0x13fe0dce0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_f16                   0x13fe0e210 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x141106640 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x141106b70 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_0                  0x1411070a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_1                  0x1411075d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q8_0                  0x141107d90 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x1411082c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x1411087f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x13fe0e620 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x13fe0ecd0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x13fe0f200 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rms_norm                       0x13fe0f730 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_norm                           0x13fe0fc60 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f32_f32                 0x13fe10390 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f16_f32                 0x13fe108c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_1row            0x13fe10e50 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_l4              0x13fe115f0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q4_0_f32                0x13fe11b20 | th_max =  896 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q4_1_f32                0x13fe12050 | th_max =  896 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q5_0_f32                0x13fe12580 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q5_1_f32                0x141204e40 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q8_0_f32                0x141205850 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q2_K_f32                0x141205d80 | th_max =  640 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q3_K_f32                0x1412062b0 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q4_K_f32                0x1412067e0 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q5_K_f32                0x141206d10 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mv_q6_K_f32                0x141207240 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_f32_f32                 0x141207770 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_f16_f32                 0x1411079e0 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32                0x141108ea0 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32                0x1411093d0 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_0_f32                0x141109900 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_1_f32                0x141109e30 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32                0x14110a670 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32                0x14110aba0 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32                0x14110b0d0 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32                0x14110b600 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32                0x14110bb30 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32                0x14110c060 | th_max =  768 | th_width =   32
ggml_metal_init: loaded kernel_rope_f32                       0x14110c590 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rope_f16                       0x14110cac0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_alibi_f32                      0x14110cff0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x14110d520 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x14110da50 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x14110df80 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_concat                         0x14110e4b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sqr                            0x14110ec60 | th_max = 1024 | th_width =   32
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 49152.00 MB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 72.63 MB
llama_new_context_with_model: max tensor size =    51.27 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =   607.23 MB, (  607.86 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =    11.02 MB, (  618.88 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =    66.52 MB, (  685.39 / 49152.00)
main: init model
print_params: n_vocab:   32000
print_params: n_ctx:     32
print_params: n_embd:    2048
print_params: n_ff:      5632
print_params: n_head:    32
print_params: n_head_kv: 4
print_params: n_layer:   22
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq             : 4
print_lora_params: n_rank_wk             : 4
print_lora_params: n_rank_wv             : 4
print_lora_params: n_rank_wo             : 4
print_lora_params: n_rank_ffn_norm       : 1
print_lora_params: n_rank_w1             : 4
print_lora_params: n_rank_w2             : 4
print_lora_params: n_rank_w3             : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm           : 1
print_lora_params: n_rank_output         : 4
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: lora_size = 28433632 bytes (27.1 MB)
main: opt_size  = 42223216 bytes (40.3 MB)
main: opt iter 0
main: input_size = 131076128 bytes (125.0 MB)
main: compute_size = 4010811616 bytes (3825.0 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: warning: found 2176 samples (min length 1) that are shorter than context length of 32.
tokenize_file: warning: found 293 empty samples.
tokenize_file: total number of samples: 2469
main: number of training tokens: 24798
main: number of unique tokens: 3131
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 2305192 bytes (2.2 MB)
train_opt_callback: iter=     0 sample=1/2469 sched=0.000000 loss=0.000000 |->
train_opt_callback: iter=     1 sample=33/2469 sched=0.010000 loss=14.714114 dt=00:01:04 eta=04:32:13 |->
<...>

Is Metal supported for finetuning yet? If so, is any configuration needed to get it working? Thanks in advance!

QueryType commented 1 year ago

I don't think Metal is supported (yet).

fakerybakery commented 1 year ago

@QueryType Thanks! I see Metal is on the roadmap.

QueryType commented 1 year ago

Yes, it is, but it's low priority. When I tried it on a Mac, it started to saturate the CPUs and trigger swap. I am trying to find the best combination of parameters to avoid that. GPU support would indeed be welcome.
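Something like the following is the kind of combination I have in mind: a rough, untested sketch that reuses the command from the top of this issue with only the thread count and batch size lowered, keeping --use-checkpointing and --use-flash on to limit the working set:

./finetune --model-base models/tinyllama/tinyllama-1.1b-intermediate-step-480k-1t.Q4_0.gguf --train-data shakespeare.txt --lora-out shakespeare.gguf --save-every 0 --threads 8 --ctx 32 --batch 8 --use-checkpointing --use-flash --sample-start "\n" --escape

A smaller --batch should shrink the roughly 3.8 GB compute buffer reported in the log above, at the cost of seeing fewer samples per iteration.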

Crear12 commented 12 months ago

> Yes, it is, but it's low priority. When I tried it on a Mac, it started to saturate the CPUs and trigger swap. I am trying to find the best combination of parameters to avoid that. GPU support would indeed be welcome.

Total rookie here. This is my command for using the GPU on an M1 Max with 64 GB; no swap was used:

./main -t 10 -ngl 32 -m ./models/airoboros-65B-gpt4-1.2.ggmlv3.q3_K_L.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "USER: What does llama do?\nASSISTANT:"

And here's the speed:

llama_print_timings:        load time =   53395.29 ms
llama_print_timings:      sample time =       3.72 ms /    41 runs   (    0.09 ms per token, 11030.40 tokens per second)
llama_print_timings: prompt eval time =     754.11 ms /    15 tokens (   50.27 ms per token,    19.89 tokens per second)
llama_print_timings:        eval time =   10442.69 ms /    40 runs   (  261.07 ms per token,     3.83 tokens per second)
llama_print_timings:       total time =   11208.28 ms
ggml_metal_free: deallocating

Hope it helps!

fakerybakery commented 12 months ago

Hi @Crear12, thanks for your info. I think it only covers inference and not finetuning :) but thanks for sharing!

QueryType commented 12 months ago

I opened a bug, https://github.com/ggerganov/llama.cpp/issues/3911, after -ngl was introduced in finetune, so it is still not practically usable. Maybe you can have a look.
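Roughly, the invocation is the command from the top of this issue with an -ngl value appended (shown here with 1 purely as an illustration; whether and how the layers actually get offloaded is what #3911 discusses):

./finetune --model-base models/tinyllama/tinyllama-1.1b-intermediate-step-480k-1t.Q4_0.gguf --train-data shakespeare.txt --lora-out shakespeare.gguf --save-every 0 --threads 18 --ctx 32 --batch 32 --use-checkpointing --use-flash --sample-start "\n" --escape -ngl 1

Even with offloading requested, it was not practical for me; see #3911 for details.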

jaslatendresse commented 11 months ago

@fakerybakery Hey, I'm interested in doing the same on a MacBook Pro M1 and was wondering if you could share what your data looks like, and what documentation you used to come up with your script, with the different parameters explained. Thanks.
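From the --sample-start "\n" and --escape flags in the command at the top, I'm guessing the training file is just plain text where every newline starts a new sample, something like this made-up snippet:

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate.

But it would be great to have that confirmed, along with where the parameters are documented.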

marabgol commented 10 months ago

Hey guys, while I keep an eye on the development of finetune for Metal, is there any example of finetuning for classification rather than pure text?

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.