certik / fastGPT

Fast GPT-2 inference written in Fortran

Use GGUF to store model weights #69

Closed: certik closed this issue 5 months ago

certik commented 5 months ago

Here are the two smallest models. First, 124M:

$ gguf-dump model_fastgpt_124M_v2.gguf
* Loading: model_fastgpt_124M_v2.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 4 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 22
      3: UINT64     |        1 | GGUF.kv_count = 1
      4: INT32      |        1 | general.data_offset = 1088

* Dumping 22 tensor(s)
      1:         12 |    12,     1,     1,     1 | I32     | header
      2:   38597376 |   768, 50257,     1,     1 | F32     | wte
      3:     786432 |   768,  1024,     1,     1 | F32     | wpe
      4:   28311552 |  3072,   768,    12,     1 | F32     | mlp_fc_w
      5:      36864 |  3072,    12,     1,     1 | F32     | mlp_fc_b
      6:   28311552 |   768,  3072,    12,     1 | F32     | mlp_proj_w
      7:       9216 |   768,    12,     1,     1 | F32     | mlp_proj_b
      8:   21233664 |  2304,   768,    12,     1 | F32     | attn_w
      9:      27648 |  2304,    12,     1,     1 | F32     | attn_b
     10:    7077888 |   768,   768,    12,     1 | F32     | attn_proj_w
     11:       9216 |   768,    12,     1,     1 | F32     | attn_proj_b
     12:       9216 |   768,    12,     1,     1 | F32     | ln1_b
     13:       9216 |   768,    12,     1,     1 | F32     | ln1_g
     14:       9216 |   768,    12,     1,     1 | F32     | ln2_b
     15:       9216 |   768,    12,     1,     1 | F32     | ln2_g
     16:        768 |   768,     1,     1,     1 | F32     | lnf_b
     17:        768 |   768,     1,     1,     1 | F32     | lnf_g
     18:      50258 | 50258,     1,     1,     1 | I32     | idx
     19:     356735 | 356735,    1,     1,     1 | I8      | decoder_txt
     20:      50002 | 50002,     1,     1,     1 | I32     | vocab_idx
     21:     406304 | 406304,    1,     1,     1 | I8      | vocab_txt
     22:        256 |   256,     1,     1,     1 | I32     | byte_decoder
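
The fixed header that gguf-dump prints first (version, tensor count, KV count) follows the GGUF spec: a 4-byte "GGUF" magic, then a little-endian uint32 version and two uint64 counts. A minimal Python sketch that reads just those fields, independent of any gguf library (nothing here is fastGPT-specific; the filename is the one dumped above):

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF file header: magic, version, tensor_count, kv_count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
        # little-endian: uint32 version, uint64 tensor_count, uint64 kv_count
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return version, tensor_count, kv_count

# Expected for the file above, per the dump: (3, 22, 1)
print(read_gguf_header("model_fastgpt_124M_v2.gguf"))
```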

And here is 355M:

$ gguf-dump model_fastgpt_355M_v2.gguf 
* Loading: model_fastgpt_355M_v2.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 4 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 22
      3: UINT64     |        1 | GGUF.kv_count = 1
      4: INT32      |        1 | general.data_offset = 1088

* Dumping 22 tensor(s)
      1:         12 |    12,     1,     1,     1 | I32     | header
      2:   51463168 |  1024, 50257,     1,     1 | F32     | wte
      3:    1048576 |  1024,  1024,     1,     1 | F32     | wpe
      4:  100663296 |  4096,  1024,    24,     1 | F32     | mlp_fc_w
      5:      98304 |  4096,    24,     1,     1 | F32     | mlp_fc_b
      6:  100663296 |  1024,  4096,    24,     1 | F32     | mlp_proj_w
      7:      24576 |  1024,    24,     1,     1 | F32     | mlp_proj_b
      8:   75497472 |  3072,  1024,    24,     1 | F32     | attn_w
      9:      73728 |  3072,    24,     1,     1 | F32     | attn_b
     10:   25165824 |  1024,  1024,    24,     1 | F32     | attn_proj_w
     11:      24576 |  1024,    24,     1,     1 | F32     | attn_proj_b
     12:      24576 |  1024,    24,     1,     1 | F32     | ln1_b
     13:      24576 |  1024,    24,     1,     1 | F32     | ln1_g
     14:      24576 |  1024,    24,     1,     1 | F32     | ln2_b
     15:      24576 |  1024,    24,     1,     1 | F32     | ln2_g
     16:       1024 |  1024,     1,     1,     1 | F32     | lnf_b
     17:       1024 |  1024,     1,     1,     1 | F32     | lnf_g
     18:      50258 | 50258,     1,     1,     1 | I32     | idx
     19:     356735 | 356735,    1,     1,     1 | I8      | decoder_txt
     20:      50002 | 50002,     1,     1,     1 | I32     | vocab_idx
     21:     406304 | 406304,    1,     1,     1 | I8      | vocab_txt
     22:        256 |   256,     1,     1,     1 | I32     | byte_decoder
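
A quick sanity check on the two dumps, using only numbers already shown: the element count in the first column is the product of the dimensions, and each per-layer weight packs all transformer layers into a single stacked tensor, with n_layer (12 for 124M, 24 for 355M) as one of the dims. A small Python verification for a few of the 124M tensors:

```python
from math import prod

# dims copied from the 124M dump above
tensors_124m = {
    "wte":      (768, 50257),     # n_embd x n_vocab
    "wpe":      (768, 1024),      # n_embd x n_ctx
    "mlp_fc_w": (3072, 768, 12),  # 4*n_embd x n_embd x n_layer
    "attn_w":   (2304, 768, 12),  # 3*n_embd x n_embd x n_layer
}
expected = {"wte": 38597376, "wpe": 786432,
            "mlp_fc_w": 28311552, "attn_w": 21233664}

for name, dims in tensors_124m.items():
    assert prod(dims) == expected[name], name
print("element counts match the gguf-dump output")
```

So the same 22 tensor names describe both models; only n_embd (768 vs. 1024) and n_layer (12 vs. 24) change, while n_ctx stays at 1024.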