certik / fastGPT

Fast GPT-2 inference written in Fortran
MIT License

Implement tokenizer in Fortran #34

Closed. certik closed this 1 year ago

certik commented 1 year ago

Fixes #1

TODO:

certik commented 1 year ago

Currently it prints:

$ ./gpt2 
Loading the model...
    done. Time:   0.106s
Model parameters:
n_vocab = 50257
n_ctx   =  1024
n_embd  =   768
n_layer =    12
n_head  =    12

Input parameters:
n_seq                =  19
n_tokens_to_generate =  20

Input tokens:
 36235 39141 18765  1143   326  9061   561   530  1110  1716   845  3665    11   475   772   339   714   407  5967
Decoded input as text:
Alan Turing theorized that computers would one day become very powerful, but even he could not imagine
 Encoded tokens
       36235       39141           0         326        9061         561         530        1110        1716         845        3665          11         475         772         339         714         407        5967

So it almost works; the only mismatch is the single 0 in the encoded tokens where " theorized" should encode to the pair 18765, 1143, presumably because multi-piece words are not being split into sub-word tokens yet.

certik commented 1 year ago

The tokens now agree:

$ ./gpt2 
Loading the model...
    done. Time:   0.107s
Model parameters:
n_vocab = 50257
n_ctx   =  1024
n_embd  =   768
n_layer =    12
n_head  =    12

Input parameters:
n_seq                =  19
n_tokens_to_generate =  20

Input tokens:
 36235 39141 18765  1143   326  9061   561   530  1110  1716   845  3665    11   475   772   339   714   407  5967
Decoded input as text:
Alan Turing theorized that computers would one day become very powerful, but even he could not imagine
 Encoded tokens
 36235 39141 18765  1143   326  9061   561   530  1110  1716   845  3665    11   475   772   339   714   407  5967

But the bpe function is just a stub for now; we still need to actually implement it.
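Roughly, the greedy merge loop it needs looks like the sketch below. This is a self-contained toy with hypothetical names and a three-entry merge table, not fastGPT's actual bpe() interface; the real merge ranks come from the GPT-2 model files. Each iteration finds the adjacent pair of pieces with the lowest merge rank and fuses it, until no adjacent pair remains in the merge table.

```fortran
! Minimal sketch of greedy BPE merging with a toy merge table.
! (Hypothetical module/routine names; the real vocabulary and merge
! list come from the GPT-2 model files.)
module bpe_sketch_m
implicit none
! Toy merge table: pair i is (merges(i,1), merges(i,2)), rank = i.
character(len=16), parameter :: merges(3,2) = reshape( &
    [character(len=16) :: "l", "lo", "e", "o", "w", "r"], [3, 2])
contains

! Rank of the pair (a, b) in the merge table, or huge(1) if absent.
integer function merge_rank(a, b) result(r)
character(len=*), intent(in) :: a, b
integer :: i
r = huge(1)
do i = 1, size(merges, 1)
    if (trim(merges(i,1)) == a .and. trim(merges(i,2)) == b) then
        r = i
        return
    end if
end do
end function

! Greedily merge the lowest-ranked adjacent pair until none remains.
subroutine bpe(pieces, n)
character(len=16), intent(inout) :: pieces(:)
integer, intent(inout) :: n
integer :: i, best_i, best_rank, r
do
    ! Find the adjacent pair with the lowest merge rank
    best_rank = huge(1); best_i = 0
    do i = 1, n-1
        r = merge_rank(trim(pieces(i)), trim(pieces(i+1)))
        if (r < best_rank) then
            best_rank = r; best_i = i
        end if
    end do
    if (best_i == 0) exit
    ! Fuse the best pair and shift the remaining pieces down
    pieces(best_i) = trim(pieces(best_i)) // trim(pieces(best_i+1))
    do i = best_i+1, n-1
        pieces(i) = pieces(i+1)
    end do
    n = n - 1
end do
end subroutine

end module

program bpe_demo
use bpe_sketch_m
implicit none
character(len=16) :: pieces(8)
integer :: n, i
! Start from single characters of the word "lower"
pieces(1:5) = [character(len=16) :: "l", "o", "w", "e", "r"]
n = 5
call bpe(pieces, n)
print "(*(a,1x))", (trim(pieces(i)), i=1, n)  ! prints: low er
end program
```

With the toy table above, "lower" merges as l o w e r -> lo w e r -> low e r -> low er, which is the shape of the splitting that should turn " theorized" into " theor" + "ized" once the real merge list is used.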

certik commented 1 year ago

I think the tokenizer now works:

$ ./gpt2 
Loading the model...
    done. Time:   0.103s
Model parameters:
n_vocab = 50257
n_ctx   =  1024
n_embd  =   768
n_layer =    12
n_head  =    12

Input parameters:
n_seq                =  19
n_tokens_to_generate =  20

Input tokens:
 36235 39141 18765  1143   326  9061   561   530  1110  1716   845  3665    11   475   772   339   714   407  5967
Decoded input as text:
Alan Turing theorized that computers would one day become very powerful, but even he could not imagine
 Encoded tokens
     (Currently we use O(n) vocabulary lookup instead of O(1) -> very SLOW)
 36235 39141 18765  1143   326  9061   561   530  1110  1716   845  3665    11   475   772   339   714   407  5967
Running model...
At line 268 of file /Users/ondrej/repos/fastGPT/gpt2.f90
Fortran runtime warning: An array temporary was created for argument 'kv_cache' of procedure 'gpt2'
At line 147 of file /Users/ondrej/repos/fastGPT/gpt2.f90
Fortran runtime warning: An array temporary was created for argument 'q' of procedure 'attention'
At line 148 of file /Users/ondrej/repos/fastGPT/gpt2.f90
Fortran runtime warning: An array temporary was created for argument 'k' of procedure 'attention'
At line 149 of file /Users/ondrej/repos/fastGPT/gpt2.f90
Fortran runtime warning: An array temporary was created for argument 'v' of procedure 'attention'
 how they would be able to do so.

"I think that the most important thing is
    done. Time:   0.331s (1.01x)
Output tokens:
   703   484   561   307  1498   284   466   523    13   198   198     1    40   892   326   262   749  1593  1517   318
Decoded output as text:
 how they would be able to do so.

"I think that the most important thing is

It works for UTF-8 input as well.
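Regarding the "(Currently we use O(n) vocabulary lookup instead of O(1) -> very SLOW)" note in the output above: one simple way to cut that cost is to sort the vocabulary strings once at load time and binary-search them per lookup, which is O(log n) rather than O(1) but already far cheaper than a linear scan. Below is a minimal sketch with a toy, already-sorted vocabulary and hypothetical names, not fastGPT's actual tables.

```fortran
! Minimal sketch: binary search over a sorted copy of the vocabulary
! instead of a linear scan. Toy, hard-coded vocabulary; hypothetical
! names; not fastGPT's actual data structures.
program vocab_lookup_sketch
implicit none
! In the real tokenizer the decoder strings would be sorted once after
! loading the model, keeping their original token ids alongside.
character(len=8), parameter :: vocab(5) = &
    [character(len=8) :: "ized", "that", "theor", "w", "zz"]
print *, lookup("theor")   ! prints 3
print *, lookup("missing") ! prints 0 (not found)
contains

! Index of s in the sorted vocab, or 0 if it is not present.
integer function lookup(s) result(idx)
character(len=*), intent(in) :: s
integer :: lo, hi, mid
idx = 0
lo = 1; hi = size(vocab)
do while (lo <= hi)
    mid = (lo + hi) / 2
    if (trim(vocab(mid)) == s) then
        idx = mid
        return
    else if (llt(trim(vocab(mid)), s)) then
        lo = mid + 1
    else
        hi = mid - 1
    end if
end do
end function

end program
```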