ReaLLMASIC / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Add embedding table factorization #179

Open klei22 opened 2 months ago

klei22 commented 2 months ago

This will be important for the next stage of vector inputs.

Even now, this has an advantage when reusing existing tokenizers: it makes the n_embd size and the vocab table size independent.

This is accomplished by factorizing the vocab table into a smaller-dimension vocab table and an expansion nn.Linear (e.g. a 50k × 64 table multiplied by a 64 × 384 matrix for nanoGPT); a sketch follows below.

The process is reversed on the output side, which greatly reduces the parameter count for smaller models.
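For concreteness, here is a minimal PyTorch sketch of both directions. This is not the code in this PR; the class and parameter names (e.g. `factor_dim`) are made up for illustration:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Sketch of the input-side factorization: a (vocab_size x factor_dim)
    lookup table followed by a learned (factor_dim x n_embd) expansion,
    instead of a full (vocab_size x n_embd) table."""

    def __init__(self, vocab_size=50257, n_embd=384, factor_dim=64):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, factor_dim)          # small vocab table
        self.expand = nn.Linear(factor_dim, n_embd, bias=False)  # expansion matrix

    def forward(self, idx):
        # idx: (batch, seq) token ids -> (batch, seq, n_embd) embeddings
        return self.expand(self.wte(idx))

class FactorizedLMHead(nn.Module):
    """Sketch of the reversed process on the output side: project n_embd
    back down to factor_dim, then score against the small vocab table."""

    def __init__(self, vocab_size=50257, n_embd=384, factor_dim=64):
        super().__init__()
        self.reduce = nn.Linear(n_embd, factor_dim, bias=False)
        self.head = nn.Linear(factor_dim, vocab_size, bias=False)

    def forward(self, x):
        # x: (batch, seq, n_embd) -> (batch, seq, vocab_size) logits
        return self.head(self.reduce(x))
```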

Though the reduction comes at some cost in validation loss, it provides an avenue to pursue larger tokenizers (e.g. the 100k tiktoken or 256k Gemma vocabularies) with smaller-sized LLMs.
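A back-of-the-envelope parameter count (assuming n_embd=384 and a factor dimension of 64, as in the example above) shows the scale of the savings:

```python
def embedding_params(vocab_size, n_embd, factor_dim=None):
    """Parameter count for the input-side embedding table.
    With factor_dim set, counts the factorized version:
    a vocab_size*factor_dim table plus a factor_dim*n_embd expansion."""
    if factor_dim is None:
        return vocab_size * n_embd
    return vocab_size * factor_dim + factor_dim * n_embd

# With n_embd=384 and factor_dim=64:
#   GPT-2 vocab (50,257):  full ~19.3M  vs factorized ~3.2M params
#   100k tiktoken vocab:   full ~38.4M  vs factorized ~6.4M
#   256k Gemma vocab:      full ~98.3M  vs factorized ~16.4M
for v in (50257, 100_000, 256_000):
    print(v, embedding_params(v, 384), embedding_params(v, 384, 64))
```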

Decoupling the vocab table size from the embedding vector size has another advantage for vector tokenization: we can provide a feature-engineered vocab table and expand it to n_embd with the expansion matrix.

This means we can use a small feature-engineered vector (say 10 to 20 dimensions) for CSV values and other time-series values, while keeping a much larger embedding vector size for the main network (see the sketch below).

This prevents the model from being bottlenecked in expressiveness (certain papers have found the embedding vector size coming out of the final decoder block to be such a bottleneck).

Otherwise, 10 engineered features in a vector approach to tokenization would yield an embedding vector size of only 10, which would severely bottleneck the model's capability.
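As a rough sketch of the vector-input idea (all names hypothetical; not the interface in this PR):

```python
import torch
import torch.nn as nn

# A hand-engineered feature vector (say 10 dims, e.g. scaled value,
# deltas, rolling statistics for a CSV/time-series input) is expanded
# to the model width, so the transformer's residual stream is not
# constrained to 10 dimensions.
feature_dim, n_embd = 10, 384
expand = nn.Linear(feature_dim, n_embd, bias=False)

features = torch.randn(1, 128, feature_dim)  # (batch, seq, engineered features)
x = expand(features)                         # (1, 128, 384) -> into the decoder blocks
```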

klei22 commented 2 months ago

This is still in draft status; before merging, the factorization will need to be made optional and to gracefully coexist with other settings.