noanti opened this issue 1 year ago
@CStanKonrad Is there a practical example of using external memory?
Regarding the question: the suggested kNN implementation retrieves, for each query in a memory layer, the k best-matching keys from the memory cache. In the 3B model there are 3 memory layers, each with 32 heads, which gives 96 retrievals per token. In general, we recommend the brute-force approach (full attention, no kNN; an example of this approach is implemented in this repository) for memories that fit on the GPU. However, if you want to use Faiss, you will need to tune the index manually (note that the faster Faiss indexes have a training stage and let you trade off speed against retrieval accuracy). We currently do not provide practical examples with Faiss.
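To make the per-head lookup concrete, here is a minimal sketch of what a Faiss-backed memory could look like. It is not part of this repository; the head dimension, `K`, and the helper names are illustrative assumptions, and `IndexFlatIP` is the exact (brute-force) index, so a faster trained index would need its own tuning.

```python
# Illustrative sketch only: not the repository's actual memory API.
import numpy as np
import faiss

N_MEM_LAYERS = 3   # memory attention layers in the 3B model
N_HEADS = 32       # heads per memory layer
HEAD_DIM = 128     # assumed per-head key dimension (illustrative)
K = 16             # assumed number of neighbours retrieved per query

# One index per (layer, head); IndexFlatIP performs exact inner-product search.
# Faster indexes (e.g. IndexIVFFlat) additionally require a training stage.
indexes = [[faiss.IndexFlatIP(HEAD_DIM) for _ in range(N_HEADS)]
           for _ in range(N_MEM_LAYERS)]

def add_to_memory(layer: int, head: int, keys: np.ndarray) -> None:
    """Append cached keys (shape [n, HEAD_DIM]) for one head to its index."""
    indexes[layer][head].add(np.ascontiguousarray(keys, dtype=np.float32))

def retrieve(layer: int, head: int, query: np.ndarray):
    """Return (scores, ids) of the K best-matching cached keys for one query."""
    q = np.ascontiguousarray(query.reshape(1, -1), dtype=np.float32)
    return indexes[layer][head].search(q, K)

# Per generated token this amounts to N_MEM_LAYERS * N_HEADS = 96 searches.
```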
Example times obtained on a 40GB A100 GPU with bfloat16 precision, using code from this repository (populating the memory cache takes around 17s in this case):

- process 64k tokens, then generate 100 tokens: ~23s
- process 64k tokens, then generate 200 tokens: ~29s
- process 64k tokens, then generate 300 tokens: ~36s
- process 64k tokens, then generate 400 tokens: ~43s
- process 64k tokens, then generate 500 tokens: ~50s

So with 64k tokens in memory, generating one token takes <= 0.07s (note that if you generate a lot, this time will increase, since the memory grows during generation).
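A quick back-of-the-envelope check of that per-token figure, using differences between consecutive runs from the list above (this avoids any assumption about what the ~17s cache-population time covers):

```python
# Measured totals from the timings above: generated tokens -> seconds.
totals = {100: 23, 200: 29, 300: 36, 400: 43, 500: 50}

points = sorted(totals.items())
for (n0, t0), (n1, t1) in zip(points, points[1:]):
    # Incremental cost of the extra generated tokens between two runs.
    print(f"{n0}->{n1} tokens: ~{(t1 - t0) / (n1 - n0):.2f} s per generated token")
# Prints roughly 0.06-0.07 s per token, consistent with the <= 0.07s estimate.
```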
Got it, thanks!
If I use Faiss as the memory, then during inference, calculating each token requires 3 kNN searches (because there are 3 memory attention layers), right? Will the generation speed become very slow?