Issue 1 should be solved by checking if the `op` is `GGML_OP_NONE`: this indicates that the tensor is not the result of some operator.

In theory, it can be an optimization parameter (this is not relevant for inference, but we should check anyway). A tensor becomes an optimization parameter by marking it with the `ggml_set_param()` function: in this case, the `grad` member becomes non-`NULL`.

Therefore, the CUDA code should check `tensor->op == GGML_OP_NONE && tensor->grad == NULL` to determine if the tensor is a constant.
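As a minimal sketch, the check could look like this (the helper name is hypothetical; `op` and `grad` are existing `ggml_tensor` fields):

```c
#include <stdbool.h>

#include "ggml.h"

// a tensor is a constant if it is not produced by any op and is not an
// optimization parameter (grad is only set by ggml_set_param())
static bool ggml_cuda_tensor_is_constant(const struct ggml_tensor * t) {
    return t->op == GGML_OP_NONE && t->grad == NULL;
}
```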
Will think more about issue 2
> Issue 1 should be solved by checking if the `op` is `GGML_OP_NONE`

I think this would work for llama.cpp, but it may fail with tensors that are outside of the graph but not constant. For example, the k/v cache. In our case it should still work, because the k/v tensors go through some view/reshape/permute operations, so technically they aren't interpreted as constants by the time they reach the mat mul. But I think it wouldn't be reliable in every case.
> In theory, it can be an optimization parameter (this is not relevant for inference, but we should check anyway). A tensor becomes an optimization parameter by marking it with the `ggml_set_param()` function

This sounds exactly like what we need, but from what I see, `ggml_set_param` duplicates the tensor in `tensor->grad`. Wouldn't this double memory usage?
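For reference, roughly what `ggml_set_param()` does, paraphrased (see `ggml.c` for the actual implementation):

```c
#include "ggml.h"

void ggml_set_param(struct ggml_context * ctx, struct ggml_tensor * tensor) {
    GGML_ASSERT(tensor->grad == NULL);
    // the gradient is allocated as a full duplicate of the tensor, which is
    // why using it just to tag input tensors would double the memory usage
    tensor->grad = ggml_dup_tensor(ctx, tensor);
}
```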
> For example, the k/v cache

Good point. We can add an enum value `GGML_NOP` or `GGML_OP_INP` or `GGML_OP_VAR` (not sure what is the best name), and you would set the `op` member of such tensors to that, just to mark that its value can change externally. But we probably need some better interface for that, so the developer does not forget to do it.
`ggml_set_param` is not suitable because, as you noted, it creates the `grad` tensor, which doubles the memory. Maybe we can instead introduce `ggml_set_var()` or `ggml_set_inp()` that just sets the `op` member to that new enum value.
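A rough sketch of that variant, using a stand-in model since neither the enum value nor the setter exists in ggml (all names are the undecided candidates from above):

```c
// stand-in model, not real ggml: a new op value tags tensors whose data can
// change externally, instead of allocating a grad like ggml_set_param() does
enum op {
    OP_NONE = 0, // GGML_OP_NONE in ggml
    OP_VAR,      // proposed: value can change externally (e.g. the k/v cache)
};

struct tensor {
    enum op op;
};

static void set_var(struct tensor * t) {
    t->op = OP_VAR;
}
```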
An alternative solution is adding a `bool is_var` flag to the tensor instead of the enum value. This has the advantage of not having to handle the new enum value in the graph functions. The GPU code can then check `tensor->op == GGML_OP_NONE && tensor->grad == NULL && !tensor->is_var` to determine if the tensor is constant. And we can also update `ggml_set_param()` to also set `tensor->is_var = true`, so the GPU check would become just `tensor->op == GGML_OP_NONE && !tensor->is_var`.
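Extending the stand-in model with the proposed flag (again, `is_var` and the check are only proposals, not existing ggml code):

```c
#include <stdbool.h>

enum op { OP_NONE = 0, OP_VAR }; // as in the stand-in model above

struct tensor {
    enum op         op;
    struct tensor * grad;   // set by ggml_set_param()
    bool            is_var; // proposed: data may be modified outside the graph
};

// the GPU-side constant check, assuming ggml_set_param() also sets
// is_var = true for optimization parameters
static bool is_constant(const struct tensor * t) {
    return t->op == OP_NONE && !t->is_var;
}
```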
Regarding issue 2:

The `is_dirty` flag + update of all `ggml` functions does not seem like a good solution. It is indeed intrusive and error-prone.

I am thinking about adding the `is_dirty` flag to the tensor, but requiring the dev to set it manually for each tensor (e.g. `ggml_set_dirty()`). The flags will be cleared on every `ggml_graph_compute()` call by the main implementation. The GPU code will make use of this as you specified. This will allow more fine-grained control compared to the global flag, where you would have to update all tensors even if only one has changed.
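A sketch of the proposed pieces, assuming an `is_dirty` field is added to `ggml_tensor` (neither the field nor the functions below exist in ggml):

```c
#include <stdbool.h>

#include "ggml.h"

// the application marks tensors it modified directly (proposed API)
static void ggml_set_dirty(struct ggml_tensor * t) {
    t->is_dirty = true; // proposed new field
}

// would run at the end of ggml_graph_compute(): clear all flags so that the
// GPU code sees each modification exactly once
static void ggml_graph_clear_dirty(struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_leafs; i++) {
        cgraph->leafs[i]->is_dirty = false;
    }
    for (int i = 0; i < cgraph->n_nodes; i++) {
        cgraph->nodes[i]->is_dirty = false;
    }
}
```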
What if we set `is_dirty` automatically in all the `dst` tensors in `ggml_graph_compute` or `ggml_compute_forward`? This may fail with tensors that reuse the data of a different tensor, like views or reshapes, but that could be solved by adding a `ggml_tensor * owner` field that would be set by these operations.
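Under the same assumption, plus the proposed `owner` field (neither exists in `ggml_tensor`), the automatic variant could look like:

```c
// would run for every op in ggml_compute_forward(): mark the result dirty
// and propagate to the tensor whose data it aliases (views, reshapes, ...)
static void ggml_mark_dst_dirty(struct ggml_tensor * dst) {
    dst->is_dirty = true;
    if (dst->owner != NULL) {
        dst->owner->is_dirty = true;
    }
}
```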
Could work, but you will still need `ggml_set_dirty()`, because it is likely that we will sometimes want to modify a tensor by writing to its `data` directly, instead of going through `ggml` ops.

Also, it is not obvious when you will clear the flag. You cannot clear it at the end of `ggml_graph_compute()`, because some tensor could have just become dirty. You cannot clear the `src` tensors' dirty flags in `ggml_compute_forward()`, because they might be inputs to later ops. So I think we will need to maintain a separate set of "new dirty" flags for each compute.
The flag would be cleared in the GPU code after uploading the weights to VRAM. The logic would be something like this:

```c
if (!cached || t->is_dirty || (t->owner && t->owner->is_dirty)) {
    upload_tensor(t);
    t->is_dirty = false;
}
```
The `owner` field could probably be avoided by traversing the sources of a tensor, along with their sources, and so on. The logic is that if a tensor is the result of an operation involving a dirty tensor, then it is also dirty.
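A sketch of that traversal, assuming the proposed `is_dirty` field and that the inputs of an op are exposed as a `src[]` array on `ggml_tensor`:

```c
#include <stdbool.h>
#include <stddef.h>

#include "ggml.h"

// a tensor needs re-uploading if it, or anything it was computed from, is
// dirty; this recursion would make the `owner` field unnecessary
static bool ggml_tensor_is_dirty_rec(const struct ggml_tensor * t) {
    if (t->is_dirty) {
        return true;
    }
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (t->src[i] != NULL && ggml_tensor_is_dirty_rec(t->src[i])) {
            return true;
        }
    }
    return false;
}
```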
Let's start with manually setting `is_dirty` via the `ggml` API and clearing it after `ggml_graph_compute()`. The automatic `is_dirty` detection could be added after that, but at the moment I feel it is an unnecessary complication. Will think about it more though.
Could it be done at a higher level, from llama.cpp, just like it manages the scratch buffers or the KV cache? It also knows exactly in what order the weights are used, so it could start loading the next layer's weights ahead of time, or even do that at the end: load the first ones again for the next evaluation.
llama.cpp builds a graph which is then executed by ggml; it is not synchronous, so these instructions to "prefetch" the next layer's weights would have to be in the graph, which would require adding GPU-specific ops to ggml. Eventually, I think the fastest option would be hybrid GPU/CPU inference, in which some layers are processed entirely on the GPU (as much as the available VRAM allows) and the rest on the CPU. But currently that would have to live in a fork.
It could be useful to prefetch data in other environments as well: on low-RAM devices, fetch from disk; with low disk space, fetch from the network or from a database. It could also be used to dequantize some formats where the whole tensor has to be processed instead of one row at a time. It's just a thought I had.
I think it's a good idea; I am just not sure if it would fit in ggml as long as the goal is to keep GPU stuff separated.
Currently, the weights are uploaded to VRAM every time that they are used. This usually represents ~30% of the time spent in most mat muls with cuBLAS.

This could be improved by keeping the weights in VRAM; however, there are two issues blocking this:

1. Knowing which tensors are constant weights that can be kept in VRAM.
2. Knowing when the data of a tensor has changed, so that it can be re-uploaded.
Issue 1 can be solved in a not very clean way with #1268 by looking at the matrices whose names end in `.weight`. We could also add some kind of `is_weight` flag to `ggml_tensor`, but that would be more intrusive.
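For illustration, the name-based check could look like this (the helper is hypothetical; tensor names come from #1268):

```c
#include <stdbool.h>
#include <string.h>

#include "ggml.h"

// heuristic: the constant weights are the tensors whose names end in ".weight"
static bool ggml_tensor_is_weight(const struct ggml_tensor * t) {
    const size_t len = strlen(t->name);
    return len >= 7 && strcmp(t->name + len - 7, ".weight") == 0;
}
```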
Issue 2 is more complicated. We would need either to add some kind of `is_dirty` flag to `ggml_tensor` that would be set automatically by the operations that modify a tensor (such as `ggml_cpy` and the `_inplace` ops), or to add a global flag to `ggml-cuda` that would trigger a full weight re-upload, which would need to be set by the application when the weights change.

The `is_dirty` flag would be more intrusive to ggml (essentially we would be adding GPU-specific details to ggml), but it would be automatic for downstream users. The global CUDA-only flag would be less intrusive to ggml, but would force users to deal with this manually.

Any thoughts about this? What would be the best way to add this to ggml, while not interfering with the general goal of not adding GPU-specific details to ggml?