-
Currently, the nvq++ driver is a skeletal bash script that runs the various components comprising the logical, piecewise steps of an nvq++ compilation. The bash script is very easy to update an…
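To make the shape of that pipeline concrete, here is a minimal Python sketch of the piecewise-driver pattern such a script follows; every tool name, input file, and flag below is hypothetical, not an actual nvq++ component:

```python
import subprocess
import sys

# Purely illustrative stage list: each entry is one external tool invocation,
# mirroring the "piecewise steps" style of the current bash driver.
# Tool names, inputs, and flags are hypothetical, not real nvq++ components.
STAGES = [
    ["frontend-tool", "kernel.cpp", "-o", "kernel.ir"],
    ["optimizer-tool", "kernel.ir", "-o", "kernel.opt.ir"],
    ["codegen-tool", "kernel.opt.ir", "-o", "kernel.o"],
    ["host-linker", "kernel.o", "-o", "a.out"],
]

def drive() -> None:
    for cmd in STAGES:
        print("+", " ".join(cmd))
        proc = subprocess.run(cmd)
        if proc.returncode != 0:
            # Stop the pipeline at the first failing stage.
            sys.exit(proc.returncode)

if __name__ == "__main__":
    drive()
```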
-
```
nvcc -g -std=c++11 -I`python -c "import tensorflow; print(tensorflow.sysconfig.get_include())"` -I"/usr/local/cuda-8.0/include" -DGOOGLE_CUDA=1 -D_MWAITXINTRIN_H_INCLUDED -D_FORCE_INLINES -D__STRICT_A…
```
-
Hi!
In your paper, you mentioned that including text-only data in training is crucial for maintaining language abilities. I'm currently performing full fine-tuning using LLaMA Factory, and I'm enc…
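To make the question concrete, here is a rough sketch of the kind of text-only mixing I mean, written with the Hugging Face datasets library purely for illustration (this is not LLaMA Factory's own configuration); the dataset names and the mixing ratio are placeholders:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset names -- substitute the actual fine-tuning set and a text-only corpus.
sft_data = load_dataset("my-org/multimodal-sft", split="train")      # hypothetical
text_only = load_dataset("my-org/plain-text-corpus", split="train")  # hypothetical

# Interleave so that roughly 20% of training samples are text-only; the ratio here is
# only a placeholder for whatever mix the paper's recipe actually uses.
# (Both datasets need compatible columns for interleaving to work.)
mixed = interleave_datasets(
    [sft_data, text_only],
    probabilities=[0.8, 0.2],
    seed=42,
    stopping_strategy="first_exhausted",
)
```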
-
Is "sycl::complex" mentioned in the SYCL specification ? Which type is recommended for Intel, AMD, and NVIDIA GPUs ? Thanks.
```
no template named 'complex' in namespace 'sycl'; did you mean 'std:…
```
-
**Which documentation should be updated?**
How to implement custom operators should be documented, and the documentation should address things like:
1. How to support broadcasting the operation over …
-
Hi Andrej, this implementation is fantastic!
In your view, what would be the main design trade-offs if one were to re-implement, in modern C++, the C code that is intended to run on the CPU? By moder…
-
I have successfully built the selective_scan_cuda function. However, when I call the function, I encounter the following error. Based on the information I found online, it appears that my GPU is too o…
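As a quick first check (a generic sketch, not guidance from the project itself), comparing the GPU's compute capability against the CUDA architectures the build targets can confirm whether the card is actually too old:

```python
import torch

# Report the GPU's compute capability, e.g. (7, 0) for V100 or (8, 6) for RTX 30-series.
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")

# CUDA architectures this PyTorch build was compiled for; a custom extension such as
# selective_scan_cuda may target its own (often narrower) list set at build time.
print("PyTorch built for:", torch.cuda.get_arch_list())
```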
-
### Is this a duplicate?
- [X] I confirmed there appear to be no [duplicate issues](https://github.com/NVIDIA/cccl/issues) for this request and that I agree to the [Code of Conduct](CODE_OF_CONDUCT…
-
### Feature request
The current flash attention 2 integration is sub-optimal in performance because it requires unpadding and padding the activations on **each** layer. For example in llama impleme…
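For illustration, here is a minimal sketch of the unpad/pad round trip whose cost is currently paid in every layer; the helper names (`unpad`, `pad`) are hypothetical, not the actual transformers or flash-attn utilities. Unpadding once before the decoder stack and re-padding once after it is the optimization being requested.

```python
import torch
import torch.nn.functional as F

# Hypothetical helpers sketching the unpad/pad round trip; these are not the actual
# transformers / flash-attn utilities, just an illustration of the work repeated per layer.

def unpad(hidden, attention_mask):
    """(batch, seqlen, dim) -> (total_tokens, dim), keeping only non-padding tokens."""
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
    return hidden.reshape(-1, hidden.shape[-1])[indices], indices, cu_seqlens

def pad(unpadded, indices, batch, seqlen):
    """Scatter (total_tokens, dim) back into a zero-padded (batch, seqlen, dim) tensor."""
    out = unpadded.new_zeros(batch * seqlen, unpadded.shape[-1])
    out[indices] = unpadded
    return out.reshape(batch, seqlen, -1)

# In the current integration something equivalent to unpad()/pad() runs inside every
# decoder layer; unpadding once before the whole stack and padding once after it would
# avoid the repeated gather/scatter over the activations.
```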
-
I notice this project is inspired by Stream-K; how is that work decomposition done here?
I also notice that Lean Attention uses Stream-K for attention; is this supported in FlashInfer?