-
-
At least for debugging purposes, it would be useful to be able to query the size of the dynamic shared memory from within device code.
In CUDA this can be done with some inline PTX (see http…
-
The main reason for this request is to improve error handling. When using , CUB currently has to call [cudaPeekAtLastError](https://github.com/NVIDIA/cub/blob/866c576c118ae036fb5c2759ba1e5997967e817c/…
-
In the code at [this link](https://github.com/Dao-AILab/flash-attention/blob/main/csrc/flash_attn/src/flash_fwd_kernel.h#L180), the line reads:
```
Tensor tOrVt = thr_mma.partition_fragment_B(sVtNoS…
```
-
The L20 is modified from the H100 architecture and also has FP8 capability. Does flash-attn3 support it?
-
Hello,
I am trying to map 10K sequences of 10K bases each [(input)](https://drive.google.com/file/d/1m_uoL-0ICD2b8uDWjsIZhHoG9f3hFcBZ/view?usp=sharing).
I execute like this: ./bwa-mem2 mem -t 1 pre…
-
Is the font big enough? Maybe we should maintain a version that's about as big as GohuFont 14?
Preview: https://0x0.st/sMEm.png
-
# Summary
GpuIndexIVFScalarQuantizer with scalar quantizers that require shared memory on the GPU doesn't seem to work for k >= 1024 in Faiss 1.7.4. See the small reproduction script at the bottom.
…
-
Environment
If applicable, please include the following:
CPU architecture: x86_64
GPU properties
GPU name: NVIDIA A10
Clock frequencies used: None
Libraries
TensorRT branch: 9.0.0
TensorRT LLM: 0.1.3…
-
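- Each segment of a process's virtual address space is a VMA
- The several possible causes of a page fault
- How much memory a process uses and how to count it sensibly: VSS, RSS, PSS, USS, and the smem tool
- A first look at GDB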
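
As a rough illustration of how the four memory figures in the list relate to per-VMA counters, here is a minimal sketch that sums the fields Linux reports in `/proc/<pid>/smaps`. It assumes the usual field names (`Size`, `Rss`, `Pss`, `Private_Clean`, `Private_Dirty`) and treats USS as private clean plus private dirty pages. The function name `summarize_smaps` and the two sample VMAs are made up for the example and are not taken from the smem tool.

```python
import re

# Matches counter lines such as "Rss:                 12 kB".
# VMA header lines ("00400000-00452000 r-xp ...") do not match.
_FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")


def summarize_smaps(text):
    """Sum per-VMA counters from smaps-formatted text; results are in kB."""
    totals = {"Size": 0, "Rss": 0, "Pss": 0, "Private_Clean": 0, "Private_Dirty": 0}
    for line in text.splitlines():
        match = _FIELD.match(line.strip())
        if match and match.group(1) in totals:
            totals[match.group(1)] += int(match.group(2))
    return {
        "vss": totals["Size"],  # all mapped virtual memory
        "rss": totals["Rss"],   # resident pages, shared pages counted in full
        "pss": totals["Pss"],   # resident pages, shared pages split among users
        "uss": totals["Private_Clean"] + totals["Private_Dirty"],  # private only
    }


# Two hypothetical VMAs, used only to illustrate the arithmetic.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
Size:                328 kB
Rss:                 200 kB
Pss:                 100 kB
Shared_Clean:        200 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
7ffd1000-7ffd3000 rw-p 00000000 00:00 0 [stack]
Size:                  8 kB
Rss:                   8 kB
Pss:                   8 kB
Shared_Clean:          0 kB
Private_Clean:         0 kB
Private_Dirty:         8 kB
"""

summary = summarize_smaps(SAMPLE)
print(summary)  # {'vss': 336, 'rss': 208, 'pss': 108, 'uss': 8}
```

On a Linux machine the same function can be pointed at a live process, for example `summarize_smaps(open("/proc/self/smaps").read())`. In the sample, the shared code mapping contributes 200 kB to RSS but only 100 kB to PSS and nothing to USS, which is why the four figures differ.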