ggerganov / ggml

Tensor library for machine learning
MIT License

increase cuda_cpy block size #996

Closed bssrdf closed 4 weeks ago

bssrdf commented 1 month ago

This PR gives a small performance boost to the CUDA backend's cpy op by increasing the block size of the copy kernel. All test cases in test-backend-ops pass.
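For readers unfamiliar with what the block size controls, here is a minimal sketch of an element-wise copy kernel whose launch block size is a parameter. It is not the ggml kernel: `copy_f32` and `run_copy` are hypothetical names made up for this illustration, and only the standard CUDA runtime API is assumed.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical element-wise copy kernel (illustration only, not ggml's code).
// Each thread copies one float.
__global__ void copy_f32(const float * src, float * dst, const int n) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i];
    }
}

// Hypothetical helper: copies n floats on the device using `block_size`
// threads per block and returns true if the result matches the input.
static bool run_copy(const int n, const int block_size) {
    if (n <= 0 || block_size <= 0) {
        return false;
    }
    std::vector<float> h_src(n), h_dst(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        h_src[i] = (float) i;
    }

    const size_t nbytes = (size_t) n * sizeof(float);
    float * d_src = nullptr;
    float * d_dst = nullptr;
    if (cudaMalloc(&d_src, nbytes) != cudaSuccess) {
        return false;
    }
    if (cudaMalloc(&d_dst, nbytes) != cudaSuccess) {
        cudaFree(d_src);
        return false;
    }
    bool ok = cudaMemcpy(d_src, h_src.data(), nbytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // The block size only changes how the same n threads are grouped:
    // a larger block means fewer blocks in the grid.
    const int num_blocks = (n + block_size - 1) / block_size;
    copy_f32<<<num_blocks, block_size>>>(d_src, d_dst, n);
    ok = ok && cudaGetLastError() == cudaSuccess;
    ok = ok && cudaMemcpy(h_dst.data(), d_dst, nbytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_src);
    cudaFree(d_dst);

    for (int i = 0; ok && i < n; ++i) {
        ok = h_dst[i] == h_src[i];
    }
    return ok;
}
```

Calling `run_copy(1 << 20, 32)` and `run_copy(1 << 20, 64)` from `main` and timing the kernel with CUDA events would be one way to compare block sizes; the result is the same either way, and any speedup depends on the GPU.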

bssrdf commented 4 weeks ago

Thanks for approving. I read somewhere that vectorized global memory reads do not improve throughput much. Vectorization helps mainly when loading from shared memory.
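To make "vectorized global memory read" concrete, here is a rough sketch of a copy kernel that moves four floats per thread through `float4`. This is an illustration under assumptions, not code from this PR: `copy_f32_vec4` and `run_copy_vec4` are hypothetical names, and the buffers are assumed to come straight from `cudaMalloc`, so they are suitably aligned for `float4` access.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical vectorized copy kernel (illustration only, not ggml's code).
// Threads [0, n4) each copy one float4 (16 bytes); the next `rem` threads
// copy the up-to-3 leftover floats one at a time.
__global__ void copy_f32_vec4(const float * src, float * dst, const int n) {
    const int i   = blockDim.x * blockIdx.x + threadIdx.x;
    const int n4  = n / 4;
    const int rem = n % 4;
    if (i < n4) {
        // Assumes src and dst are 16-byte aligned (true for cudaMalloc pointers).
        reinterpret_cast<float4 *>(dst)[i] = reinterpret_cast<const float4 *>(src)[i];
    } else if (i < n4 + rem) {
        const int j = 4 * n4 + (i - n4);
        dst[j] = src[j];
    }
}

// Hypothetical helper: runs the vectorized copy and verifies the result.
static bool run_copy_vec4(const int n, const int block_size) {
    if (n <= 0 || block_size <= 0) {
        return false;
    }
    std::vector<float> h_src(n), h_dst(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        h_src[i] = (float) i;
    }

    const size_t nbytes = (size_t) n * sizeof(float);
    float * d_src = nullptr;
    float * d_dst = nullptr;
    if (cudaMalloc(&d_src, nbytes) != cudaSuccess) {
        return false;
    }
    if (cudaMalloc(&d_dst, nbytes) != cudaSuccess) {
        cudaFree(d_src);
        return false;
    }
    bool ok = cudaMemcpy(d_src, h_src.data(), nbytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // One thread per float4 plus one thread per leftover float.
    const int num_threads = n / 4 + n % 4;
    const int num_blocks  = (num_threads + block_size - 1) / block_size;
    copy_f32_vec4<<<num_blocks, block_size>>>(d_src, d_dst, n);
    ok = ok && cudaGetLastError() == cudaSuccess;
    ok = ok && cudaMemcpy(h_dst.data(), d_dst, nbytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_src);
    cudaFree(d_dst);

    for (int i = 0; ok && i < n; ++i) {
        ok = h_dst[i] == h_src[i];
    }
    return ok;
}
```

Whether this beats a scalar copy is exactly the open question in the comment above; the sketch only checks correctness, so any throughput claim would need to be measured on the target GPU.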