increase cuda_cpy block size

ggerganov / ggml

Tensor library for machine learning

MIT License


Closed: bssrdf closed this 4 weeks ago

bssrdf commented 1 month ago

This PR increases the block size used by the CUDA backend's cpy op, which gives a small performance boost. All test cases in test-backend-ops pass.
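For readers unfamiliar with how block size enters a launch configuration, here is a minimal host-side sketch of the usual arithmetic for an elementwise copy. The helper names `num_blocks_for` and `idle_threads` are hypothetical and not taken from ggml, and the block sizes in the usage note are illustrative values, not necessarily the ones changed in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of ggml): number of thread blocks needed so
// that each of `ne` elements is handled by one thread, using ceiling division.
inline int64_t num_blocks_for(int64_t ne, int64_t block_size) {
    assert(block_size > 0);
    return (ne + block_size - 1) / block_size;
}

// Hypothetical helper: threads that are launched in the last, partially
// filled block but have no element to copy.
inline int64_t idle_threads(int64_t ne, int64_t block_size) {
    return num_blocks_for(ne, block_size) * block_size - ne;
}
```

For example, copying 1000 elements takes `num_blocks_for(1000, 32)` = 32 blocks with 32 threads per block, but only `num_blocks_for(1000, 64)` = 16 blocks with 64 threads per block. Fewer, larger blocks may reduce scheduling overhead, but the real effect depends on the GPU and the kernel, so it has to be measured.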

bssrdf commented 4 weeks ago

Thanks for approving. I read somewhere that vectorized reads from global memory do not improve throughput much, and that vectorization helps more when loading from shared memory.