Closed by bssrdf 4 weeks ago
This PR gives a small performance boost to the CUDA backend's cpy op. All test cases in test-backend-ops pass.
Thanks for approving. I read somewhere that vectorized global memory reads do not improve throughput much; they help more when loading from shared memory.
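
For readers unfamiliar with the term, here is a minimal sketch of what a vectorized global memory read looks like in a copy kernel. It is an illustration only, not the code from this PR: the kernel names `copy_scalar` and `copy_vec4` are hypothetical, and it assumes contiguous `float` data in 16-byte-aligned buffers (which `cudaMalloc` provides). Whether the `float4` version is measurably faster depends on the GPU and the access pattern, so it should be benchmarked rather than assumed.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Scalar copy: each thread moves one float (a 4-byte load and store).
__global__ void copy_scalar(const float *src, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Vectorized copy: each thread moves one float4 (a 16-byte load and store).
// Assumes src and dst are 16-byte aligned.
__global__ void copy_vec4(const float *src, float *dst, int n) {
    int i  = blockIdx.x * blockDim.x + threadIdx.x;
    int n4 = n / 4;  // number of complete float4 groups
    if (i < n4) {
        reinterpret_cast<float4 *>(dst)[i] =
            reinterpret_cast<const float4 *>(src)[i];
    }
    // The first (n % 4) threads also copy the leftover tail elements.
    if (i < n % 4) dst[n4 * 4 + i] = src[n4 * 4 + i];
}

int main() {
    const int n = (1 << 20) + 3;  // not a multiple of 4, to exercise the tail
    std::vector<float> in(n), out(n, -1.0f);
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

    float *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    cudaMemcpy(src, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n / 4 + threads - 1) / threads;  // n / 4 >= 1 here
    copy_vec4<<<blocks, threads>>>(src, dst, n);
    cudaMemcpy(out.data(), dst, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify that every element, including the tail, was copied.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (out[i] != in[i]);
    printf("mismatches: %d\n", mismatches);

    cudaFree(src);
    cudaFree(dst);
    return mismatches != 0;
}
```

Swapping `copy_vec4` for `copy_scalar` (with `blocks` computed from `n` instead of `n / 4`) and timing both with `cudaEvent`s is one way to check how much the wider loads matter on a given device.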