Open | mortvest opened 3 years ago
Can we amortize the data transfer overheads (host -> device and device -> host) by overlapping them with kernel execution when running with multiple chunks, using multithreading?
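To make the question concrete, here is a minimal sketch of the kind of overlap being asked about. It is a host-side simulation only: `upload`, `kernel`, and `download` are hypothetical stand-ins (implemented with `time.sleep`) for a host -> device copy, a kernel launch, and a device -> host copy, and none of these names come from the project. Each stage runs in its own thread, so chunk i+1 can be uploaded while chunk i is being computed.

```python
import queue
import threading
import time

STAGE_COST = 0.01  # assumed per-stage cost in seconds, purely illustrative


def upload(chunk):
    """Hypothetical stand-in for a host -> device copy."""
    time.sleep(STAGE_COST)
    return chunk


def kernel(chunk):
    """Hypothetical stand-in for kernel execution (doubles each element)."""
    time.sleep(STAGE_COST)
    return [x * 2 for x in chunk]


def download(chunk):
    """Hypothetical stand-in for a device -> host copy."""
    time.sleep(STAGE_COST)
    return chunk


_DONE = object()  # sentinel that tells each stage to shut down


def _stage(fn, inbox, outbox):
    # Pull items until the sentinel arrives, then forward it downstream.
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(item))


def run_sequential(chunks):
    # Baseline: every chunk pays upload + kernel + download back to back.
    return [download(kernel(upload(c))) for c in chunks]


def run_pipelined(chunks):
    # Input and output queues are unbounded so the main thread never blocks;
    # the two middle queues hold one chunk each, like a small staging buffer.
    q_in, q_dev, q_out, q_res = (
        queue.Queue(), queue.Queue(maxsize=1),
        queue.Queue(maxsize=1), queue.Queue(),
    )
    stages = [(upload, q_in, q_dev), (kernel, q_dev, q_out),
              (download, q_out, q_res)]
    threads = [threading.Thread(target=_stage, args=s) for s in stages]
    for t in threads:
        t.start()
    for c in chunks:
        q_in.put(c)
    q_in.put(_DONE)

    results = []
    while True:
        item = q_res.get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results  # same order as the input, since each stage is FIFO
```

In this model, with N chunks and three stages of equal cost t, the sequential version takes roughly 3·N·t while the pipelined version takes roughly (N + 2)·t. Whether a real backend sees a similar gain depends on the device being able to copy and compute concurrently and on the transfers being issued asynchronously, which is exactly what this issue is asking about.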