We noticed that tensor copy-out (device → CPU) in PyTorch takes much longer than using the Neuron Runtime native API (`nrt_tensor_write`, `nrt_tensor_read`).
The key problem is that on read, the `XLATensor::ToTensor` routine first copies into a so-called "literal" and only then into a CPU tensor, so the data makes two hops instead of one. Copy-out through FAL needs to be improved.
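To illustrate why the intermediate "literal" hurts, here is a minimal host-side sketch of the two copy paths. This is a NumPy stand-in, not the actual torch-xla or Neuron Runtime code; `copy_via_literal` and `copy_direct` are hypothetical names for the two strategies:

```python
import time
import numpy as np

N = 1 << 22  # ~4M float32 elements (16 MiB), stand-in for a device tensor
src = np.random.rand(N).astype(np.float32)  # pretend this is device memory

def copy_via_literal(buf):
    # Mimics the XLATensor::ToTensor path: device buffer -> intermediate
    # "literal" -> CPU tensor (two full copies of the data).
    literal = buf.copy()   # first hop: into the literal
    return literal.copy()  # second hop: literal into the CPU tensor

def copy_direct(buf):
    # Mimics a direct read (as with nrt_tensor_read): one copy to the host.
    return buf.copy()

t0 = time.perf_counter()
a = copy_via_literal(src)
t1 = time.perf_counter()
b = copy_direct(src)
t2 = time.perf_counter()

# Both paths produce identical data; the two-hop path just moves it twice.
assert np.array_equal(a, b)
print(f"two-hop: {t1 - t0:.4f}s, direct: {t2 - t1:.4f}s")
```

The extra hop doubles the bytes moved on the host side, which matches the gap we observed between PyTorch copy-out and the native `nrt_tensor_read` path.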
See attachment: