We noticed that tensor copy-out (device → CPU) in PyTorch takes much longer than using the Neuron Runtime native API (`nrt_tensor_write`, `nrt_tensor_read`).
The key problem is that on read, the `XLATensor::ToTensor` routine first copies into a so-called "literal" and only then into a CPU tensor, so the data makes two hops instead of one. Copy-out through FAL needs to be improved.
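To illustrate why the intermediate "literal" hurts, here is a minimal host-side sketch of the two copy paths. This is a NumPy stand-in, not the actual torch-xla or Neuron Runtime code; `copy_via_literal` and `copy_direct` are hypothetical names for the two strategies:

```python
import time
import numpy as np

N = 1 << 22  # ~4M float32 elements (16 MiB), stand-in for a device tensor
src = np.random.rand(N).astype(np.float32)  # pretend this is device memory

def copy_via_literal(buf):
    # Mimics the XLATensor::ToTensor path: device buffer -> intermediate
    # "literal" -> CPU tensor (two full copies of the data).
    literal = buf.copy()   # first hop: into the literal
    return literal.copy()  # second hop: literal into the CPU tensor

def copy_direct(buf):
    # Mimics a direct read (as with nrt_tensor_read): one copy to the host.
    return buf.copy()

t0 = time.perf_counter()
a = copy_via_literal(src)
t1 = time.perf_counter()
b = copy_direct(src)
t2 = time.perf_counter()

# Both paths produce identical data; the two-hop path just moves it twice.
assert np.array_equal(a, b)
print(f"two-hop: {t1 - t0:.4f}s, direct: {t2 - t1:.4f}s")
```

The extra hop doubles the bytes moved on the host side, which matches the gap we observed between PyTorch copy-out and the native `nrt_tensor_read` path.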
See attachment: