I noticed that the current `try_from` implementation for converting ndarrays into tensors copies the underlying data. This makes ndarray --> tensor interoperability $O(n)$; with a zero-copy solution we could bring it down to $O(1)$. For reference, the current implementation is:
```rust
// tensor/convert.rs
impl<T, D> TryFrom<ndarray::ArrayBase<T, D>> for Tensor
where
    T: ndarray::Data,
    T::Elem: Element,
    D: ndarray::Dimension,
{
    type Error = TchError;

    fn try_from(value: ndarray::ArrayBase<T, D>) -> Result<Self, Self::Error> {
        Self::try_from(&value)
    }
}

// ...

impl<T, D> TryFrom<&ndarray::ArrayBase<T, D>> for Tensor
where
    T: ndarray::Data,
    T::Elem: Element,
    D: ndarray::Dimension,
{
    type Error = TchError;

    fn try_from(value: &ndarray::ArrayBase<T, D>) -> Result<Self, Self::Error> {
        let slice = value
            .as_slice()
            .ok_or_else(|| TchError::Convert("cannot convert to slice".to_string()))?;
        let tn = Self::f_from_slice(slice)?;
        let shape: Vec<i64> = value.shape().iter().map(|s| *s as i64).collect();
        tn.f_reshape(shape)
    }
}
```
```rust
// wrappers/tensor.rs
impl Tensor {
    // ...

    /// Converts a slice to a tensor.
    pub fn f_from_slice<T: kind::Element>(data: &[T]) -> Result<Tensor, TchError> {
        let data_len = data.len();
        let data = data.as_ptr() as *const c_void;
        let c_tensor = unsafe_torch_err!(at_tensor_of_data(
            data,
            [data_len as i64].as_ptr(),
            1,
            T::KIND.elt_size_in_bytes(),
            T::KIND.c_int(),
        ));
        Ok(Tensor { c_tensor })
    }
}
```
and from `libtch/torch_api.cpp`:
```cpp
tensor at_tensor_of_data(void *vs, int64_t *dims, size_t ndims,
                         size_t element_size_in_bytes, int type) {
  PROTECT(
    torch::Tensor tensor =
        torch::zeros(torch::IntArrayRef(dims, ndims), torch::ScalarType(type));
    if ((int64_t)element_size_in_bytes != tensor.element_size())
      throw std::invalid_argument("incoherent element sizes in bytes");
    void *tensor_data = tensor.data_ptr();
    memcpy(tensor_data, vs, tensor.numel() * element_size_in_bytes);
    return new torch::Tensor(tensor);
  )
  return nullptr;
}
```
This implementation is quite expensive and hurts performance compared to the Python API, which, if I am not mistaken, converts a NumPy array into a tensor by reusing its data (e.g. `torch.from_numpy` shares the underlying buffer).
I am wondering if it would make sense to have an implementation similar to the one below:
```rust
fn ndarray_to_tensor<T, D>(array: ArrayBase<T, D>) -> Tensor
where
    T: ndarray::Data,
    T::Elem: kind::Element,
    D: ndarray::Dimension,
{
    let shape: Vec<i64> = array.shape().iter().map(|&s| s as i64).collect();
    let strides: Vec<i64> = array.strides().iter().map(|&s| s as i64).collect();
    let kind = get_kind::<T::Elem>();
    unsafe {
        let data_ptr = array.as_ptr();
        // Calculate the byte length of the array (size of the element type,
        // not of the storage type `T`).
        let num_bytes = array.len() * std::mem::size_of::<T::Elem>();
        // Create a byte slice over the data.
        let byte_slice = std::slice::from_raw_parts(data_ptr as *const u8, num_bytes);
        // Ensure the ndarray is not dropped while the Tensor exists
        // (this leaks the buffer; see the sketch below).
        std::mem::forget(array);
        // Hand the raw bytes to libtorch without copying them.
        let byte_slice_ptr = byte_slice.as_ptr();
        Tensor::from_blob(byte_slice_ptr, &shape, &strides, kind, Device::Cpu)
    }
}

pub fn get_kind<T: kind::Element>() -> Kind {
    T::KIND
}
```
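One caveat with the sketch above: `std::mem::forget` leaks the array's buffer, since nothing ever frees it once the `ArrayBase` is forgotten. A possible alternative, sketched below under the assumption of an owned `ndarray::Array` (whose heap buffer stays in place when the owner is moved), is a small owner type that keeps the array alive next to the tensor. `NdarrayTensor` is a hypothetical name, not an existing tch type.

```rust
/// Hypothetical owner type (not part of tch) that keeps the source array
/// alive for as long as the zero-copy tensor is used, instead of leaking it.
pub struct NdarrayTensor<A, D>
where
    A: kind::Element,
    D: ndarray::Dimension,
{
    /// Declared before `_array` so the tensor is dropped first.
    pub tensor: Tensor,
    // Held only so the backing buffer outlives `tensor`; the tensor borrows
    // this buffer via `from_blob`.
    _array: ndarray::Array<A, D>,
}

impl<A, D> NdarrayTensor<A, D>
where
    A: kind::Element,
    D: ndarray::Dimension,
{
    pub fn new(array: ndarray::Array<A, D>) -> Self {
        let shape: Vec<i64> = array.shape().iter().map(|&s| s as i64).collect();
        let strides: Vec<i64> = array.strides().iter().map(|&s| s as i64).collect();
        // Safety: the pointer stays valid because the owned array (and thus
        // its heap allocation) is stored in the struct alongside the tensor.
        let tensor = unsafe {
            Tensor::from_blob(
                array.as_ptr() as *const u8,
                &shape,
                &strides,
                A::KIND,
                Device::Cpu,
            )
        };
        NdarrayTensor { tensor, _array: array }
    }
}
```

Whether the public conversion API should return such an owner, or keep the copying behaviour by default and expose the zero-copy path separately, is probably the main design question here.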
The device type above is hardcoded, though we could infer at runtime whether the device is `Cpu` or `Cuda` using the Rust API. However, I did not find a way to detect the `Mps` or `Vulkan` device types; possibly we could infer this on the C++ side at runtime?
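For the `Cpu`/`Cuda` part, a minimal sketch of runtime selection using only the existing Rust API might look like this (it does not cover `Mps` or `Vulkan`):

```rust
use tch::Device;

// Picks Cuda(0) when a CUDA device is visible, Cpu otherwise.
fn default_device() -> Device {
    Device::cuda_if_available()
}
```

That said, for `from_blob` the device has to match where the buffer actually lives, and an ndarray's buffer is always host memory, so `Device::Cpu` followed by an explicit `.to_device(...)` (which copies) may be the only correct option for the zero-copy path itself.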
Performance comparison
I tested the proposed implementation vs. the current implementation and here's the average time taken to build the tensor:
For a ~40 MB tensor:
- Current implementation: 6.581549ms
- Proposed implementation: 37.708µs

For a ~400 MB tensor:
- Current implementation: 153.497799ms
- Proposed implementation: 50.508µs

For a ~800 MB tensor:
- Current implementation: 394.885819ms
- Proposed implementation: 68.493µs
The test I used to compute these is the following (ideally we would benchmark this properly for a production solution):
```rust
use ndarray::Array3;
use std::time::{Duration, Instant};

#[test]
fn from_ndarray() {
    let (nrows, ncols, ndepth) = (2_000, 500, 100);
    let iterations = 50;
    let mut total_duration_tensor = Duration::new(0, 0);
    let mut total_duration_tensor_2 = Duration::new(0, 0);

    for _ in 0..iterations {
        let nd = Array3::<f64>::zeros((nrows, ncols, ndepth));
        let nd_clone = nd.clone();

        // Timing for the current (copying) conversion
        let start = Instant::now();
        let tensor = Tensor::try_from(nd).unwrap();
        total_duration_tensor += start.elapsed();

        // Timing for the proposed zero-copy conversion
        let start = Instant::now();
        let tensor_2 = ndarray_to_tensor(nd_clone);
        total_duration_tensor_2 += start.elapsed();

        // Check equality
        assert_eq!(tensor, tensor_2);
    }

    let avg_duration_tensor = total_duration_tensor / iterations;
    let avg_duration_tensor_2 = total_duration_tensor_2 / iterations;
    println!("Average time taken to build tensor: {:?}", avg_duration_tensor);
    println!("Average time taken to build tensor_2: {:?}", avg_duration_tensor_2);
}
```
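On the benchmarking note: for a production solution, something like a `criterion` benchmark would give more reliable numbers than manual `Instant` timing. A sketch, assuming `criterion` as a dev-dependency (file name and labels are illustrative):

```rust
// benches/ndarray_conversion.rs -- hypothetical criterion benchmark sketch.
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use ndarray::Array3;
use tch::Tensor;

fn bench_conversion(c: &mut Criterion) {
    // Smaller than the 800 MB case so that many iterations stay cheap.
    let nd = Array3::<f64>::zeros((200, 500, 100));
    c.bench_function("ndarray -> Tensor (copying try_from)", |b| {
        b.iter(|| Tensor::try_from(black_box(&nd)).unwrap())
    });
    // The zero-copy path would need the leak from `std::mem::forget` addressed
    // (e.g. via the owner-type sketch above) before it can be iterated
    // thousands of times here.
}

criterion_group!(benches, bench_conversion);
criterion_main!(benches);
```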