`try_from` ndarray to tensor using zero copy

I noticed that the current try_from implementation for converting ndarrays into tensors is copying the underlying data. This makes interoperability between ndarray --> tensor $O(n)$, however if we implement a zero copy solution we could bring this to $O(1)$. For reference the current implementation is:

// tensor/convert.rs
impl<T, D> TryFrom<ndarray::ArrayBase<T, D>> for Tensor
where
    T: ndarray::Data,
    T::Elem: Element,
    D: ndarray::Dimension,
{
    type Error = TchError;

    fn try_from(value: ndarray::ArrayBase<T, D>) -> Result<Self, Self::Error> {
        Self::try_from(&value)
    }
}

// ...

impl<T, D> TryFrom<&ndarray::ArrayBase<T, D>> for Tensor
where
    T: ndarray::Data,
    T::Elem: Element,
    D: ndarray::Dimension,
{
    type Error = TchError;

    fn try_from(value: &ndarray::ArrayBase<T, D>) -> Result<Self, Self::Error> {
        let slice = value
            .as_slice()
            .ok_or_else(|| TchError::Convert("cannot convert to slice".to_string()))?;
        let tn = Self::f_from_slice(slice)?;
        let shape: Vec<i64> = value.shape().iter().map(|s| *s as i64).collect();
        tn.f_reshape(shape)
    }
}

// wrappers/tensor.rs
impl Tensor {
// ...
    /// Converts a slice to a tensor.
    pub fn f_from_slice<T: kind::Element>(data: &[T]) -> Result<Tensor, TchError> {
        let data_len = data.len();
        let data = data.as_ptr() as *const c_void;
        let c_tensor = unsafe_torch_err!(at_tensor_of_data(
            data,
            [data_len as i64].as_ptr(),
            1,
            T::KIND.elt_size_in_bytes(),
            T::KIND.c_int(),
        ));
        Ok(Tensor { c_tensor })
    }
}

and from the tchlib/torch_api.cpp:

tensor at_tensor_of_data(void *vs, int64_t *dims, size_t ndims, size_t element_size_in_bytes, int type) {
  PROTECT(
    torch::Tensor tensor = torch::zeros(torch::IntArrayRef(dims, ndims), torch::ScalarType(type));
    if ((int64_t)element_size_in_bytes != tensor.element_size())
      throw std::invalid_argument("incoherent element sizes in bytes");
    void *tensor_data = tensor.data_ptr();
    memcpy(tensor_data, vs, tensor.numel() * element_size_in_bytes);
    return new torch::Tensor(tensor);
  )
  return nullptr;
}

This implementation is quite expensive and hurts the performance compared to the python API, which if I am not mistaken, allows us to convert a numpy array into a tensor by reusing the data.

I am wondering if it would make sense to have an implementation similar to the below:

fn ndarray_to_tensor<T, D>(array: ArrayBase<T, D>) -> Tensor
where
    T: ndarray::Data,
    T::Elem: kind::Element,
    D: ndarray::Dimension,
{
    let shape: Vec<i64> = array.shape().iter().map(|&s| s as i64).collect();
    let strides: Vec<i64> = array.strides().iter().map(|&s| s as i64).collect();
    let kind = get_kind::<T::Elem>();

    unsafe {
        let data_ptr = array.as_ptr();

        // Calculate the byte length of the array
        let num_bytes = array.len() * std::mem::size_of::<T>();

        // Create a byte slice from the data
        let byte_slice = std::slice::from_raw_parts(data_ptr as *const u8, num_bytes);

        // Ensure the ndarray is not dropped while the Tensor exists
        std::mem::forget(array);

        // Get the raw pointer of the byte slice
        let byte_slice_ptr = byte_slice.as_ptr();

        Tensor::from_blob(byte_slice_ptr, &shape, &strides, kind, Device::Cpu)
    }
}

pub fn get_kind<T: kind::Element>() -> Kind {
    T::KIND
}

The device type above is harcoded, though we could infer at runtime if the device is Cpu or Cuda using the Rust API. However I did not find a way to infer if the types Mps or Vulkan. Possibly we could to infer this during the C++ runtime?

Performance comparison

I tested the proposed implementation vs. the current implementation and here's the average time taken to build the tensor:

For a ~40 MB tensor:

Current implementation: 6.581549ms
Proposed implementation: 37.708µs

For a ~400 MB tensor:

Current implementation: 153.497799ms
Proposed implementation: 50.508µs

For a ~800 MB tensor:

Current implementation: 394.885819ms
Proposed implementation: 68.493µs

The test I used to compute these is the following (ideally we would bench this properly for a prod solution) :

#[test]
fn from_ndarray() {
    let (nrows, ncols, ndepth) = (2_000, 500, 100);

    let iterations = 50;
    let mut total_duration_tensor = Duration::new(0, 0);
    let mut total_duration_tensor_2 = Duration::new(0, 0);

    for _ in 0..iterations {
        let nd = Array3::<f64>::zeros((nrows, ncols, ndepth));
        let nd_clone = nd.clone();

        // Timing for tensor
        let start = Instant::now();
        let tensor = Tensor::try_from(nd).unwrap();
        total_duration_tensor += start.elapsed();

        // Timing for tensor_2
        let start = Instant::now();
        let tensor_2 = ndarray_to_tensor(nd_clone);
        total_duration_tensor_2 += start.elapsed();

        // Check equality
        assert_eq!(tensor, tensor_2);
    }

    let avg_duration_tensor = total_duration_tensor / iterations;
    let avg_duration_tensor_2 = total_duration_tensor_2 / iterations;

    println!(
        "Average time taken to build tensor: {:?}",
        avg_duration_tensor
    );
    println!(
        "Average time taken to build tensor_2: {:?}",
        avg_duration_tensor_2
    );
}

LaurentMazare / tch-rs

`try_from` ndarray to tensor using zero copy #841

Performance comparison