coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

impl Default for Tensor? #822

Open emchristiansen opened 1 year ago

emchristiansen commented 1 year ago

Thanks for working on this! I once had a project where we regularly worked with 6-dimensional tensors, and it was such a pain to keep track of the axes that we wrote a separate library to track them for us. Something like this would have been great!

Is it possible to impl Default for Tensor in any reasonable way? E.g. if I only have the type Tensor<S, E, D, T>, can I generate a Tensor of zeros of that type? So far the only construction method I've seen for Tensors uses a device object.

Why I care: I have crazy nested data structures that I want to compute gradients through using dfdx. The data structures can be parameterized with anything "number like", and for something to be "number like" it has to have an additive identity element (zero), i.e. the default value.
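To make the "number like" requirement concrete, here is a minimal std-only sketch. The `NumLike` trait, the `Pair` struct, and `sum_all` are hypothetical names made up for illustration and are not part of dfdx; the point is only that generic code like this calls `Default::default()` with no device value in scope.

```rust
use std::ops::Add;

/// Hypothetical "number like" bound: addition plus a zero obtained via `Default`.
pub trait NumLike: Add<Output = Self> + Default + Clone {}
impl<T: Add<Output = T> + Default + Clone> NumLike for T {}

/// Hypothetical nested structure parameterized over any number-like type.
/// Deriving `Default` only works if `N: Default`.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Pair<N> {
    pub a: N,
    pub b: N,
}

/// Sums a slice, starting from the additive identity `N::default()`.
pub fn sum_all<N: NumLike>(xs: &[N]) -> N {
    xs.iter().cloned().fold(N::default(), |acc, x| acc + x)
}
```

With `f32` this works out of the box; a tensor type could only be plugged in as `N` if it also implemented `Default` (and `Add`), which is what this issue asks about.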

Relatedly, ensuring Tensor<Rank0, _, _, _> impls all the num-like traits would be amazing, as it would make it a drop-in replacement for f32, with the side-effect of getting gradients for free. This would make me very happy.

coreylowman commented 1 year ago

I think this would require thread-local device objects (like rand::thread_rng()), but if we had those it would be possible. Imagining something like:

pub fn thread_cpu() -> Cpu { ... }
pub fn thread_cuda(ordinal: usize) -> Cuda { ... }

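// zeros() needs a shape known at compile time, hence ConstShape rather than Shape.
impl<S: ConstShape, E: Dtype> Default for Tensor<S, E, Cpu> {
    fn default() -> Self {
        thread_cpu().zeros()
    }
}

impl<S: ConstShape, E: Dtype> Default for Tensor<S, E, Cuda> {
    fn default() -> Self {
        // Always uses device ordinal 0.
        thread_cuda(0).zeros()
    }
}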

I'm unsure how sound these thread-local objects are though; I'd have to think about it. It would be weird to mix the use of the thread-local object and a separately constructed device object.
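For illustration, the thread-local idea could be sketched with nothing but the standard library. Everything here is a hypothetical stand-in: `MockCpu` is not dfdx's `Cpu`, and `Zeros4` plays the role of a tensor type. The sketch only shows how a `thread_local!` device lets `Default` be implemented without a device argument.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a device handle; cloning shares the same state.
#[derive(Clone, Default)]
pub struct MockCpu {
    allocations: Rc<RefCell<usize>>,
}

impl MockCpu {
    /// Pretends to allocate a zero-filled buffer and records the allocation.
    pub fn zeros(&self, len: usize) -> Vec<f32> {
        *self.allocations.borrow_mut() += 1;
        vec![0.0; len]
    }

    /// Number of allocations made through this device (or any of its clones).
    pub fn allocation_count(&self) -> usize {
        *self.allocations.borrow()
    }
}

thread_local! {
    // One device per thread, created lazily on first use.
    static THREAD_CPU: MockCpu = MockCpu::default();
}

/// Returns a handle to the current thread's device (hypothetical helper).
pub fn thread_cpu() -> MockCpu {
    THREAD_CPU.with(|dev| dev.clone())
}

/// A buffer type that can be defaulted without passing a device around.
pub struct Zeros4(pub Vec<f32>);

impl Default for Zeros4 {
    fn default() -> Self {
        Zeros4(thread_cpu().zeros(4))
    }
}
```

Because the handle holds an `Rc`, it cannot be sent to another thread, much like `rand`'s `ThreadRng`. Each thread gets its own independent device, which is one way the "mixing" concern shows up: values created on different threads come from different device objects.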

emchristiansen commented 1 year ago

As a workaround, assuming I'm doing everything on a single device (say the CPU for now), could I just define something like this and use it for my device everywhere, assuming I'm careful to remain inside the same system thread*?

use once_cell::sync::Lazy; // assumed source of `Lazy`; the original snippet does not say

pub static DFDX_DEVICE: Lazy<Cpu> = Lazy::new(Cpu::default);

fn foo() {
  let weight: Tensor<Rank2<4, 2>, f32, _, NoneTape> =
    DFDX_DEVICE.sample_normal();
  ...
}

But even if that worked, what about the gradient tape? If T is NoneTape it's pretty clear what to do, but what if T is OwnedTape<..>? What would the correct default value be in that case?

*Also, is thread locality important for Cpu or just Cuda?

coreylowman commented 1 year ago

> As a workaround, assuming I'm doing everything on a single device (say the CPU for now), could I just define something like this and use it for my device everywhere, assuming I'm careful to remain inside the same system thread*?

Yeah definitely!

> But even if that worked, what about the gradient tape?

Probably just call .traced() after construction - none of the tensor creation methods currently create OwnedTapes, so that would be consistent.

> *Also, is thread locality important for Cpu or just Cuda?

If it's just for you, you probably don't need to worry about it. The main thing is to minimize the number of different device objects. For CPU it's mainly important if you want to enable allocation caching (https://docs.rs/dfdx/latest/dfdx/tensor/trait.Cache.html). The same goes for CUDA, but CUDA will also load kernels into the device object, so you don't want to create several of them.
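A rough std-only sketch of why the number of device objects matters, under the assumption that clones of a device share one cache while independently constructed devices do not. `MockDevice`, `alloc`, `release`, and `global_device` are hypothetical names, not dfdx APIs; `OnceLock` is used here as a standard-library analogue of the `Lazy` static suggested earlier in the thread.

```rust
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for a device whose clones share one allocation cache.
#[derive(Clone, Default)]
pub struct MockDevice {
    cache: Arc<Mutex<Vec<Vec<f32>>>>,
}

impl MockDevice {
    /// Reuses a cached buffer of the right length if one exists,
    /// otherwise allocates a new one.
    pub fn alloc(&self, len: usize) -> Vec<f32> {
        let mut cache = self.cache.lock().unwrap();
        match cache.iter().position(|b| b.len() == len) {
            Some(i) => cache.swap_remove(i),
            None => vec![0.0; len],
        }
    }

    /// Hands a buffer back to the cache instead of freeing it.
    pub fn release(&self, buf: Vec<f32>) {
        self.cache.lock().unwrap().push(buf);
    }

    /// Number of buffers currently waiting in the cache.
    pub fn cached_buffers(&self) -> usize {
        self.cache.lock().unwrap().len()
    }
}

/// Process-wide device, created once on first use.
pub fn global_device() -> &'static MockDevice {
    static DEVICE: OnceLock<MockDevice> = OnceLock::new();
    DEVICE.get_or_init(MockDevice::default)
}
```

In this sketch a buffer released through one handle is visible to every clone of it, whereas a second `MockDevice::default()` starts with an empty cache and gets no reuse.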