Open mullerhai opened 1 year ago
tch is a thin layer around the C++ libtorch API; I don't think the dataloader and dataset APIs are available there, so I would think that should be done as part of external crates if possible.
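For what it's worth, here is a minimal sketch of doing the batching by hand with plain tensor ops; the shuffled_batches helper is purely illustrative (it is not an existing tch API), but shuffling indices and slicing with randperm/index_select/narrow covers most of what the Python DataLoader does for in-memory data:

use tch::{Device, Kind, Tensor};

// Hypothetical helper: shuffle a (features, labels) pair and yield mini-batches.
// Not a tch API, just an illustration of doing the dataloader's job by hand.
fn shuffled_batches(xs: &Tensor, ys: &Tensor, batch_size: i64) -> Vec<(Tensor, Tensor)> {
    let n = xs.size()[0];
    let idx = Tensor::randperm(n, (Kind::Int64, xs.device()));
    let xs = xs.index_select(0, &idx);
    let ys = ys.index_select(0, &idx);
    (0..n)
        .step_by(batch_size as usize)
        .map(|start| {
            let len = batch_size.min(n - start);
            (xs.narrow(0, start, len), ys.narrow(0, start, len))
        })
        .collect()
}

fn main() {
    let device = Device::cuda_if_available();
    let xs = Tensor::rand(&[1000, 32], (Kind::Float, device));
    let ys = Tensor::randint(10, &[1000], (Kind::Int64, device));
    for (bx, by) in shuffled_batches(&xs, &ys, 64) {
        // Feed bx/by to the training step here.
        let _ = (bx, by);
    }
}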
Re multi-head attention, the C++ functions should be exposed, e.g. native_multi_head_attention. You can see how it's used in the pytorch codebase here for the C++ side and here for the Python side, and probably adapt these using tch.
Regarding the attention layers, I attempted to use f_internal_native_multi_head_attention following the provided links. However, doing so raises the following error upon calling backward:
Torch("derivative for aten::_native_multi_head_attention is not implemented")
The same error is reproducible via this (nonsense) example using tch 0.14.0 and libtorch 2.1.0 (tried on both CPU and GPU):
use tch::nn::OptimizerConfig;
use tch::{nn, Device, Kind, Tensor};

const N_EMBED: i64 = 32;
const MAX_SEQ_LEN: i64 = 48;
const N_HEAD: i64 = 2;
const BATCH_SIZE: i64 = 4;

fn main() {
    let device = Device::cuda_if_available();
    let var_store = nn::VarStore::new(device);
    let path = var_store.root();

    // Attention weights tracked by the var store.
    let qkv_weight = path.zeros("qkv_weight", &[N_EMBED * 3, N_EMBED]);
    let qkv_bias = path.zeros("qkv_bias", &[N_EMBED * 3]);
    let proj_weight = path.zeros("proj_weight", &[N_EMBED, N_EMBED]);
    let proj_bias = path.zeros("proj_bias", &[N_EMBED]);

    // Causal mask: -inf strictly above the diagonal.
    let mask = Tensor::full(
        &[MAX_SEQ_LEN, MAX_SEQ_LEN],
        f64::NEG_INFINITY,
        (Kind::Float, path.device()),
    )
    .triu(1);

    let mut opt = nn::AdamW::default().build(&var_store, 0.001).unwrap();
    let xs = Tensor::rand(
        &[BATCH_SIZE, MAX_SEQ_LEN, N_EMBED],
        (Kind::Float, path.device()),
    );

    let att = Tensor::f_internal_native_multi_head_attention(
        &xs,
        &xs,
        &xs,
        N_EMBED,
        N_HEAD,
        &qkv_weight,
        &qkv_bias,
        &proj_weight,
        &proj_bias,
        Some(&mask),
        false,
        true,
        0,
    )
    .unwrap()
    .0;

    // The forward pass succeeds; the backward step below raises the error.
    let att = att.sum(Kind::Float);
    opt.backward_step(&att);
}
Is this behaviour expected? I presume I may be missing something trivial given my lack of experience with torch/C++.
This mostly means that the C++ side doesn't handle the backward step for this function. Not sure what PyTorch does around this; maybe backprop is not supported for this fast attention layer, or maybe it's handled on the Python side so we don't have access to it here.
Gotcha. Tried the higher-level function internal_transformer_encoder_layer_fwd and got the same error. Thanks for the useful links above, I would never have figured this out without those. Will continue with the more manual attention implementations for now.
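For anyone else going the manual route, here is a rough sketch of a causal multi-head attention block built only from basic, differentiable tch ops, in the spirit of the min-gpt example in the tch repo (names and layout are illustrative, not tuned or validated):

use tch::{nn, nn::Module, Kind, Tensor};

// Hand-rolled causal multi-head self-attention using only basic ops.
fn multi_head_attention(p: &nn::Path, n_embed: i64, n_head: i64) -> impl Module {
    let c_attn = nn::linear(p / "c_attn", n_embed, 3 * n_embed, Default::default());
    let c_proj = nn::linear(p / "c_proj", n_embed, n_embed, Default::default());
    nn::func(move |xs| {
        let (b, t, c) = xs.size3().unwrap();
        let head_dim = c / n_head;
        // Project to q, k, v and split heads: (b, t, c) -> (b, n_head, t, head_dim).
        let qkv = xs.apply(&c_attn).split(c, 2);
        let reshape = |x: &Tensor| x.view([b, t, n_head, head_dim]).transpose(1, 2);
        let (q, k, v) = (reshape(&qkv[0]), reshape(&qkv[1]), reshape(&qkv[2]));
        // Scaled dot-product attention with a causal mask.
        let att = q.matmul(&k.transpose(-2, -1)) * (1.0 / (head_dim as f64).sqrt());
        let mask = Tensor::ones([t, t], (Kind::Float, xs.device())).tril(0);
        let att = att.masked_fill(&mask.eq(0.), f64::NEG_INFINITY);
        let att = att.softmax(-1, Kind::Float);
        // Merge heads back: (b, n_head, t, head_dim) -> (b, t, c).
        let ys = att.matmul(&v).transpose(1, 2).contiguous().view([b, t, c]);
        ys.apply(&c_proj)
    })
}

Since this only uses differentiable primitives, opt.backward_step works with it the same way as in the snippet above.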
Had a bit more time recently to look into this. If anyone else is still looking for native attention in tch, it looks like Tensor::scaled_dot_product_attention is the function to use; forward and backward work without problems. Looking through the C++ code, we get flash/efficient attention depending on which scaled-dot-product (SDP) backend is available.
The caveat is that, as far as I can see from the tch and C++ API docs, we can't explicitly enable those options as in PyTorch, nor can we tell which SDP backend is actually being used.
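For reference, a minimal sketch of what the call can look like, assuming the tch 0.14 / libtorch 2.1 signature (optional attention mask, dropout probability, is_causal flag, and an optional scale argument; inputs laid out as (batch, heads, seq_len, head_dim)):

use tch::{Device, Kind, Tensor};

const BATCH: i64 = 4;
const N_HEAD: i64 = 2;
const SEQ_LEN: i64 = 48;
const HEAD_DIM: i64 = 16;

fn main() {
    let device = Device::cuda_if_available();
    let opts = (Kind::Float, device);
    // SDPA expects (batch, heads, seq_len, head_dim) inputs.
    let q = Tensor::rand(&[BATCH, N_HEAD, SEQ_LEN, HEAD_DIM], opts).set_requires_grad(true);
    let k = Tensor::rand(&[BATCH, N_HEAD, SEQ_LEN, HEAD_DIM], opts);
    let v = Tensor::rand(&[BATCH, N_HEAD, SEQ_LEN, HEAD_DIM], opts);
    // No explicit mask: is_causal = true applies a causal mask internally.
    let att = q.scaled_dot_product_attention(&k, &v, None::<Tensor>, 0.0, true, None::<f64>);
    // Unlike _native_multi_head_attention, backward works here.
    att.sum(Kind::Float).backward();
}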
Hi, I need to load data in the Python PyTorch style, but I can't find the dataloader and dataset APIs, and the multi-head attention layer is also missing. Does the tch-rs project support them? Thanks.