Closed: HyeokSuLee closed this issue 1 year ago
Did you manage to solve this?
I have the same error; it only happens with `backward_step_clip`/`backward_step_clip_norm`, while `backward_step` works fine.

Minimal reproduction:

The bug requires two things to be true here:

- `_lstm1` must be declared, even if unused
- `backward_step_*` must be used, not `backward_step`
```rust
use tch::{
    nn::{self, LinearConfig, Module, OptimizerConfig, RNNConfig},
    Device, Kind, Reduction, Tensor,
};

fn main() {
    println!("Hello, world!");
    let vs = nn::VarStore::new(Device::Cpu);
    let mut opt = nn::adam(0.9, 0.999, 0.01).build(&vs, 0.001).unwrap();
    // This is not used, but the bug is only reproducible when it exists.
    let _lstm1 = nn::lstm(
        &vs.root() / "lstm1",
        1,
        100,
        RNNConfig {
            ..Default::default()
        },
    );
    let dense1 = nn::linear(
        &vs.root() / "dense1",
        100,
        1,
        LinearConfig {
            ..Default::default()
        },
    );
    let mut input_window = Vec::with_capacity(100);
    for i in 0.. {
        let value = (i as f32 * 0.1).sin();
        if input_window.len() >= input_window.capacity() {
            let output = Tensor::from_slice(&input_window);
            let output = dense1.forward(&output.squeeze());
            let target = Tensor::from_slice(&[value]).to_kind(Kind::Float);
            let cost = output.mse_loss(&target, Reduction::Mean);
            // Without clipping this works instead: opt.backward_step(&cost);
            opt.backward_step_clip(&cost, 0.1);
            break;
        }
        if input_window.len() >= input_window.capacity() {
            let _ = input_window.remove(0);
        }
        input_window.push(value);
    }
}
```
My guess is that when we compute the clipped values, we require the gradients for all the variables in the var store to be available, so having extra variables that are not used to compute the loss makes this trip. I'll make some small adjustments so that we can get around this.
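Roughly, a sketch of the idea (not the actual tch internals; `clip_defined_grads` is just an illustrative name):

```rust
use tch::nn;

/// Illustrative helper, not part of the tch API: clamp the gradients of the
/// trainable variables that actually received one, and skip the rest.
fn clip_defined_grads(vs: &nn::VarStore, max: f64) {
    for var in vs.trainable_variables() {
        let mut grad = var.grad();
        // A variable like `_lstm1` that never feeds into the loss ends up with
        // an undefined gradient here; touching it anyway is what trips the error.
        if grad.defined() {
            let _ = grad.clamp_(-max, max); // exact clamp_ signature may vary across tch versions
        }
    }
}
```

In terms of the repro above, the failing call is morally `cost.backward()` followed by this kind of clipping pass and then `opt.step()`, so skipping undefined gradients is the sort of adjustment that should avoid the error.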
I just merged #771, which should hopefully help with this: your repro does not fail anymore with the change applied, but it would be good if you could double-check that the resulting gradients make sense (though this is covered by some automated tests already). Thanks a lot for providing the small repro!
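If you want to sanity check them by hand, a quick loop over the var store like the one below should do (only for eyeballing; the output format is arbitrary):

```rust
// After opt.backward_step_clip(&cost, 0.1), print each variable's gradient norm;
// variables that are unused in the loss simply report no gradient.
for (name, var) in vs.variables() {
    let grad = var.grad();
    if grad.defined() {
        println!("{}: grad norm = {}", name, grad.norm().double_value(&[]));
    } else {
        println!("{}: no gradient (unused in the loss)", name);
    }
}
```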
Closing as there hasn't been any follow-up and hopefully the issue is now resolved.
Thanks, it works.
I just noticed the issue, updated my dependency to the latest GitHub version, and `backward_step_clip` now works:

```toml
tch = { git = "https://github.com/LaurentMazare/tch-rs", features = ["download-libtorch"] }
```
I'm using bevy with tch-rs, with bevy driving an A3C setup. There is a struct `GlobalParam` for the global weights. In the thread function, I collect actions with the agent and send them to the global struct; the global struct then updates its weights with the optimizer. The loss has its own value of `Tensor` type (not None). But the error below occurs with `optimizer.backward_step_clip(&loss, 0.5)`, whereas `optimizer.backward_step()` works perfectly. I think this is weird behaviour and I cannot find much information about it on Google. Is this a bug? (In non-multithreaded code, `backward_step_clip` worked well too. And my code optimizes the global value in a single thread, so I don't think it's a multithreading kind of error.)

----------------------- ERROR MESSAGE --------------------------------------