LaurentMazare / tch-rs

Rust bindings for the C++ api of PyTorch.
Apache License 2.0

backward step clip weird error #503

Closed HyeokSuLee closed 1 year ago

HyeokSuLee commented 2 years ago

I'm using bevy with tch-rs to implement A3C. A GlobalParams struct holds the global weights. In the thread function, each agent collects actions and sends them to the global struct, which then updates the weights with an optimizer. The loss has a value of its own, of type Tensor (not None). However, the error below occurs with optimizer.backward_step_clip(&loss, 0.5), whereas optimizer.backward_step() works perfectly. This seems like weird behaviour, and I cannot find much information about it on Google. Is this a bug? (In non-multithreaded code, backward_step_clip worked well too. But my code optimizes the global values in a single thread, so I don't think this is a multithreading error.)

// Resource (thread-safe memory)
struct GlobalParams {
    vs: VarStore,
}

//Thread function
fn train_system(mut query: Query<(&mut Agent, &mut Env)>, mut global_data: ResMut<GlobalParams>) {
    query.for_each_mut(|(agent, mut train_env)| {
        /*-------------------------------- Training ------------------------------*/
        let model = model(&agent.vs.root(), train_env.observation_space(), 3);
        let mut action_log_probs = vec![];
        let mut state_values = vec![];
        let mut rewards = vec![];

        // weigh and bias checking value

        for _ in 0..episodes {
            let mut state = train_env.reset();

            loop {
                /*-------------------------------- select action ------------------------------*/
                let (critic, actor) = tch::with_grad(|| model(&state));
                let probs = actor.softmax(-1, Float);

                //sample 1 action
                let action = probs.multinomial(1, true).squeeze_dim(-1);

                let log_probs = actor.log_softmax(-1, Float);
                let action_log_prob = {
                    let index = action.unsqueeze(0).unsqueeze(0);

                    log_probs
                        .unsqueeze(0)
                        .gather(1, &index, false)
                        .squeeze_dim(-1)
                };

                //save to memory
                action_log_probs.push(action_log_prob);
                state_values.push(critic);

                let step = train_env.step(i64::from(&action));

                state = step.obs;

                rewards.push(step.reward.clone());

                /*-------------------------------- Update ------------------------------*/
                if train_env.get_step_progress_count() % 20 == 0 || step.is_done {
                    /*-----------------------
                    Training code.
                    Calculates actor and critic loss and performs backprop.
                    ------------------------*/
                    global_data.optimize(&action_log_probs, &state_values, &rewards);
                    //fetch_local_with_global_var();

                    //clear buffers
                    action_log_probs.clear();
                    state_values.clear();
                    rewards.clear();
                }

                /*-------------------------------- Episode Finished ------------------------------*/
                if step.is_done {
                    break;
                }
            }
        }
    });
}

impl GlobalParams {
    pub fn optimize(
        &mut self,
        action_log_probs: &Vec<Tensor>,
        state_values: &Vec<Tensor>,
        rewards: &Vec<f64>,
    ) {
        /*-----------------------
                 Training code.
          Calculates actor and critic loss and performs backprop.
        ------------------------*/
        let mut opt = nn::Adam::default().build(&self.vs, 0.0001).unwrap();
        let mut total_r = 0.;
        let mut policy_losses = vec![];
        let mut value_losses = vec![];
        let mut returns: Vec<f64> = vec![];

        for r in rewards.to_owned().into_iter().rev() {
            let gamma = 0.99;
            total_r = r + gamma * total_r as f64;
            returns.insert(0, total_r);
        }

        let returns = Tensor::from(returns.as_slice());
        let returns = (&returns - returns.mean(Float)) / (returns.std(true) + f64::EPSILON);
        let returns: Vec<f64> = returns.try_into().unwrap();

        for ((log_prob, value), r) in action_log_probs
            .iter()
            .zip(state_values.iter())
            .zip(returns.iter())
        {
            let advantage = r.to_owned() - value;
            policy_losses.push(-log_prob * advantage);

            value_losses.push(value.smooth_l1_loss(
                &Tensor::from(r.to_owned()),
                tch::Reduction::Mean,
                1.0,
            ))
        }

        let loss = (Tensor::stack(policy_losses.as_slice(), 0).sum(Float)
            + Tensor::stack(value_losses.as_slice(), 0).sum(Float))
        .set_requires_grad(true);

        opt.backward_step_clip(&loss, 0.5); // <-- ERROR: panics here
        // opt.backward_step(&loss); // <-- OK
    }
}
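
A side note on the return computation in optimize above: each returns.insert(0, ...) call shifts the whole vector, so building the list is quadratic in the number of steps. Below is a minimal sketch in plain Rust (no tch involved; discounted_returns is a hypothetical helper name, not something from the original code) that produces the same discounted returns by pushing in reverse order and reversing once. It is unrelated to the panic itself.

```rust
/// Computes the discounted return G_t = r_t + gamma * G_{t+1} for every step,
/// with the return after the last step taken to be 0.
/// Hypothetical helper for illustration; not part of tch or the original post.
fn discounted_returns(rewards: &[f64], gamma: f64) -> Vec<f64> {
    let mut returns = Vec::with_capacity(rewards.len());
    let mut running = 0.0;
    // Walk the rewards from the last step to the first, accumulating the
    // discounted sum as we go.
    for &r in rewards.iter().rev() {
        running = r + gamma * running;
        returns.push(running);
    }
    // The values were pushed last-step-first, so restore chronological order.
    returns.reverse();
    returns
}
```

For rewards [1.0, 1.0, 1.0] and gamma 0.5 this gives [1.75, 1.5, 1.0]; the resulting slice can then be passed to Tensor::from as in the original code.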

-----------------------ERROR MESSAGE--------------------------------------

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Torch("Expected a proper Tensor but got None (or an undefined Tensor in C++) for argument #0 'self'\nException raised from checked_cast_variable at C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\torch\\csrc\\autograd\\VariableTypeManual.cpp:54 (most recent call first):\n00007FFDF8A0A4C200007FFDF8A0A460 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]\n00007FFDF8A09D8E00007FFDF8A09D40 c10.dll!c10::detail::torchCheckFail [<unknown file> @ <unknown line number>]\n00007FFCFA80CC7600007FFCFA80AAC0 torch_cpu.dll!torch::autograd::VariableType::allCUDATypes [<unknown file> @ <unknown line number>]\n00007FFCF9AEB00700007FFCF9ADA050 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]\n00007FFCF8E552E300007FFCF8E551C0 torch_cpu.dll!at::_ops::clamp_::call [<unknown file> @ <unknown line number>]\n00007FF6DE4D002700007FF6DE4CFFD0 a3c_project.exe!atg_clamp_ [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\torch-sys-0.7.2\\libtch\\torch_api_generated.cpp.h @ 3471]\n00007FF6DE40BDE500007FF6DE40BCC0 a3c_project.exe!tch::wrappers::tensor::Tensor::f_clamp_<f64> [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\tch-0.7.2\\src\\wrappers\\tensor_fallible_generated.rs @ 6790]\n00007FF6DE40F03E00007FF6DE40F010 a3c_project.exe!tch::wrappers::tensor::Tensor::clamp_<f64> [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\tch-0.7.2\\src\\wrappers\\tensor_generated.rs @ 3502]\n00007FF6DE41335E00007FF6DE413230 a3c_project.exe!tch::nn::optimizer::Optimizer::clip_grad_value [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\tch-0.7.2\\src\\nn\\optimizer.rs @ 161]\n00007FF6DE41347F00007FF6DE413410 a3c_project.exe!tch::nn::optimizer::Optimizer::backward_step_clip 
[C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\tch-0.7.2\\src\\nn\\optimizer.rs @ 208]\n00007FF6DDC37C7600007FF6DDC37240 a3c_project.exe!a3c_project::a3c::GlobalParams::optimize [I:\\Documents\\MachineLearning\\rust\\a3c_project\\src\\a3c.rs @ 252]\n00007FF6DDBFF05B00007FF6DDBFEA10 a3c_project.exe!a3c_project::a3c::train_system::closure$0 [I:\\Documents\\MachineLearning\\rust\\a3c_project\\src\\a3c.rs @ 176]\n00007FF6DDBBC41500007FF6DDBBC0F0 a3c_project.exe!bevy_ecs::query::state::QueryState<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >::for_each_unchecked_manual<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<>,tuple$<bevy_ecs::query::fetch::WriteFetch<a3c_project::a3c::Agent>,bevy_ecs::query::fetch::WriteFetch<a3c_project::env::Env> >,a3c_project::a3c::train_system::closure_env$0> [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_ecs-0.7.0\\src\\query\\state.rs @ 820]\n00007FF6DDBCB88800007FF6DDBCB850 a3c_project.exe!bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >::for_each_mut<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<>,a3c_project::a3c::train_system::closure_env$0> [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_ecs-0.7.0\\src\\system\\query.rs @ 506]\n00007FF6DDC3723A00007FF6DDC37210 a3c_project.exe!a3c_project::a3c::train_system [I:\\Documents\\MachineLearning\\rust\\a3c_project\\src\\a3c.rs @ 194]\n00007FF6DDBAC57000007FF6DDBAC520 a3c_project.exe!core::ops::function::FnMut::call_mut<void (*)(bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalParams>),tuple$<bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> 
>,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalParams> > > [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\core\\src\\ops\\function.rs @ 164]\n00007FF6DDBFE4D800007FF6DDBFE460 a3c_project.exe!core::ops::function::impls::impl$3::call_mut<tuple$<bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalParams> >,void (*)(bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalParams>)> [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\core\\src\\ops\\function.rs 
@ 291]\n00007FF6DDBD572C00007FF6DDBD56C0 a3c_project.exe!bevy_ecs::system::function_system::impl$20::run::call_inner<tuple$<>,bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalParams>,ref_mut$<void (*)(bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalParams>)> > [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_ecs-0.7.0\\src\\system\\function_system.rs @ 525]\n00007FF6DDBFE1B300007FF6DDBFE100 a3c_project.exe!bevy_ecs::system::function_system::impl$20::run<tuple$<>,void (*)(bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalParams>),bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalParams> > [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_ecs-0.7.0\\src\\system\\function_system.rs @ 528]\n00007FF6DDBD5B1A00007FF6DDBD5A70 a3c_project.exe!bevy_ecs::system::function_system::impl$5::run_unsafe<tuple$<>,tuple$<>,tuple$<bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalParams> >,tuple$<>,void (*)(bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalParams>)> [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_ecs-0.7.0\\src\\system\\function_system.rs @ 399]\n00007FF6DE22EBC900007FF6DE22E970 
a3c_project.exe!bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block$0 [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_ecs-0.7.0\\src\\schedule\\executor_parallel.rs @ 196]\n00007FF6DE239E0900007FF6DE239DB0 a3c_project.exe!core::future::from_generator::impl$1::poll<enum$<bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block_env$0> > [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\core\\src\\future\\mod.rs @ 91]\n00007FF6DE26EA1900007FF6DE26E840 a3c_project.exe!async_executor::impl$4::spawn::async_block$0<tuple$<>,core::future::from_generator::GenFuture<enum$<bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block_env$0> > > [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\async-executor-1.4.1\\src\\lib.rs @ 144]\n00007FF6DE239F2900007FF6DE239ED0 a3c_project.exe!core::future::from_generator::impl$1::poll<enum$<async_executor::impl$4::spawn::async_block_env$0<tuple$<>,core::future::from_generator::GenFuture<enum$<bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block_env$0> > > > > [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\core\\src\\future\\mod.rs @ 91]\n00007FF6DE2F7DA500007FF6DE2F7A70 a3c_project.exe!async_task::raw::RawTask<core::future::from_generator::GenFuture<enum$<async_executor::impl$4::spawn::async_block_env$0<tuple$<>,core::future::from_generator::GenFuture<enum$<bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block_env$0> > > > >,tuple$<>,async_executor::impl$4::schedule::closure_env$0>::run<core::future::from_generator::GenFuture<enum$<async_executor::impl$4::spawn::async_block_env$0<tuple$<>,core::future::from_generator::GenFuture<enum$<bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block_env$0> > > > >,tuple$<>,async_executor::impl$4::schedule::closure_env$0> 
[C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\async-task-4.2.0\\src\\raw.rs @ 489]\n00007FF6DE326B2100007FF6DE326AC0 a3c_project.exe!async_task::runnable::Runnable::run [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\async-task-4.2.0\\src\\runnable.rs @ 309]\n00007FF6DE31678900007FF6DE3166F0 a3c_project.exe!async_executor::Executor::try_tick [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\async-executor-1.4.1\\src\\lib.rs @ 181]\n00007FF6DE22442400007FF6DE2240F0 a3c_project.exe!bevy_tasks::task_pool::impl$2::scope::closure$0<bevy_ecs::schedule::executor_parallel::impl$1::run_systems::closure_env$1,tuple$<> > [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_tasks-0.7.0\\src\\task_pool.rs @ 233]\n00007FF6DE26FA5100007FF6DE26F990 a3c_project.exe!std::thread::local::LocalKey<async_executor::LocalExecutor>::try_with<async_executor::LocalExecutor,bevy_tasks::task_pool::impl$2::scope::closure_env$0<bevy_ecs::schedule::executor_parallel::impl$1::run_systems::closure_env$1,tuple$<> >,alloc::vec::Vec<tuple$<>,alloc::alloc::Global> > [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\std\\src\\thread\\local.rs @ 445]\n00007FF6DE26F90F00007FF6DE26F8C0 a3c_project.exe!std::thread::local::LocalKey<async_executor::LocalExecutor>::with<async_executor::LocalExecutor,bevy_tasks::task_pool::impl$2::scope::closure_env$0<bevy_ecs::schedule::executor_parallel::impl$1::run_systems::closure_env$1,tuple$<> >,alloc::vec::Vec<tuple$<>,alloc::alloc::Global> > [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\std\\src\\thread\\local.rs @ 421]\n00007FF6DE2240DA00007FF6DE224090 a3c_project.exe!bevy_tasks::task_pool::TaskPool::scope<bevy_ecs::schedule::executor_parallel::impl$1::run_systems::closure_env$1,tuple$<> > [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_tasks-0.7.0\\src\\task_pool.rs @ 238]\n00007FF6DE22DE8100007FF6DE22DCC0 
a3c_project.exe!bevy_ecs::schedule::executor_parallel::impl$1::run_systems [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_ecs-0.7.0\\src\\schedule\\executor_parallel.rs @ 129]\n00007FF6DE2A97BF00007FF6DE2A8F30 a3c_project.exe!bevy_ecs::schedule::stage::impl$1::run [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_ecs-0.7.0\\src\\schedule\\stage.rs @ 855]\n00007FF6DE269A3A00007FF6DE269980 a3c_project.exe!bevy_ecs::schedule::Schedule::run_once [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_ecs-0.7.0\\src\\schedule\\mod.rs @ 341]\n00007FF6DE269AA500007FF6DE269A40 a3c_project.exe!bevy_ecs::schedule::impl$1::run [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_ecs-0.7.0\\src\\schedule\\mod.rs @ 370]\n00007FF6DE20D83600007FF6DE20D810 a3c_project.exe!bevy_app::app::App::update [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_app-0.7.0\\src\\app.rs @ 115]\n00007FF6DE21386A00007FF6DE2137A0 a3c_project.exe!bevy_app::schedule_runner::impl$2::build::closure$0::closure$0 [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_app-0.7.0\\src\\schedule_runner.rs @ 93]\n00007FF6DE21371200007FF6DE213640 a3c_project.exe!bevy_app::schedule_runner::impl$2::build::closure$0 [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_app-0.7.0\\src\\schedule_runner.rs @ 118]\n00007FF6DE20B3CA00007FF6DE20B360 a3c_project.exe!alloc::boxed::impl$46::call<tuple$<bevy_app::app::App>,dyn$<core::ops::function::Fn<tuple$<bevy_app::app::App>,assoc$<Output,tuple$<> > > >,alloc::alloc::Global> [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\alloc\\src\\boxed.rs @ 1887]\n00007FF6DE20D9E500007FF6DE20D8F0 a3c_project.exe!bevy_app::app::App::run [C:\\Users\\user_name\\.cargo\\registry\\src\\github.com-1ecc6299db9ec823\\bevy_app-0.7.0\\src\\app.rs @ 130]\n00007FF6DDC35F7E00007FF6DDC35F20 
a3c_project.exe!a3c_project::a3c::setup [I:\\Documents\\MachineLearning\\rust\\a3c_project\\src\\a3c.rs @ 15]\n00007FF6DDB8E86300007FF6DDB8E850 a3c_project.exe!a3c_project::main [I:\\Documents\\MachineLearning\\rust\\a3c_project\\src\\main.rs @ 26]\n00007FF6DDBAC84300007FF6DDBAC830 a3c_project.exe!core::ops::function::FnOnce::call_once<enum$<core::result::Result<tuple$<>,cpython::err::PyErr>, 1, 18446744073709551615, Err> (*)(),tuple$<> > [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\core\\src\\ops\\function.rs @ 248]\n00007FF6DDBC795600007FF6DDBC7930 a3c_project.exe!std::sys_common::backtrace::__rust_begin_short_backtrace<enum$<core::result::Result<tuple$<>,cpython::err::PyErr>, 1, 18446744073709551615, Err> (*)(),enum$<core::result::Result<tuple$<>,cpython::err::PyErr>, 1, 18446744073709551615, Err> 
> [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\std\\src\\sys_common\\backtrace.rs @ 125]\n00007FF6DDBD126600007FF6DDBD1250 a3c_project.exe!std::rt::lang_start::closure$0<enum$<core::result::Result<tuple$<>,cpython::err::PyErr>, 1, 18446744073709551615, Err> > [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\std\\src\\rt.rs @ 145]\n00007FF6DE47FF4100007FF6DE47FE10 a3c_project.exe!std::rt::lang_start_internal [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library\\std\\src\\rt.rs @ 128]\n00007FF6DDBD122F00007FF6DDBD1200 a3c_project.exe!std::rt::lang_start<enum$<core::result::Result<tuple$<>,cpython::err::PyErr>, 1, 18446744073709551615, Err> > [/rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\\library\\std\\src\\rt.rs @ 144]\n00007FF6DDB8E89600007FF6DDB8E880 a3c_project.exe!main [<unknown file> @ <unknown line number>]\n00007FF6DE5992D000007FF6DE5991C4 a3c_project.exe!__scrt_common_main_seh [D:\\a\\_work\\1\\s\\src\\vctools\\crt\\vcstartup\\src\\startup\\exe_common.inl @ 288]\n00007FFE0CBF703400007FFE0CBF7020 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]\n00007FFE0EB2265100007FFE0EB22630 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]\n")', C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\tch-0.7.2\src\wrappers\scalar.rs:58:9
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library\std\src\panicking.rs:584
   1: core::panicking::panic_fmt
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library\core\src\panicking.rs:142
   2: core::result::unwrap_failed
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library\core\src\result.rs:1785
   3: enum$<core::result::Result<tuple$<>,enum$<tch::error::TchError> >, 0, 11, Err>::unwrap<tuple$<>,enum$<tch::error::TchError> >
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\library\core\src\result.rs:1078
   4: tch::wrappers::scalar::impl$2::drop
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\tch-0.7.2\src\wrappers\scalar.rs:58
   5: core::ptr::drop_in_place<tch::wrappers::scalar::Scalar>
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\library\core\src\ptr\mod.rs:486
   6: tch::wrappers::tensor::Tensor::f_clamp_<f64>
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\tch-0.7.2\src\wrappers\tensor_fallible_generated.rs:6790     
   7: tch::wrappers::tensor::Tensor::clamp_<f64>
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\tch-0.7.2\src\wrappers\tensor_generated.rs:3502
   8: tch::nn::optimizer::Optimizer::clip_grad_value
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\tch-0.7.2\src\nn\optimizer.rs:161
   9: tch::nn::optimizer::Optimizer::backward_step_clip
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\tch-0.7.2\src\nn\optimizer.rs:207
  10: a3c_project::a3c::GlobalParams::optimize
             at .\src\a3c.rs:252
  11: a3c_project::a3c::train_system::closure$0
             at .\src\a3c.rs:176
  12: bevy_ecs::query::state::QueryState<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >::for_each_unchecked_manual<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env:
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_ecs-0.7.0\src\query\state.rs:820
  13: bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >::for_each_mut<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<>,
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_ecs-0.7.0\src\system\query.rs:499
  14: a3c_project::a3c::train_system
             at .\src\a3c.rs:127
  15: core::ops::function::FnMut::call_mut<void (*)(bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::GlobalPar
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\library\core\src\ops\function.rs:164
  16: core::ops::function::impls::impl$3::call_mut<tuple$<bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<a3c_project::a3c::Glo
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\library\core\src\ops\function.rs:290
  17: bevy_ecs::system::function_system::impl$20::run::call_inner<tuple$<>,bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<coin_trade_ai_
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_ecs-0.7.0\src\system\function_system.rs:525
  18: bevy_ecs::system::function_system::impl$20::run<tuple$<>,void (*)(bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<coin_trade_ai_tra
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_ecs-0.7.0\src\system\function_system.rs:528
  19: bevy_ecs::system::function_system::impl$5::run_unsafe<tuple$<>,tuple$<>,tuple$<bevy_ecs::system::query::Query<tuple$<ref_mut$<a3c_project::a3c::Agent>,ref_mut$<a3c_project::env::Env> >,tuple$<> >,bevy_ecs::change_detection::ResMut<coin
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_ecs-0.7.0\src\system\function_system.rs:399
  20: bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block$0
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_ecs-0.7.0\src\schedule\executor_parallel.rs:196
  21: core::future::from_generator::impl$1::poll<enum$<bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block_env$0> > 
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\library\core\src\future\mod.rs:91
  22: async_executor::impl$4::spawn::async_block$0<tuple$<>,core::future::from_generator::GenFuture<enum$<bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block_env$0> > >
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\async-executor-1.4.1\src\lib.rs:144
  23: core::future::from_generator::impl$1::poll<enum$<async_executor::impl$4::spawn::async_block_env$0<tuple$<>,core::future::from_generator::GenFuture<enum$<bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block_env$0> > > > >
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\library\core\src\future\mod.rs:91
  24: async_task::raw::RawTask<core::future::from_generator::GenFuture<enum$<async_executor::impl$4::spawn::async_block_env$0<tuple$<>,core::future::from_generator::GenFuture<enum$<bevy_ecs::schedule::executor_parallel::impl$2::prepare_systems::async_block_env$
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\async-task-4.2.0\src\raw.rs:489
  25: async_task::runnable::Runnable::run
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\async-task-4.2.0\src\runnable.rs:309
  26: async_executor::Executor::try_tick
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\async-executor-1.4.1\src\lib.rs:181
  27: bevy_tasks::task_pool::impl$2::scope::closure$0<bevy_ecs::schedule::executor_parallel::impl$1::run_systems::closure_env$1,tuple$<> >  
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_tasks-0.7.0\src\task_pool.rs:233
  28: std::thread::local::LocalKey<async_executor::LocalExecutor>::try_with<async_executor::LocalExecutor,bevy_tasks::task_pool::impl$2::scope::closure_env$0<bevy_ecs::schedule::executor_parallel::impl$1::run_systems::closure_env$1,tuple$<> >,alloc::vec::Vec<tu
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\library\std\src\thread\local.rs:445
  29: std::thread::local::LocalKey<async_executor::LocalExecutor>::with<async_executor::LocalExecutor,bevy_tasks::task_pool::impl$2::scope::closure_env$0<bevy_ecs::schedule::executor_parallel::impl$1::run_systems::closure_env$1,tuple$<> >,alloc::vec::Vec<tuple$
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\library\std\src\thread\local.rs:421
  30: bevy_tasks::task_pool::TaskPool::scope<bevy_ecs::schedule::executor_parallel::impl$1::run_systems::closure_env$1,tuple$<> >
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_tasks-0.7.0\src\task_pool.rs:180
  31: bevy_ecs::schedule::executor_parallel::impl$1::run_systems
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_ecs-0.7.0\src\schedule\executor_parallel.rs:129
  32: bevy_ecs::schedule::stage::impl$1::run
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_ecs-0.7.0\src\schedule\stage.rs:852
  33: bevy_ecs::schedule::Schedule::run_once
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_ecs-0.7.0\src\schedule\mod.rs:341
  34: bevy_ecs::schedule::impl$1::run
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_ecs-0.7.0\src\schedule\mod.rs:359
  35: bevy_app::app::App::update
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_app-0.7.0\src\app.rs:114
  36: bevy_app::schedule_runner::impl$2::build::closure$0::closure$0
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_app-0.7.0\src\schedule_runner.rs:93
  37: bevy_app::schedule_runner::impl$2::build::closure$0
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_app-0.7.0\src\schedule_runner.rs:118
  38: alloc::boxed::impl$46::call<tuple$<bevy_app::app::App>,dyn$<core::ops::function::Fn<tuple$<bevy_app::app::App>,assoc$<Output,tuple$<> 
> > >,alloc::alloc::Global>
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\library\alloc\src\boxed.rs:1886
  39: bevy_app::app::App::run
             at C:\Users\user_name\.cargo\registry\src\github.com-1ecc6299db9ec823\bevy_app-0.7.0\src\app.rs:130
  40: a3c_project::a3c::setup
             at .\src\a3c.rs:15
  41: a3c_project::main
             at .\src\main.rs:23
  42: core::ops::function::FnOnce::call_once<enum$<core::result::Result<tuple$<>,cpython::err::PyErr>, 1, 18446744073709551615, Err> (*)(),tuple$<> >
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc\library\core\src\ops\function.rs:248
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

sabn0 commented 1 year ago

Did you manage to solve this?

nathanfranke commented 1 year ago

I have the same error. It only happens with backward_step_clip/backward_step_clip_norm; backward_step works fine.

nathanfranke commented 1 year ago

Minimal reproduction:

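The bug requires two things to be true here, both marked by comments in the code: the var store contains a layer that is never used to compute the loss (_lstm1), and the optimizer step uses clipping (backward_step_clip) rather than plain backward_step.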

use tch::{
    nn::{self, LinearConfig, Module, OptimizerConfig, RNNConfig},
    Device, Kind, Reduction, Tensor,
};

fn main() {
    println!("Hello, world!");

    let vs = nn::VarStore::new(Device::Cpu);
    let mut opt = nn::adam(0.9, 0.999, 0.01).build(&vs, 0.001).unwrap();

    // This is not used, but only reproducible when this exists.
    let _lstm1 = nn::lstm(
        &vs.root() / "lstm1",
        1,
        100,
        RNNConfig {
            ..Default::default()
        },
    );
    let dense1 = nn::linear(
        &vs.root() / "dense1",
        100,
        1,
        LinearConfig {
            ..Default::default()
        },
    );

    let mut input_window = Vec::with_capacity(100);

    for i in 0.. {
        let value = (i as f32 * 0.1).sin();

        if input_window.len() >= input_window.capacity() {
            let output = Tensor::from_slice(&input_window);

            let output = dense1.forward(&output.squeeze());

            let target = Tensor::from_slice(&[value]).to_kind(Kind::Float);
            let cost = output.mse_loss(&target, Reduction::Mean);
            // Without clipping works instead: opt.backward_step(&cost);
            opt.backward_step_clip(&cost, 0.1);

            break;
        }

        if input_window.len() >= input_window.capacity() {
            let _ = input_window.remove(0);
        }
        input_window.push(value);
    }
}

LaurentMazare commented 1 year ago

My guess is that when we compute the clipped value, we require gradients to be available for all the variables in the var store, so having extra variables that are not used to compute the loss makes this trip. I'll make some small adjustments so that we can get around this.
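
To make that guess concrete, here is a toy model in plain Rust. It does not use tch and is not the library's implementation. The Var type and both function names are hypothetical, and Option stands in for a gradient tensor that may be undefined. The strict version assumes every variable has a gradient and fails on the first one that does not, which is the analogue of the panic reported above. The skipping version clips only the gradients that exist.

```rust
/// Toy stand-in for a trainable variable. `grad` is `None` when the variable
/// took no part in computing the loss (the analogue of an undefined gradient).
/// Hypothetical type for illustration; not part of tch.
#[derive(Debug, Clone, PartialEq)]
struct Var {
    name: &'static str,
    grad: Option<Vec<f64>>,
}

/// Clips every gradient to [-max, max], assuming all gradients exist.
/// Returns an error for the first variable without a gradient.
/// `max` is assumed to be non-negative (f64::clamp panics otherwise).
fn clip_grad_value_strict(vars: &mut [Var], max: f64) -> Result<(), String> {
    for var in vars.iter_mut() {
        let name = var.name;
        match var.grad.as_mut() {
            Some(grad) => {
                for g in grad.iter_mut() {
                    *g = g.clamp(-max, max);
                }
            }
            None => return Err(format!("no gradient for variable `{}`", name)),
        }
    }
    Ok(())
}

/// Clips only the gradients that exist and leaves unused variables untouched.
/// `max` is assumed to be non-negative (f64::clamp panics otherwise).
fn clip_grad_value_skipping(vars: &mut [Var], max: f64) {
    for var in vars.iter_mut() {
        if let Some(grad) = var.grad.as_mut() {
            for g in grad.iter_mut() {
                *g = g.clamp(-max, max);
            }
        }
    }
}
```

With one variable holding the gradient [-2.0, 0.05, 3.0] and a second, unused variable holding None, the strict version returns an error, while the skipping version with max = 0.1 leaves [-0.1, 0.05, 0.1] and None. This only illustrates the idea; nothing in this thread shows what the actual change in tch looks like.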

LaurentMazare commented 1 year ago

I just merged #771, which should hopefully help with this. Your repro no longer fails with the change applied, but it would be good if you could double-check that the resulting gradients make sense (though this is already covered by some automated tests). Thanks a lot for providing the small repro!

LaurentMazare commented 1 year ago

Closing as there hasn't been any follow-up and hopefully the issue is now resolved.

nathanfranke commented 1 year ago

Thanks, it works.

I just noticed the issue, updated my dependency to the latest version on GitHub, and backward_step_clip now works.

tch = { git = "https://github.com/LaurentMazare/tch-rs", features = ["download-libtorch"] }