-
Hello!
I found a small issue: on line 86 of the file [docs/spinningup/rl_intro2.rst](https://github.com/openai/spinningup/blob/master/docs/spinningup/rl_intro2.rst), there is a broken Google Drive …
-
To set the `policy_stable` variable, the provided code checks whether the policy has changed. If there are multiple optimal policies, the policy can keep switching between them forever even though an optimal policy is already foun…
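For illustration only (the names and array shapes below are my own, not from the provided code): one way to avoid this oscillation is to declare the policy stable whenever the old action's value ties the best action value, instead of requiring the greedy action itself to be unchanged.

```python
import numpy as np

def improve_policy(P, R, V, policy, gamma=0.9, tol=1e-9):
    """One policy-improvement sweep that stays 'stable' under ties.

    Hypothetical shapes: P[s, a] is a vector of next-state probabilities,
    R[s, a] an expected reward, V a 1-D value array, policy an int array.
    """
    n_states, n_actions = R.shape
    policy_stable = True
    for s in range(n_states):
        q = np.array([R[s, a] + gamma * P[s, a] @ V for a in range(n_actions)])
        best = q.max()
        # The old policy is still optimal if its action ties the best value,
        # even when argmax happens to pick a different (equally good) action.
        if q[policy[s]] < best - tol:
            policy_stable = False
        policy[s] = q.argmax()
    return policy, policy_stable
```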
-
- Take Figure 10.1.1 as an example: the symbols a, s, and r are not labeled there, so readers cannot connect the figure to the explanation that follows.
- The formula for $G_t$ in 10.1.2 can be written in expanded form; spelling it out would make it easier to understand (see the sketch after this list).
- As I recall, $\gamma$ lies in $(0,1]$, not $[0,1)$.
- For formulas like $\pi (a|s)=p(a_t=a|s_t=s)$, the standard notation (in my view) is $\pi (a|s)=p(A_t=a|S_t=s)$.
- The references in 10.1 are too…
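For reference, the expanded form I mean (standard notation, assuming the book's usual definitions of the rewards $R_t$ and discount $\gamma$):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
$$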
-
The Generalized PI (GPI) method is a slightly more general update scheme that subsumes both PI and VI.
As far as I know, the first application of the GPI technique to continuous-time systems is in the following paper:
* [D. Vrabie and F. L. Lewis, “Generalized Policy Iteration for continuous-ti…
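To make the relationship concrete, here is a minimal illustrative sketch (my own code, not from the paper; `P` and `R` are hypothetical transition and reward arrays): running m evaluation sweeps per improvement step behaves like VI at m = 1 and approaches PI as m grows.

```python
import numpy as np

def generalized_policy_iteration(P, R, gamma=0.95, m=5, iters=100):
    """Illustrative GPI for a finite MDP.

    Hypothetical shapes: P[s, a] is a next-state distribution,
    R[s, a] an expected reward.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(iters):
        # Partial policy evaluation: only m sweeps, not run to convergence.
        for _ in range(m):
            V = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                          for s in range(n_states)])
        # Greedy policy improvement with respect to the current V.
        policy = np.array([
            np.argmax([R[s, a] + gamma * P[s, a] @ V for a in range(n_actions)])
            for s in range(n_states)])
    return V, policy
```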
-
There's a culture in ML of authors making their textbooks available online (to supplement the traditional print editions), which is extremely beneficial to students & researchers. The following is a l…
-
If you're unfamiliar with eligibility traces, they basically unify temporal-difference learning with Monte Carlo methods -- essentially you hold a buffer in memory of an agent's experience and perform…
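A rough tabular TD(λ) sketch of that idea, assuming a hypothetical `env` with `reset()`/`step()` returning integer states (illustrative only, not from any particular codebase):

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of tabular TD(lambda) with accumulating traces.

    Assumes env.reset() -> int state and env.step(action) ->
    (next_state, reward, done); V is a 1-D float array of state values.
    """
    e = np.zeros_like(V)                      # eligibility trace per state
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        # One-step TD error for the current transition.
        delta = r + gamma * V[s_next] * (not done) - V[s]
        e[s] += 1.0                           # bump trace for the visited state
        V += alpha * delta * e                # credit all recently visited states
        e *= gamma * lam                      # decay every trace
        s = s_next
    return V
```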
-
# Introduction
Hi 👋🏻, Laura!
First of all, great job! The notebook is well organized and easy to read. \
The comments you added to explain what you did are really useful.
# Algorithm analysi…
-
I think the td_error in AC is the same as the advantage in the baseline solution; both are the reward minus the predicted value.
One difference is that the AC value network learns via TD, while the baseline solution learns d…
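To spell out the comparison (my own notation, hypothetical function names): the one-step TD error bootstraps from the predicted value of the next state, while the Monte Carlo baseline advantage uses the full return.

```python
def td_error(r, v_s, v_s_next, gamma=0.99, done=False):
    # Actor-critic target: r + gamma * V(s') - V(s), bootstrapped one step.
    return r + gamma * v_s_next * (not done) - v_s

def mc_advantage(G, v_s):
    # Baseline advantage: full discounted return minus the predicted value.
    return G - v_s
```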
-
I encountered an error while testing the Basic_Run from the codebase, as shown in the attached image: ![image](https://github.com/user-attachments/assets/0fffbb67-8630-4e6e-9a84-ec2beb614b42). I have tr…
-
Personal homepage, documenting my learning journey!
Study workflow:
> First pass: read the whole text to get an overview of the content
>
> Second pass: targeted reading, recording notes and takeaways
>
> Third pass: combine theory with practice
Moving my notes over to my blog bit by bit
## ref
- [Deep learning papers reading roadmap](https://github.com/floodsung/Deep-Learning…