AxiomaticUncertainty / Deep-Q-Learning-for-Tic-Tac-Toe

Find more info @ youtube.com/axiomaticuncertainty

Question: #1


amaynez commented 3 years ago

I see that you are using a zero vector for the rewards, and only updating the value that corresponds to the action, here:
https://github.com/AxiomaticUncertainty/Deep-Q-Learning-for-Tic-Tac-Toe/blob/c5c03fdf52b0788643337a57246af939a8e184e8/tic_tac_toe.py#L121
https://github.com/AxiomaticUncertainty/Deep-Q-Learning-for-Tic-Tac-Toe/blob/c5c03fdf52b0788643337a57246af939a8e184e8/tic_tac_toe.py#L122

I understand that in Q-learning you should use the output of the neural network (not zeroes) as the target, with only the value for the action taken replaced by the expected Q-value. Do you have any idea why this is working in your network?
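For reference, here is a minimal sketch of the target construction I have in mind, contrasted with a zero-vector target. This is not the repo's actual code; `predict_q`, the hyperparameters, and the variable names are placeholders for illustration:

```python
import numpy as np

# Hypothetical stand-in for the network's forward pass: returns one Q-value
# per action (9 board cells). In the real project this would be the model's
# prediction for the given state.
def predict_q(state):
    rng = np.random.default_rng(0)
    return rng.standard_normal(9)

gamma = 0.9               # discount factor (assumed value)
state = np.zeros(9)       # current board, flattened
next_state = np.zeros(9)  # board after the move
action = 4                # index of the cell that was played
reward = 0.0              # immediate reward for this transition
done = False              # whether the game ended on this move

# Standard Q-learning target: start from the network's own predictions,
# then overwrite ONLY the entry for the action actually taken.
target = predict_q(state).copy()
if done:
    target[action] = reward
else:
    target[action] = reward + gamma * np.max(predict_q(next_state))

# The approach I read in the repo instead starts from a zero vector, so every
# non-chosen action is trained toward 0 rather than toward the network's
# current estimate for that action.
zero_target = np.zeros(9)
zero_target[action] = reward if done else reward + gamma * np.max(predict_q(next_state))
```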

AxiomaticUncertainty commented 3 years ago

Probably because the limited action space means that we're essentially just performing a pseudo-normalization step. I made this during high school, though, so I may have simply forgotten or misread part of the Q-learning update rule.