Here, after a given number of steps (the batch size) we train the A3C agent by calculating the return. So we need to feed the states, returns, and advantages as a batch to the optimizer. But we feed only one initial hidden state for the LSTM layer for the whole batch, namely the state it had at the start of the rollout. Is that correct? Couldn't we instead collect all the hidden states as they change during the episode and feed them as a batch?
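To make the two options concrete, here is a minimal sketch in which a toy deterministic update `h' = 0.5*h + x` stands in for the LSTM cell (an assumption for illustration; a real A3C agent would use something like `torch.nn.LSTMCell`). Option A unrolls the whole rollout from the single initial state; Option B feeds every cached per-step state alongside its observation.

```python
def step(h, x):
    # Toy stand-in for one LSTM step: next hidden state from (h, x).
    return 0.5 * h + x

# Collect a rollout, caching the hidden state seen at every step.
h0 = 0.0                      # hidden state at the start of the rollout
xs = [1.0, 2.0, 3.0, 4.0]     # observations forming the batch
cached = []                   # per-step initial states collected online
h = h0
for x in xs:
    cached.append(h)
    h = step(h, x)

# Option A (feed only h0): unroll sequentially over the whole batch.
h_a = h0
unrolled = []
for x in xs:
    unrolled.append(h_a)
    h_a = step(h_a, x)

# Option B (feed cached states): each step is processed independently,
# pairing every observation with the state that was current at that step.
batched = [step(hc, x) for hc, x in zip(cached, xs)]

# Forward values agree either way.
print(unrolled == cached)                                # True
print(batched[:-1] == cached[1:] and batched[-1] == h)   # True
```

The forward pass is identical in both cases, so feeding the cached states is not wrong in that sense. The practical difference shows up in the backward pass: unrolling from the single initial state lets gradients flow through time across the whole rollout, while feeding cached states as a batch treats each step's state as a constant input, truncating backpropagation at every step.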