Seeking Clarification: Cumulative Rewards in Batch B1 and B2 - SDP Process (ISAC)

Feedback02 commented 6 months ago

Hello everyone,

I hope this message finds you well. I've been working on implementing the SDP (Part 1 of the paper), and I've come across a point of potential confusion regarding the cumulative rewards for transitions in batch B1.

The paper states that every transition within batch B1 should have the same cumulative reward (following the mathematical description of B_1). In the code, however, transitions appear to be sampled at random, so a single batch can contain transitions with different cumulative rewards.

Before jumping to any conclusions, I wanted to open up a discussion and seek clarification from the community and maintainers. Could someone please shed light on whether the intended behavior is to have uniform cumulative rewards for all transitions in B1, or if the current code aligns with the paper's specifications?

cbanerji commented 6 months ago

Sorry for the confusion. In SDP, the episodic reward is not the same across the entire B1; batches are sampled randomly from the buffer. The modification is made to each transition before it is saved to the buffer: an additional element is appended to each transition tuple. This additional element (a score) is the episodic return, which is the same only for transitions that belong to the same episode. See lines 403-404 of sac_isac.py.
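
To illustrate the point above, here is a minimal sketch of the idea. It is not the code from sac_isac.py: the class `ScoredReplayBuffer`, its methods, and the tuple layout `(state, action, reward, next_state, done, episodic_return)` are all hypothetical names assumed for illustration. It only shows that the score is shared within an episode, while a randomly sampled batch can mix scores from different episodes.

```python
import random
from collections import deque


class ScoredReplayBuffer:
    """Hypothetical buffer: each stored transition carries its episode's return."""

    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)

    def add_episode(self, transitions):
        # transitions: list of (state, action, reward, next_state, done)
        # Assumption: the score is the plain (undiscounted) sum of rewards.
        episodic_return = sum(t[2] for t in transitions)
        for state, action, reward, next_state, done in transitions:
            # Append the episodic return as a sixth element before storing.
            self.storage.append(
                (state, action, reward, next_state, done, episodic_return)
            )

    def sample(self, batch_size, rng=random):
        # Uniform random sampling over all stored transitions,
        # regardless of which episode they came from.
        return rng.sample(list(self.storage), batch_size)


buffer = ScoredReplayBuffer()
# Episode A: rewards 1, 2, 3. Episode B: rewards 10, 20.
buffer.add_episode([(0, 0, 1, 1, False), (1, 0, 2, 2, False), (2, 0, 3, 3, True)])
buffer.add_episode([(0, 1, 10, 1, False), (1, 1, 20, 2, True)])

# Sampling the whole buffer returns every transition, so both scores appear.
batch = buffer.sample(5, rng=random.Random(0))
scores_in_batch = {t[5] for t in batch}
```

Here all three transitions of episode A carry the same score and both transitions of episode B carry another, so a batch drawn across episodes is not uniform in its cumulative reward.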

Feedback02 commented 6 months ago

Thank you for the clarification and the fast response. Got it!

cbanerji / Sample_efficient_RL.
