Open CAI23sbP opened 3 months ago
Hi,
Thanks so much for your questions.
For the first problem, I used a fixed horizon in the RLHF procedure; in addition, only aligned segments (two segments with the same timesteps) are used in the reward-learning procedure.
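A minimal sketch of what reward learning over aligned segments could look like, assuming a Bradley-Terry preference loss over two equal-length segments; the class and function names below are illustrative and not taken from the NaviSTAR/FAPL code:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Per-step reward predictor r_hat(s, a). Illustrative sketch only."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (T, obs_dim), act: (T, act_dim) -> per-step rewards (T, 1)
        return self.net(torch.cat([obs, act], dim=-1))

def preference_loss(reward_model, seg_a, seg_b, label):
    """Bradley-Terry preference loss over two aligned segments.
    label = 0 if segment A is preferred, 1 if segment B is preferred."""
    # Aligned segments: both must have the same number of timesteps
    assert seg_a["obs"].shape[0] == seg_b["obs"].shape[0], "segments must be aligned"
    ret_a = reward_model(seg_a["obs"], seg_a["act"]).sum()
    ret_b = reward_model(seg_b["obs"], seg_b["act"]).sum()
    logits = torch.stack([ret_a, ret_b]).unsqueeze(0)  # shape (1, 2)
    return nn.functional.cross_entropy(logits, torch.tensor([label]))
```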
Second, the configuration file of FAPL can be found in the GitHub repository.
Finally, the RLHF of NaviSTAR is a fine-tuning process, and I will share the RLHF procedure code soon.
If you have any questions, please let me know.
Thanks, Weizheng
@WzWang-Robot Hi, when will you open this?
How are you, @WzWang-Robot? I read your paper and code, and I have two questions about them.
Generally, preference-based RL (PbRL) assumes that the MDP has a fixed horizon, but I could not find anything about a fixed horizon in your paper or code. See the reference for details: it says that using PbRL with a variable horizon can be deeply misleading. OpenAI's paper also states that they used a fixed horizon (they did not send terminal info to the agent; see Appendix A, page 14). Could you explain this? (Is it actually okay to use PbRL with a variable horizon?)
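A minimal sketch of the fixed-horizon convention described above (terminal information hidden from the agent so every episode runs exactly `horizon` steps), assuming the classic `gym` step API; `FixedHorizonWrapper` and its details are illustrative, not from this repository or the OpenAI code:

```python
import gym

class FixedHorizonWrapper(gym.Wrapper):
    """Run every episode for exactly `horizon` steps and hide the environment's
    own termination signal, so all segments compared during preference learning
    have the same number of timesteps. Illustrative sketch only."""

    def __init__(self, env, horizon=500):
        super().__init__(env)
        self.horizon = horizon
        self._t = 0

    def reset(self, **kwargs):
        self._t = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._t += 1
        if done and self._t < self.horizon:
            # Environment terminated early: reset silently so the agent never
            # observes a variable-length episode.
            obs = self.env.reset()
            done = False
        if self._t >= self.horizon:
            done = True  # only the time-limit termination is exposed
        return obs, reward, done, info
```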
Could you open the full code of SAN-FAPL, and the SAN-NaviSTAR config file (batch_size, segment_size, etc.)? I want to reproduce these methods.