Closed: OrilinZ closed this issue 5 months ago
I'm curious about the RLEF process; this is pioneering work in embodied agents. Is the LLM run in the simulator and tuned by RL simultaneously?

Thank you for expressing interest in our research. Our method follows a two-stage training scheme. We primarily use GPT-4 to collect training data, and the simulator simultaneously marks each subtask as a success or failure, which serves as environmental feedback. The collected training data is used for SFT in the first stage, and the feedback is used for RLEF in the second stage. So data collection (simulation) and RLEF do not happen simultaneously.
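For other readers, the two-stage scheme described in the reply can be sketched roughly as below. This is a minimal toy illustration, not the authors' actual code; all class and function names are made up, and the planner/simulator/model are stand-ins for GPT-4, the embodied simulator, and the LLM:

```python
from dataclasses import dataclass


@dataclass
class ToySimulator:
    """Stand-in for the embodied simulator that labels subtask outcomes."""
    tasks: list

    def execute(self, task, plan):
        # Toy success rule: the plan mentions the task (1 = success, 0 = failure).
        return 1 if task in plan else 0


@dataclass
class ToyModel:
    """Stand-in for the LLM; just counts the updates it receives."""
    sft_steps: int = 0
    rl_steps: int = 0

    def update_supervised(self, task, plan):
        self.sft_steps += 1

    def update_rl(self, task, plan, reward):
        self.rl_steps += 1


def collect_trajectories(simulator, planner):
    """Data collection: query the planner (GPT-4 in the paper) for plans;
    the simulator labels each one with success/failure feedback."""
    data = []
    for task in simulator.tasks:
        plan = planner(task)
        feedback = simulator.execute(task, plan)
        data.append({"task": task, "plan": plan, "feedback": feedback})
    return data


def sft_stage(model, data):
    """Stage 1: supervised fine-tuning on the collected (task, plan) pairs."""
    for ex in data:
        model.update_supervised(ex["task"], ex["plan"])
    return model


def rlef_stage(model, data):
    """Stage 2: RL with environmental feedback, reusing the stored
    success/failure labels as rewards -- no fresh simulation needed."""
    for ex in data:
        model.update_rl(ex["task"], ex["plan"], ex["feedback"])
    return model


sim = ToySimulator(tasks=["pick_cup", "open_door"])
planner = lambda task: f"plan for {task}"

data = collect_trajectories(sim, planner)   # simulation + feedback labeling
model = sft_stage(ToyModel(), data)         # stage 1: SFT only
model = rlef_stage(model, data)             # stage 2: RLEF on stored feedback
```

The point of the sketch is that `collect_trajectories` runs once up front; both training stages then consume its output, which is why simulation and RLEF never run at the same time.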