AlexBrians opened this issue 10 months ago
In the model training process, I noticed that raw rewards are used directly instead of returns-to-go (line 123).
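For reference, here is a minimal sketch of how returns-to-go are typically computed from per-step rewards (the function name and the undiscounted default are my own; the repo may apply a discount factor):

```python
import numpy as np

def compute_returns_to_go(rewards, gamma=1.0):
    # Return-to-go at step t is the (discounted) sum of rewards from t onward:
    # rtg[t] = rewards[t] + gamma * rtg[t + 1]
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# e.g. rewards [1.0, 2.0, 3.0] with gamma=1.0 -> returns-to-go [6.0, 5.0, 3.0]
```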
Additionally, there appears to be an inconsistency in the timesteps (line 122): they should index each individual trajectory from zero rather than run collectively across a whole batch.
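A sketch of what I mean by per-trajectory timesteps (shapes are hypothetical):

```python
import torch

batch_size, T = 4, 20

# Each trajectory should carry its own timestep index 0..T-1:
timesteps = torch.arange(T).unsqueeze(0).repeat(batch_size, 1)  # shape (B, T), every row is 0..T-1

# not a single running index over the whole batch, e.g.:
# torch.arange(batch_size * T).view(batch_size, T)
```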
In ./models/general_model_transforlight.py and ./models/general_model_DT.py,
the actions and returns-to-go are initialized as zero matrices (lines 70 and 71), which is not valid input for a DT.
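For comparison, a minimal sketch of how the original DT (Chen et al., 2021) initializes its rollout inputs, with the variable names and target return being my own placeholders: only the action slot for the step currently being predicted is a zero placeholder, while the returns-to-go channel is seeded with a target return and decremented by observed rewards, not left at zero.

```python
import torch

# Hypothetical dimensions and target return -- not taken from this repo.
state_dim, act_dim, target_return = 16, 8, 3600.0

state = torch.randn(1, 1, state_dim)                   # stand-in for the first observation
actions = torch.zeros(1, 1, act_dim)                   # zero placeholder ONLY for the action to predict
returns_to_go = torch.full((1, 1, 1), target_return)   # conditioned on a target return, not zeros
timesteps = torch.zeros(1, 1, dtype=torch.long)

# After each env step, append: the next state, a fresh zero action slot,
# returns_to_go[0, -1] - reward as the next return-to-go, and timestep t + 1.
```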