Does the code in evaluation.py currently match what was used for the paper results? Did the paper results use any specific settings? I trained a phase 2 policy in my own repo that matched the training performance of a phase 2 policy in your repo, but my eval metrics are:
Mean reward: 8.81$\pm$5.61
Mean episode length: 380.12$\pm$255.50
Mean number of waypoints: 0.44$\pm$0.32
Mean edge violation: 0.16$\pm$0.44
which doesn't seem to match the paper results. Also, how can I view evaluation scores split up by terrain?
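For context, this is roughly how I was hoping to break the metrics down per terrain. The record structure, field names, and `return_per_episode` flag here are assumptions on my part, not the repo's actual API, so please correct me if evaluation.py already supports something like this:

```python
import numpy as np
from collections import defaultdict

# Hypothetical sketch: group per-episode metrics by terrain type.
# Assumes evaluation can return a list of per-episode dicts like
# {"terrain": str, "reward": float, "episode_length": int,
#  "num_waypoints": float, "edge_violation": float} -- names are guesses.

def summarize_by_terrain(episode_records):
    """Return {terrain: {metric: (mean, std)}} over the given episode records."""
    by_terrain = defaultdict(list)
    for rec in episode_records:
        by_terrain[rec["terrain"]].append(rec)

    summary = {}
    for terrain, recs in by_terrain.items():
        summary[terrain] = {}
        for metric in ("reward", "episode_length", "num_waypoints", "edge_violation"):
            values = np.array([r[metric] for r in recs], dtype=float)
            summary[terrain][metric] = (values.mean(), values.std())
    return summary

# Example usage (run_evaluation and its flag are hypothetical):
# records = run_evaluation(policy, return_per_episode=True)
# for terrain, metrics in summarize_by_terrain(records).items():
#     print(terrain, {k: f"{m:.2f}±{s:.2f}" for k, (m, s) in metrics.items()})
```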