No tensorboard output on evaluation: SS2

facebookresearch / sound-spaces

A first-of-its-kind acoustic simulation platform for audio-visual embodied AI research. It supports training and evaluating multiple tasks and applications.

https://soundspaces.org

Creative Commons Attribution 4.0 International


No tensorboard output on evaluation: SS2 #119

Closed: kksinghal closed this issue 1 year ago

kksinghal commented 1 year ago

Hi, I trained my policy for 8M steps, which took 24 hours. Then, to evaluate the 20 checkpoints, I ran this command: python ss_baselines/av_nav/run.py --run-type eval --exp-config ss_baselines/av_nav/config/audionav/replica/val_telephone/audiogoal_depth.yaml --model-dir data/models/ss2/myreplica/dav_nav/ CONTINUOUS True

However, it has been running for more than 24 hours without finishing, and there are no TensorBoard logs. A folder was created at data/models/ss2/myreplica/tb/ and there is also a TensorBoard file, but it contains no logs.

Checkpoint dir = data/models/ss2/myreplica/dav_nav/data/

Training tensorboard logs = data/models/ss2/myreplica/dav_nav/tb/

Thanks, Kartik
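One way to confirm whether an event file like the one described above actually contains data is to list the event files and their sizes. The following is a minimal sketch using only the Python standard library. It assumes the usual TensorBoard naming convention, in which event file names contain "tfevents"; the helper name find_event_files is hypothetical and is not part of sound-spaces.

```python
import os


def find_event_files(log_dir):
    """Return a sorted list of (path, size_in_bytes) for event files under log_dir.

    Assumption: TensorBoard event files have "tfevents" in their file name.
    """
    found = []
    # os.walk yields nothing for a missing directory, so this is safe to call
    # before the evaluation has created the folder.
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            if "tfevents" in name:
                path = os.path.join(root, name)
                found.append((path, os.path.getsize(path)))
    return sorted(found)


# The directory below is the one quoted in this issue.
for path, size in find_event_files("data/models/ss2/myreplica/tb/"):
    print(f"{size:>10d} bytes  {path}")
```

A file that stays at a few dozen bytes typically holds only the header record, which would suggest that no scalars have been written yet, for example because the first checkpoint has not finished evaluating.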

kksinghal commented 1 year ago

@ChanganVR I don't know how, but after rerunning, I now get the tensorboard logs. However, I have another issue: the evaluation takes too long. It has been 2-3 days and it is still on the 9th checkpoint. The earlier checkpoints took around 20-25 minutes each, but after 3-4 checkpoints each one started taking 1-2 hours. The 9th checkpoint has now been evaluating for more than 24 hours.

kksinghal commented 1 year ago

@ChanganVR I figured out that my model is not generalizing well to the validation set, so each episode runs all the way to 500 steps, which takes around 2.5 minutes. That adds up to around 21 hours per checkpoint. Is it usually this slow for 500 steps? Is there any way I can speed up the evaluation, other than using high-speed mode and manually splitting the JSON? I suppose running on multiple nodes doesn't make sense here, so it won't help.

ChanganVR commented 1 year ago

Hi @kksinghal, a slow evaluation is mostly a sign that the model is not working well.

500 steps taking 2.5 mins seems reasonable, which translates to about 3.3 fps for audio and visual rendering combined. This is pretty close to the speed we report in the paper.
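The numbers quoted in this thread can be checked with a few lines of arithmetic. This is only a sketch: it assumes the worst case in which every validation episode runs the full 500 steps, and the function names are made up for illustration.

```python
def frames_per_second(steps, minutes):
    """Simulation throughput in steps (frames) per second."""
    return steps / (minutes * 60.0)


def episodes_implied(hours_per_checkpoint, minutes_per_episode):
    """Number of episodes implied if every episode takes the same time."""
    return hours_per_checkpoint * 60.0 / minutes_per_episode


fps = frames_per_second(500, 2.5)
episodes = episodes_implied(21, 2.5)

print(f"{fps:.2f} fps")             # 3.33 fps
print(f"{episodes:.0f} episodes")   # 504 episodes
```

Under that assumption, 21 hours per checkpoint corresponds to roughly 500 episodes that all hit the step limit, which is consistent with a policy that rarely reaches the goal on the validation set.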

To speed things up, you can try the high-speed mode, which does not harm the performance much. You could also increase the number of threads used by the acoustic simulation. The acoustic simulation speed scales pretty much linearly with the number of threads.

Lastly, as a hack, you can validate on a smaller number of episodes, e.g., 50 episodes per checkpoint. You then only need to evaluate the best validation checkpoint once on the full test set.
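A reduced validation set of this kind could be produced along the following lines. This is a hedged sketch, not the project's own tooling: it assumes the dataset is a Habitat-style dictionary with a top-level "episodes" list, and the function name subsample_episodes and the toy data are hypothetical.

```python
import random


def subsample_episodes(dataset, num_episodes, seed=0):
    """Return a shallow copy of dataset keeping num_episodes randomly chosen episodes.

    Assumption: dataset is a dict with an "episodes" list (Habitat-style layout).
    A fixed seed keeps the subset identical across checkpoints so scores stay comparable.
    """
    episodes = dataset["episodes"]
    if num_episodes >= len(episodes):
        return dict(dataset)
    rng = random.Random(seed)
    # Sort the sampled indices so the kept episodes preserve their original order.
    keep = sorted(rng.sample(range(len(episodes)), num_episodes))
    subset = dict(dataset)
    subset["episodes"] = [episodes[i] for i in keep]
    return subset


# Toy stand-in for a real validation split.
toy_dataset = {"episodes": [{"episode_id": str(i)} for i in range(500)]}
val_subset = subsample_episodes(toy_dataset, 50)

print(len(val_subset["episodes"]))  # 50
# Worst case at 2.5 minutes per episode: 50 * 2.5 = 125 minutes per checkpoint.
print(len(val_subset["episodes"]) * 2.5, "minutes worst case")
```

For an on-disk split, the same function could be wrapped with the standard gzip and json modules, assuming the file is a gzip-compressed JSON document, and the result written to a new file so the original split stays untouched.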

kksinghal commented 1 year ago

Thanks for your reply. My model had overfitted heavily to the training set of the Replica dataset. It works well with MP3D.