facebookresearch / sound-spaces

A first-of-its-kind acoustic simulation platform for audio-visual embodied AI research. It supports training and evaluating multiple tasks and applications.
https://soundspaces.org
Creative Commons Attribution 4.0 International
343 stars 55 forks

FileNotFoundError: [Errno 2] No such file or directory: 'configs/semantic_audionav/av_snav/mp3d/semantic_audiogoal_no_segmentation.yaml' #46

Open gtatiya opened 3 years ago

gtatiya commented 3 years ago

Hi @ChanganVR,

When I run python ss_baselines/savi/run.py --exp-config ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml --model-dir data/models/savi, I get this error:

Traceback (most recent call last):
  File "ss_baselines/savi/run.py", line 144, in <module>
    main()
  File "ss_baselines/savi/run.py", line 95, in main
    config = get_config(args.exp_config, args.opts, args.model_dir, args.run_type, args.overwrite)
  File "/home/i21_gtatiya/projects/sound-spaces/ss_baselines/savi/config/default.py", line 253, in get_config
    config.TASK_CONFIG = get_task_config(config_paths=config.BASE_TASK_CONFIG_PATH)
  File "/home/i21_gtatiya/projects/sound-spaces/ss_baselines/savi/config/default.py", line 313, in get_task_config
    config.merge_from_file(config_path)
  File "/home/i21_gtatiya/miniconda3/envs/avn/lib/python3.6/site-packages/yacs/config.py", line 211, in merge_from_file
    with open(cfg_filename, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'configs/semantic_audionav/av_snav/mp3d/semantic_audiogoal_no_segmentation.yaml'

Could you please fix this error?
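The path in the traceback is relative, so it is resolved against the current working directory rather than the location of run.py. A minimal sketch for checking such a path before launching a run follows; the helper name `resolve_config` is hypothetical and is not part of sound-spaces.

```python
from pathlib import Path


def resolve_config(config_path, base_dir=None):
    """Return the absolute path of a config file, or raise with a clear hint.

    Hypothetical helper: relative paths are resolved against base_dir
    (default: the current working directory), mirroring how open() treats them.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    candidate = Path(config_path)
    if not candidate.is_absolute():
        candidate = base / candidate
    if not candidate.is_file():
        raise FileNotFoundError(
            f"{config_path!r} not found relative to {base}; "
            "check the working directory and that the file exists in the checkout"
        )
    return candidate.resolve()
```

If the check fails from the repository root, the file is probably missing from the checkout itself, as it was in this case.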

ChanganVR commented 3 years ago

Thanks. It has been fixed.

gtatiya commented 3 years ago

Thanks. Now, I do not get that error, but I get this error:

INFO:root:AudioNavSMTNet ===> Freezing goal, visual, fusion encoders!
Traceback (most recent call last):
  File "ss_baselines/savi/run.py", line 144, in <module>
    main()
  File "ss_baselines/savi/run.py", line 107, in main
    trainer.train()
  File "/home/gyan/Documents/sound-spaces/ss_baselines/savi/ddppo/algo/ddppo_trainer.py", line 239, in train
    self._setup_actor_critic_agent(ppo_cfg)
  File "/home/gyan/Documents/sound-spaces/ss_baselines/savi/ddppo/algo/ddppo_trainer.py", line 147, in _setup_actor_critic_agent
    pretrained_state = torch.load(self.config.RL.DDPPO.pretrained_weights, map_location="cpu")
  File "/home/gyan/miniconda3/envs/avn/lib/python3.6/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/gyan/miniconda3/envs/avn/lib/python3.6/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/gyan/miniconda3/envs/avn/lib/python3.6/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'data/models/savi/data/ckpt.50.pth'

This is the list of files in savi: image

There is no data folder. I ran python ss_baselines/savi/pretraining/audiogoal_trainer.py --run-type train --model-dir data/models/savi --predict-label before, and these are its logs: pre-train_audiogoal_savi_log.txt

ChanganVR commented 3 years ago

Which config are you using? You need to update the weight path to point to the best savi model you pretrained.

gtatiya commented 3 years ago

I am running the code mentioned here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/README.md. I am not changing the default config file. The README does not mention that I need to update the weight path with the best savi model. There is a file called best_val.pth; is that the best savi model? Should I rename best_val.pth to ckpt.50 and copy it to data/models/savi/data/? Why is your code looking for data/models/savi/data/ckpt.50.pth?

ChanganVR commented 3 years ago

I should've been clearer. The model is first pretrained with memory size 1 (savi_pretraining.yaml) and then trained with the full memory size (savi.yaml). You'll need to replace ckpt.50 with the best checkpoint from the pretraining when doing the second step.

best_val.pth is the best checkpoint for the second step, not for the first one. I'll update the description accordingly.

gtatiya commented 3 years ago

I am not sure how to update ckpt.50 to the best checkpoint. The audiogoal_trainer.py script only generates these files:

best_val.pth  ckpt.1.pth  ckpt.3.pth  ckpt.8.pth
ckpt.0.pth    ckpt.2.pth  ckpt.4.pth  tb

There is no ckpt.50 file. Could you please explain how to update ckpt.50 to the best checkpoint from the pretraining when doing second step?
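To see which checkpoints a run actually produced, the ckpt.N.pth files can be listed in numeric order (a plain alphabetical sort would put ckpt.10.pth before ckpt.2.pth). This is a minimal sketch; `list_checkpoints` is a hypothetical helper that only assumes the ckpt.N.pth naming shown in the listing above.

```python
import re
from pathlib import Path

CKPT_PATTERN = re.compile(r"^ckpt\.(\d+)\.pth$")


def list_checkpoints(model_dir):
    """Return (index, path) pairs for ckpt.N.pth files, sorted by N."""
    found = []
    for path in Path(model_dir).iterdir():
        match = CKPT_PATTERN.match(path.name)
        if match:  # skips best_val.pth, tb, and anything else
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda pair: pair[0])
```

The last element of the returned list is the most recent checkpoint, which is not necessarily the best one.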

Also, you said best_val.pth is the best checkpoint for the second step. The first step is giving me the error above, so I never ran the second step (savi.yaml). How come best_val.pth was generated?

ChanganVR commented 3 years ago

Somehow, the config is wrong. Just setting this value (https://github.com/facebookresearch/sound-spaces/blob/1e2e8a35f4a205639778d9d77044c699637a4085/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L58) to False should work for you.

gtatiya commented 3 years ago
ChanganVR commented 3 years ago

For your first question, I tried predicting labels alone and predicting both. The classification accuracies for them were similar. I did end up using the label predictor trained with joint training, but I think this does not make a big difference.

That is definitely okay. It's just that the model directory might change so you need to update the path.

gtatiya commented 3 years ago

In the case where only predict_label is True, the model has 21 outputs, each belonging to a class. Why, then, are you computing a loss on the last 2 outputs against the ground-truth location here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/pretraining/audiogoal_trainer.py#L118? In other words, why does the predict_label case use regressor_loss = regressor_criterion(predicts[:, -2:], gts[:, -2:]), whereas the predict_location case uses classifier_loss = torch.tensor([0], device=self.device)?

ChanganVR commented 3 years ago

That looks like a bug. When predicting the label only, the loss should consist of the classifier loss alone. I'll replace that line with regressor_loss = torch.tensor([0], device=self.device)
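The intended behavior can be sketched without PyTorch: each loss term contributes only when its prediction head is enabled. The function name and its flag arguments are illustrative and do not mirror the actual code in audiogoal_trainer.py.

```python
def total_loss(classifier_loss, regressor_loss, predict_label, predict_location):
    """Sum only the loss terms whose prediction head is enabled.

    Illustrative sketch: losses are plain floats here, whereas the real
    trainer works with tensors.
    """
    if not (predict_label or predict_location):
        raise ValueError("at least one of predict_label / predict_location must be set")
    loss = 0.0
    if predict_label:
        loss += classifier_loss  # classification over the label outputs
    if predict_location:
        loss += regressor_loss  # regression over the last two outputs
    return loss
```

With predict_label alone, the regressor term is dropped entirely, which matches the fix described above.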

gtatiya commented 3 years ago

Thanks, could you please push this change so that I can re-run the code?

I have a question regarding how to run savi code:

ChanganVR commented 3 years ago

Yes, I'll make this change and push soon.

Both the pretraining and full training steps take the best model of the previous step to continue training. Yeah, it would be good to automate this process and I'll try to add some code for that.

gtatiya commented 3 years ago

So, will savi_pretraining.yaml use data/models/savi/best_val.pth automatically, or do I need to make some change for that?

ChanganVR commented 3 years ago

I realized later that the savi-pretraining model needs to be validated separately from training. And for that reason, I didn't automate this process. But as I mentioned earlier, both of these two steps only need to be trained once and should be the same for variants or ablations of the main model, and thus manually updating the weights shouldn't be too much of a cost.

gtatiya commented 3 years ago

I completed the first step of training the label predictor using audiogoal_trainer.py script. Now, I want to pre-train the savi model with savi_pretraining.yaml, and my question is do I need to make any changes manually to make the savi model use data/models/savi/best_val.pth generated by audiogoal_trainer.py? When I tried removing checkpoints generated by audiogoal_trainer.py, and ran the script to pre-train the savi model with savi_pretraining.yaml, it still seems to train and save checkpoints in data/models/savi/data/, so I am not sure if it is using data/models/savi/best_val.pth or not. Could you please specifically clarify this issue?

ChanganVR commented 3 years ago

I'm not sure if you're aware of this, but this line of code loads the data: https://github.com/facebookresearch/sound-spaces/blob/9e3318b5e54f3246a606d2373ca60e4b7efc11f1/ss_baselines/savi/models/belief_predictor.py#L97

Since I'm specifying the pretrained weights there, you do need to update the path with your own model weights.

gtatiya commented 3 years ago

Thanks for answering. I do not have pretrained_weights in my data directory, but still I was able to pre-train savi model with savi_pretraining.yaml. After training, there were 400 checkpoints in data/savi/data directory. Do you know why your code did not use data/pretrained_weights/semantic_audionav/savi/label_predictor.pth?

I think it is because use_belief_predictor is set to False in savi_pretraining.yaml: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L37. Is it supposed to be True?

Could you please help me train the savi model in the same way published in the paper?

gtatiya commented 3 years ago

@ChanganVR, Could you please answer the questions I asked above?

ChanganVR commented 3 years ago

@gtatiya yeah, I was just checking the configuration and sorry about the delay. I found it was because when I cleaned up my code, the pretraining configuration somehow got messed up. I just pushed the new config files. This should work now.

ChanganVR commented 3 years ago

You can check this commit for the detailed changes I made: https://github.com/facebookresearch/sound-spaces/commit/721333f8c034384613bd6510a122506cb5446f38

gtatiya commented 3 years ago

Thank you for making changes to fix the issue, but I am still having trouble running the savi code. I am trying to run python ss_baselines/savi/run.py --exp-config ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml --model-dir data/models/savi. Before running that, I added pretrained_weights: "data/models/savi/best_val.pth" here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L63. The code gets stuck and nothing happens. Please see the logs attached here: pre-train_model_savi_log.txt. I also tried removing pretrained_weights: "data/models/savi/best_val.pth", but the code still gets stuck. Before your push, the code was training the model and saving checkpoints. Could you please fix this issue?

ChanganVR commented 3 years ago

Hi @gtatiya you don't need to set pretrained_weights for pretraining, this is only needed for finetuning. The goal predictor by default loads this weight: https://github.com/facebookresearch/sound-spaces/blob/721333f8c034384613bd6510a122506cb5446f38/ss_baselines/savi/models/belief_predictor.py#L97.

I ran this code again locally and it worked just fine. I'm not sure what happened on your side. It could be freezing due to high GPU memory usage, in which case you can reduce the memory size. You could also add some print statements to see where the code is getting stuck.

gtatiya commented 3 years ago

Thank you! Yes, this could be because of GPU memory usage. Could you please tell me how to reduce the memory size?

ChanganVR commented 3 years ago

@gtatiya there are many parameters you could tweak to reduce the GPU memory usage, including external memory size, hidden feature size, mini-batch size, etc., at the cost of a performance drop. I'd suggest starting by reducing the external memory size, which in my experience affects GPU memory a lot.

gtatiya commented 3 years ago

Thank you. Could you please specify how to reduce external memory size?

I changed NUM_PROCESSES to 4 here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L3, and the training started, but it got stuck again at the 388th checkpoint. Here are the logs: pre-train_model_savi_log.txt. Could you please help me figure out what the issue is?

ChanganVR commented 3 years ago

Oh right, I totally forgot about the NUM_PROCESSES parameter. To reduce the external memory size, you just need to change this number: https://github.com/facebookresearch/sound-spaces/blob/721333f8c034384613bd6510a122506cb5446f38/ss_baselines/savi/config/semantic_audionav/savi.yaml#L36

Based on the log, I can't really tell what was wrong. But since the model weights are saved, are you able to resume the training?

gtatiya commented 3 years ago

I am running savi_pretraining.yaml, and it has memory_size of 1:

https://github.com/facebookresearch/sound-spaces/blob/0e87180459a5c9901bd1b17fe83405ebe57b9360/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L40

You recently made this change, so I ran it again with NUM_PROCESSES = 4, but it still got stuck.

What changes do I need to make to resume training?

ChanganVR commented 3 years ago

You don't need to make any changes. The resuming function is already implemented in the code: https://github.com/facebookresearch/sound-spaces/blob/0e87180459a5c9901bd1b17fe83405ebe57b9360/ss_baselines/savi/ddppo/algo/ddppo_trainer.py#L325-L327

gtatiya commented 3 years ago

Thank you. I was able to complete the savi_pretraining.yaml step. However, I am facing issues with the savi.yaml step:

Could you please help?

ChanganVR commented 3 years ago

--eval-best is for evaluating the best checkpoint on the test set based on the validation curve.

You need to set the weights in here: https://github.com/facebookresearch/sound-spaces/blob/0e87180459a5c9901bd1b17fe83405ebe57b9360/ss_baselines/savi/config/semantic_audionav/savi.yaml#L60

For the second point, see this issue: https://github.com/facebookresearch/sound-spaces/issues/51#issuecomment-902861943

gtatiya commented 3 years ago

I am setting pretrained_weights: "data/models/savi/data/ckpt.399.pth", but still I am getting that error. Do you know why?

Could you please specify how to find the best pre-trained checkpoint?

ChanganVR commented 3 years ago

The best validation checkpoint should be based on the validation curve, that is, you evaluate every checkpoint on the validation set and pick the best one to continue training for the next stage.
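Once every checkpoint has a validation score, picking the best one is a single arg-max over the curve. In this minimal sketch, `best_checkpoint` is a hypothetical helper and the metric is whatever was measured on the validation split (for example SPL or success rate).

```python
def best_checkpoint(val_curve, higher_is_better=True):
    """Return the checkpoint index with the best validation metric.

    val_curve maps checkpoint index -> metric value measured on the
    validation split.
    """
    if not val_curve:
        raise ValueError("validation curve is empty")
    pick = max if higher_is_better else min
    # Iterating indices in ascending order breaks ties in favor of the
    # earliest checkpoint.
    return pick(sorted(val_curve), key=lambda idx: val_curve[idx])
```

The returned index N then identifies the ckpt.N.pth file to set as pretrained_weights for the next stage.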

If you're talking about the --eval-best error, you'll need to get that curve first.

gtatiya commented 3 years ago

Thank you. Do you have an automated way to evaluate every checkpoint on the validation set? There are 400 checkpoints, so it would be hard to evaluate them manually.

I used the last checkpoint (ckpt.399.pth) from savi_pretraining.yaml and trained savi.yaml, and evaluated on test set using this command: python ss_baselines/savi/run.py --run-type eval --exp-config ss_baselines/savi/config/semantic_audionav/savi.yaml EVAL_CKPT_PATH_DIR data/models/savi/data/ckpt.399.pth EVAL.SPLIT test USE_SYNC_VECENV True RL.DDPPO.pretrained False, and the results were:

2021-08-26 10:01:59,366 Average episode reward: 4.563702
2021-08-26 10:01:59,367 Average episode distance_to_goal: 13.225000
2021-08-26 10:01:59,367 Average episode normalized_distance_to_goal: 0.578507
2021-08-26 10:01:59,367 Average episode success: 0.113000
2021-08-26 10:01:59,367 Average episode spl: 0.081052
2021-08-26 10:01:59,367 Average episode softspl: 0.307712
2021-08-26 10:01:59,367 Average episode na: 113.229000
2021-08-26 10:01:59,367 Average episode sna: 0.043960
2021-08-26 10:01:59,367 Average episode sws: 0.089000

When I used the pre-trained weights you provide, the results were:

2021-08-26 10:53:22,413 Average episode reward: 8.952902
2021-08-26 10:53:22,414 Average episode distance_to_goal: 9.326000
2021-08-26 10:53:22,414 Average episode normalized_distance_to_goal: 0.392776
2021-08-26 10:53:22,414 Average episode success: 0.233000
2021-08-26 10:53:22,414 Average episode spl: 0.154922
2021-08-26 10:53:22,414 Average episode softspl: 0.348543
2021-08-26 10:53:22,414 Average episode na: 163.308000
2021-08-26 10:53:22,414 Average episode sna: 0.121521
2021-08-26 10:53:22,414 Average episode sws: 0.139000
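The two result blocks can be compared metric by metric with a small parser. This is a sketch that only assumes the "Average episode <name>: <value>" line format shown in the logs above; the function names are hypothetical, and only two metrics from each log are included for brevity.

```python
import re

METRIC_LINE = re.compile(r"Average episode (\w+): ([-+]?\d+(?:\.\d+)?)")


def parse_metrics(log_text):
    """Extract {metric_name: value} from 'Average episode <name>: <value>' lines."""
    return {name: float(value) for name, value in METRIC_LINE.findall(log_text)}


def compare(own, reference):
    """Return {metric: reference - own} for metrics present in both runs."""
    return {name: reference[name] - own[name] for name in own if name in reference}


own_log = """
2021-08-26 10:01:59,367 Average episode success: 0.113000
2021-08-26 10:01:59,367 Average episode spl: 0.081052
"""
provided_log = """
2021-08-26 10:53:22,414 Average episode success: 0.233000
2021-08-26 10:53:22,414 Average episode spl: 0.154922
"""

for name, delta in compare(parse_metrics(own_log), parse_metrics(provided_log)).items():
    print(f"{name}: {delta:+.6f}")
```

On these lines, the provided weights lead by about 0.12 in success and about 0.07 in SPL.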

Why do you think there is a huge difference? Is it just because I did not use the best checkpoint from savi_pretraining.yaml?

Why are the results I got from the pre-trained weights you provided not the same as the results in your semantic AVN paper?

ChanganVR commented 3 years ago

> Thank you. Do you have an automated way to evaluate every checkpoint on the validation set? There are 400 checkpoints, so it would be hard to evaluate them manually.

This function is made for this. https://github.com/facebookresearch/sound-spaces/blob/0e87180459a5c9901bd1b17fe83405ebe57b9360/ss_baselines/common/base_trainer.py#L68-L122 It monitors all the checkpoints in a specified directory and evaluates them once a new one is available. I usually run a separate process for evaluation.
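The general pattern of such a monitor is a polling loop that evaluates each new checkpoint exactly once. The following is a simplified sketch of that pattern, not the code in base_trainer.py; `watch_and_evaluate` and its `evaluate` callback are hypothetical stand-ins for the real evaluation routine.

```python
import re
import time
from pathlib import Path

CKPT_PATTERN = re.compile(r"^ckpt\.(\d+)\.pth$")


def watch_and_evaluate(ckpt_dir, evaluate, poll_seconds=2.0, max_polls=None):
    """Evaluate each ckpt.N.pth in ckpt_dir once, in order of N, as files appear.

    evaluate is a caller-supplied callable taking (index, path).
    max_polls bounds the number of directory scans (None = poll forever).
    """
    seen = set()
    results = {}
    polls = 0
    while max_polls is None or polls < max_polls:
        pending = []
        for path in Path(ckpt_dir).iterdir():
            match = CKPT_PATTERN.match(path.name)
            if match and int(match.group(1)) not in seen:
                pending.append((int(match.group(1)), path))
        for index, path in sorted(pending, key=lambda pair: pair[0]):
            results[index] = evaluate(index, path)
            seen.add(index)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)  # wait for the trainer to write more files
    return results
```

Running this in a separate process from training lets the validation curve fill in while checkpoints are still being written.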

> Why do you think there is a huge difference? Is it just because I did not use the best checkpoint from savi_pretraining.yaml?

There could be many reasons for this. How many GPUs are you using and how long have you trained the model? You'll get a better idea by plotting the validation curve as instructed above. Then you'll know if the model has converged.

> Why are the results I got from the pre-trained weights you provided not the same as the results in your semantic AVN paper?

Which result is not consistent?

gtatiya commented 3 years ago
ChanganVR commented 3 years ago

If you don't provide EVAL_CKPT_PATH_DIR and just run the eval mode, by default it will always evaluate all checkpoints under that directory.

As I mentioned earlier in another post, if you change the number of GPUs, you might also want to change NUM_UPDATES, as the default number of GPUs is 32. If you can evaluate all the validation checkpoints, you will know whether the model has converged based on the validation performance curve.
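One plausible way to adjust NUM_UPDATES is to keep the product of updates and GPUs constant, so that the total amount of experience stays roughly the same. This scaling rule is an assumption and is not confirmed in this thread; `scaled_num_updates` is a hypothetical helper, and the default NUM_UPDATES value has to be read from the config.

```python
import math


def scaled_num_updates(default_num_updates, num_gpus, default_num_gpus=32):
    """Scale NUM_UPDATES so that num_updates * num_gpus stays constant.

    Assumption (not confirmed in this thread): each update consumes
    experience from every GPU worker, so fewer GPUs need proportionally
    more updates to see the same amount of data.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return math.ceil(default_num_updates * default_num_gpus / num_gpus)
```

Under this rule, going from 32 GPUs to 8 multiplies the number of updates by four. Convergence should still be checked against the validation curve.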

The command is correct and yes, this command is for the unheard sounds setting. The performance is indeed slightly lower than when I first evaluated the model and uploaded the weights. Maybe some updates broke the consistency in some way. I'll look into that and keep you updated!