gtatiya opened 3 years ago
Thanks. It has been fixed.
Thanks. Now, I do not get that error, but I get this error:
```
INFO:root:AudioNavSMTNet ===> Freezing goal, visual, fusion encoders!
Traceback (most recent call last):
  File "ss_baselines/savi/run.py", line 144, in <module>
    main()
  File "ss_baselines/savi/run.py", line 107, in main
    trainer.train()
  File "/home/gyan/Documents/sound-spaces/ss_baselines/savi/ddppo/algo/ddppo_trainer.py", line 239, in train
    self._setup_actor_critic_agent(ppo_cfg)
  File "/home/gyan/Documents/sound-spaces/ss_baselines/savi/ddppo/algo/ddppo_trainer.py", line 147, in _setup_actor_critic_agent
    pretrained_state = torch.load(self.config.RL.DDPPO.pretrained_weights, map_location="cpu")
  File "/home/gyan/miniconda3/envs/avn/lib/python3.6/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/gyan/miniconda3/envs/avn/lib/python3.6/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/gyan/miniconda3/envs/avn/lib/python3.6/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'data/models/savi/data/ckpt.50.pth'
```
This is the list of files in `savi`:

There is no `data` folder. I ran `python ss_baselines/savi/pretraining/audiogoal_trainer.py --run-type train --model-dir data/models/savi --predict-label` before, and these are its logs: pre-train_audiogoal_savi_log.txt
Which config are you using? You need to update the weight path with the best savi model you pretrained.
I am running the code mentioned here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/README.md. I am not changing the default config file. It is not mentioned in the README that I need to update the weight path with the best savi model. There is a file called `best_val.pth`; is that the best savi model? Shall I rename `best_val.pth` to `ckpt.50` and copy it to `data/models/savi/data/`? Why is your code looking for `data/models/savi/data/ckpt.50.pth`?
I should've been more clear. The model is first pretrained with memory size 1 (`savi_pretraining.yaml`) and then trained with the full memory size (`savi.yaml`). You'll need to update `ckpt.50` to the best checkpoint from the pretraining when doing the second step.

`best_val.pth` is the best checkpoint for the second step, not for the first one. I'll update the description accordingly.
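For reference, the two stages as command lines (a sketch assembled from the commands used elsewhere in this thread; `--model-dir data/models/savi` is an example path):

```
# Stage 1: pretraining with memory size 1
python ss_baselines/savi/run.py --exp-config ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml --model-dir data/models/savi

# Stage 2: full-memory training, after pointing RL.DDPPO.pretrained_weights in savi.yaml
# at the best checkpoint from stage 1
python ss_baselines/savi/run.py --exp-config ss_baselines/savi/config/semantic_audionav/savi.yaml --model-dir data/models/savi
```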
I am not sure how to update `ckpt.50` to the best checkpoint. The `audiogoal_trainer.py` script only generates these files:

```
best_val.pth  ckpt.1.pth  ckpt.3.pth  ckpt.8.pth
ckpt.0.pth    ckpt.2.pth  ckpt.4.pth  tb
```

There is no `ckpt.50` file. Could you please explain how to update `ckpt.50` to the best checkpoint from the pretraining when doing the second step?

Also, you said `best_val.pth` is the best checkpoint for the second step. If the first step is giving me the error above and I never ran the second step (`savi.yaml`), then how come `best_val.pth` was generated?
Somehow, the config is wrong. Just setting this value (https://github.com/facebookresearch/sound-spaces/blob/1e2e8a35f4a205639778d9d77044c699637a4085/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L58) to False should work for you.
In step 1 (in https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/README.md#usage), `predict_label` is True and `predict_location` is False. But I think there is a bug: you are also optimizing location prediction even though the network is not predicting the location, only the classes. Here you compute the location-prediction loss even in the predict-label-only case: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/pretraining/audiogoal_trainer.py#L118. I think instead of `regressor_loss = regressor_criterion(predicts[:, -2:], gts[:, -2:])` it should be `regressor_loss = torch.tensor([0], device=self.device)`. Is that correct?
You mentioned in the README that we need to update the `pretrained_weights` path in `savi.yaml`. After running `audiogoal_trainer.py`, the best model is saved in `best_val.pth`, so can't we always use `best_val.pth` in `savi_pretraining.yaml` and `savi.yaml`?
For your first question, I tried predicting labels alone and predicting both. The classification accuracies for them were similar. I did end up using the label predictor trained with joint training, but I think this does not make a big difference.

That is definitely okay. It's just that the model directory might change, so you need to update the path.
In the case where only `predict_label` is `True`, the model has 21 outputs, each belonging to a class, so why are you computing a loss for the prediction of the last 2 outputs against the ground-truth location here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/pretraining/audiogoal_trainer.py#L118? In other words, why in the case of `predict_label` did you use `regressor_loss = regressor_criterion(predicts[:, -2:], gts[:, -2:])`, whereas in the case of `predict_location` you used `classifier_loss = torch.tensor([0], device=self.device)`?
That looks like a bug: when predicting the label only, the loss should consist of the classifier loss alone. I'll replace that line with `regressor_loss = torch.tensor([0], device=self.device)`.
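For concreteness, a minimal sketch of the corrected branching, not the exact code: the function name, criteria, and the joint output layout (21 class logits followed by 2 location values) are assumptions here.

```python
import torch
import torch.nn as nn

def compute_loss(predicts, gts_label, gts_location,
                 predict_label, predict_location, device):
    # Assumed layout: predicts[:, :-2] are class logits, predicts[:, -2:] the location.
    classifier_criterion = nn.CrossEntropyLoss()
    regressor_criterion = nn.MSELoss()
    if predict_label:
        classifier_loss = classifier_criterion(predicts[:, :-2], gts_label)
    else:
        classifier_loss = torch.tensor([0], device=device)
    if predict_location:
        regressor_loss = regressor_criterion(predicts[:, -2:], gts_location)
    else:
        # the fix: no location loss when only the label is predicted
        regressor_loss = torch.tensor([0], device=device)
    return classifier_loss + regressor_loss
```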
Thanks, could you please push this fix so that I can re-run the code?
I have a question regarding how to run the savi code: I have `data/models_temp/savi/best_val.pth` and `.pth` files in `data/models_temp/savi/data`. Does it automatically use `best_val.pth`, or do I need to modify something to train it? You mentioned that we need to update the `pretrained_weights` path in `savi.yaml` with the best checkpoint of pre-training. How do I find the best checkpoint of pre-training? Can you automate saving the best model, like you are doing in `audiogoal_trainer.py`?

Yes, I'll make this change and push soon.
Both the pretraining and full training steps take the best model of the previous step to continue training. Yeah, it would be good to automate this process and I'll try to add some code for that.
So, will `savi_pretraining.yaml` use `data/models/savi/best_val.pth` automatically, or do I need to make some change for that?
I realized later that the savi-pretraining model needs to be validated separately from training. And for that reason, I didn't automate this process. But as I mentioned earlier, both of these two steps only need to be trained once and should be the same for variants or ablations of the main model, and thus manually updating the weights shouldn't be too much of a cost.
I completed the first step of training the label predictor using the `audiogoal_trainer.py` script. Now I want to pre-train the savi model with `savi_pretraining.yaml`, and my question is: do I need to make any changes manually to make the savi model use the `data/models/savi/best_val.pth` generated by `audiogoal_trainer.py`? When I tried removing the checkpoints generated by `audiogoal_trainer.py` and ran the script to pre-train the savi model with `savi_pretraining.yaml`, it still seemed to train and save checkpoints in `data/models/savi/data/`, so I am not sure whether it is using `data/models/savi/best_val.pth` or not. Could you please clarify this specifically?
I'm not sure if you're aware of this, but this line of code loads the data: https://github.com/facebookresearch/sound-spaces/blob/9e3318b5e54f3246a606d2373ca60e4b7efc11f1/ss_baselines/savi/models/belief_predictor.py#L97
And since I'm specifying the pretrained weights there, you do need to update the path with your own model weights.
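In other words, the load is roughly of this form (an illustrative sketch; the path is the one discussed below, not necessarily the literal code at the linked line):

```python
import torch

# Sketch of the hard-coded weight load in belief_predictor.py (illustrative)
state_dict = torch.load(
    "data/pretrained_weights/semantic_audionav/savi/label_predictor.pth",
    map_location="cpu")
```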
Thanks for answering. I do not have `pretrained_weights` in my `data` directory, but I was still able to pre-train the savi model with `savi_pretraining.yaml`. After training, there were 400 checkpoints in the `data/savi/data` directory. Do you know why your code did not use `data/pretrained_weights/semantic_audionav/savi/label_predictor.pth`?

I think it is because in `savi_pretraining.yaml` you have set `use_belief_predictor` to `False`: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L37. Is it supposed to be `True`?

Could you please help me train the savi model in the same way as published in the paper?
@ChanganVR, could you please answer the questions I asked above?
@gtatiya yeah, I was just checking the configuration and sorry about the delay. I found it was because when I cleaned up my code, the pretraining configuration somehow got messed up. I just pushed the new config files. This should work now.
You can check this commit for the detailed changes I made: https://github.com/facebookresearch/sound-spaces/commit/721333f8c034384613bd6510a122506cb5446f38
Thank you for making changes to fix the issue, but I am still facing issues running the savi code. I am trying to run `python ss_baselines/savi/run.py --exp-config ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml --model-dir data/models/savi`, and before running that, I added `pretrained_weights: "data/models/savi/best_val.pth"` here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L63. The issue is that the code is stuck, and nothing is happening. Please see the logs attached here: pre-train_model_savi_log.txt. I also tried removing `pretrained_weights: "data/models/savi/best_val.pth"`, but the code is still stuck. Before your push, the code was training the model and the checkpoints were getting saved, but now the code is stuck. Could you please fix this issue?
Hi @gtatiya, you don't need to set `pretrained_weights` for pretraining; this is only needed for finetuning. The goal predictor by default loads this weight: https://github.com/facebookresearch/sound-spaces/blob/721333f8c034384613bd6510a122506cb5446f38/ss_baselines/savi/models/belief_predictor.py#L97.

I ran this code again locally and it worked just fine, so I'm not sure what happened for you. It could possibly freeze due to large GPU memory usage, in which case you can reduce the memory size. You could also print some statements to see where the code is getting stuck.
Thank you! Yes, this could be because of GPU memory usage. Could you please tell me how to reduce the memory size?
@gtatiya there are many parameters you could tweak to reduce the GPU memory usage, including the external memory size, hidden feature size, mini-batch size, etc., at the cost of a performance drop. I'd suggest starting with the external memory size, which in my experience affects the GPU memory a lot.
Thank you. Could you please specify how to reduce external memory size?
I changed `NUM_PROCESSES` to 4 here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L3, and the training started, but it again got stuck at the 388th checkpoint. Here are the logs: pre-train_model_savi_log.txt. Could you please help me figure out what the issue is?
Oh right, I totally forgot about the `NUM_PROCESSES` parameter. To reduce the external memory size, you just need to change this number: https://github.com/facebookresearch/sound-spaces/blob/721333f8c034384613bd6510a122506cb5446f38/ss_baselines/savi/config/semantic_audionav/savi.yaml#L36
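For example, a sketch of the kind of change meant (the surrounding YAML structure is assumed; 50 is an arbitrary illustration, not a recommended value):

```yaml
# in ss_baselines/savi/config/semantic_audionav/savi.yaml
memory_size: 50   # smaller external memory -> lower GPU memory usage
```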
Based on the log, I can't really tell what was wrong. But since the model weights are saved, are you able to resume the training?
I am running `savi_pretraining.yaml`, and it has a `memory_size` of 1:

You recently made this change, so I ran it again with `NUM_PROCESSES = 4`, but it still got stuck.

What changes do I need to make to resume training?
You don't need to make any changes. The resuming function is already implemented in the code: https://github.com/facebookresearch/sound-spaces/blob/0e87180459a5c9901bd1b17fe83405ebe57b9360/ss_baselines/savi/ddppo/algo/ddppo_trainer.py#L325-L327
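Roughly, the resume logic amounts to something like this (a hypothetical sketch, not the exact code at the link):

```python
import glob
import os
import re

import torch

def load_latest_checkpoint(checkpoint_dir):
    """Find the highest-numbered ckpt.N.pth in checkpoint_dir and load it."""
    ckpts = glob.glob(os.path.join(checkpoint_dir, "ckpt.*.pth"))
    if not ckpts:
        return None  # nothing to resume from; start fresh
    latest = max(ckpts,
                 key=lambda p: int(re.search(r"ckpt\.(\d+)\.pth", p).group(1)))
    return torch.load(latest, map_location="cpu")
```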
Thank you. I was able to complete the `savi_pretraining.yaml` step, but I am facing issues with the `savi.yaml` step. When I pass `--eval-best True`, I get this error:
```
No max index is found in data/models/savi/tb
Evaluating the best checkpoint: data/models/savi/data/ckpt.-1.pth
Traceback (most recent call last):
  File "ss_baselines/savi/run.py", line 144, in <module>
    main()
  File "ss_baselines/savi/run.py", line 95, in main
    config = get_config(args.exp_config, args.opts, args.model_dir, args.run_type, args.overwrite)
  File "/home/i21_gtatiya/projects/sound-spaces/ss_baselines/savi/config/default.py", line 264, in get_config
    config.merge_from_list(opts)
  File "/home/i21_gtatiya/miniconda3/envs/avn/lib/python3.6/site-packages/yacs/config.py", line 226, in merge_from_list
    cfg_list
  File "/home/i21_gtatiya/miniconda3/envs/avn/lib/python3.6/site-packages/yacs/config.py", line 545, in _assert_with_logging
    assert cond, msg
AssertionError: Override list has odd length: ['True', 'EVAL_CKPT_PATH_DIR', 'data/models/savi/data/ckpt.-1.pth']; it must be a list of pairs
```
pretrained_weights: "data/models/savi/data/ckpt.399.pth"
, training finishes very quickly, I am not sure if that is supossed to hapen. Here are the logs:
train_model_savi_log.txtCould you please help?
`--eval-best` is for evaluating the best checkpoint on the test set based on the validation curve. You need to set the weights here: https://github.com/facebookresearch/sound-spaces/blob/0e87180459a5c9901bd1b17fe83405ebe57b9360/ss_baselines/savi/config/semantic_audionav/savi.yaml#L60
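For illustration, the edit looks roughly like this (a sketch; the `RL.DDPPO` nesting is inferred from the traceback and command-line overrides earlier in this thread, and the path should be your own best pretraining checkpoint):

```yaml
# in ss_baselines/savi/config/semantic_audionav/savi.yaml
RL:
  DDPPO:
    pretrained: True
    pretrained_weights: "data/models/savi/data/ckpt.399.pth"  # example path
```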
For the second point, see this issue: https://github.com/facebookresearch/sound-spaces/issues/51#issuecomment-902861943
I am setting `pretrained_weights: "data/models/savi/data/ckpt.399.pth"`, but I am still getting that error. Do you know why?

Could you please specify how to find the best pre-trained checkpoint?
The best validation checkpoint should be based on the validation curve; that is, you evaluate every checkpoint on the validation set and pick the best one to continue training for the next stage.

If you're talking about the `--eval-best` error, you'll need to get that curve first.
Thank you. Do you have an automated way to evaluate every checkpoint on the validation set? There are 400 checkpoints, so it would be hard to evaluate them manually.
I used the last checkpoint (`ckpt.399.pth`) from `savi_pretraining.yaml`, trained `savi.yaml`, and evaluated on the test set using this command: `python ss_baselines/savi/run.py --run-type eval --exp-config ss_baselines/savi/config/semantic_audionav/savi.yaml EVAL_CKPT_PATH_DIR data/models/savi/data/ckpt.399.pth EVAL.SPLIT test USE_SYNC_VECENV True RL.DDPPO.pretrained False`. The results were:

```
2021-08-26 10:01:59,366 Average episode reward: 4.563702
2021-08-26 10:01:59,367 Average episode distance_to_goal: 13.225000
2021-08-26 10:01:59,367 Average episode normalized_distance_to_goal: 0.578507
2021-08-26 10:01:59,367 Average episode success: 0.113000
2021-08-26 10:01:59,367 Average episode spl: 0.081052
2021-08-26 10:01:59,367 Average episode softspl: 0.307712
2021-08-26 10:01:59,367 Average episode na: 113.229000
2021-08-26 10:01:59,367 Average episode sna: 0.043960
2021-08-26 10:01:59,367 Average episode sws: 0.089000
```
When I used the pre-trained weights you provided, the results were:

```
2021-08-26 10:53:22,413 Average episode reward: 8.952902
2021-08-26 10:53:22,414 Average episode distance_to_goal: 9.326000
2021-08-26 10:53:22,414 Average episode normalized_distance_to_goal: 0.392776
2021-08-26 10:53:22,414 Average episode success: 0.233000
2021-08-26 10:53:22,414 Average episode spl: 0.154922
2021-08-26 10:53:22,414 Average episode softspl: 0.348543
2021-08-26 10:53:22,414 Average episode na: 163.308000
2021-08-26 10:53:22,414 Average episode sna: 0.121521
2021-08-26 10:53:22,414 Average episode sws: 0.139000
```

Why do you think there is such a huge difference? Is it just because I did not use the best checkpoint from `savi_pretraining.yaml`?
Why are the results I got from the pre-trained weights you provided not the same as the results in your semantic AVN paper?
> Thank you. Do you have an automated way to evaluate every checkpoint on the validation set? There are 400 checkpoints, so it would be hard to evaluate them manually.

This function is made for this: https://github.com/facebookresearch/sound-spaces/blob/0e87180459a5c9901bd1b17fe83405ebe57b9360/ss_baselines/common/base_trainer.py#L68-L122. It monitors all the checkpoints in a specified directory and evaluates them once a new one is available. I usually run a separate process for evaluation.
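Conceptually, the monitoring loop looks something like this (an illustrative sketch with hypothetical names, not the linked implementation):

```python
import glob
import os
import time

def watch_checkpoints(ckpt_dir, poll_seconds=30):
    """Yield each new ckpt.*.pth in ckpt_dir as it appears, oldest first."""
    seen = set()
    while True:
        for ckpt in sorted(glob.glob(os.path.join(ckpt_dir, "ckpt.*.pth")),
                           key=os.path.getmtime):
            if ckpt not in seen:
                seen.add(ckpt)
                yield ckpt  # the caller evaluates this checkpoint
        time.sleep(poll_seconds)  # wait before re-scanning the directory
```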
> Why do you think there is such a huge difference? Is it just because I did not use the best checkpoint from savi_pretraining.yaml?

There could be many reasons for this. How many GPUs are you using, and how long have you trained the model? You'll get a better idea by plotting the validation curve as instructed above. Then you'll know if the model has converged.

> Why are the results I got from the pre-trained weights you provided not the same as the results in your semantic AVN paper?

Which result is not consistent?
How should I run that eval function? I used this command: `python ss_baselines/savi/run.py --run-type eval --exp-config ss_baselines/savi/config/semantic_audionav/savi.yaml EVAL_CKPT_PATH_DIR data/models/savi/data/ckpt.399.pth EVAL.SPLIT test USE_SYNC_VECENV True RL.DDPPO.pretrained False`, but I believe it only evaluated using `ckpt.399.pth`. What is the command to evaluate all the checkpoints?

I used 1 GPU, and I believe your code only uses 1 GPU at a time. Here, you load the model on only one GPU: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/ddppo/algo/ddppo.py#L77. I have 4 GPUs; how can I use all of them? I used the default settings in your config file (`NUM_UPDATES: 20000` and `CHECKPOINT_INTERVAL: 50`) and only changed `NUM_PROCESSES: 4`, and there are 400 checkpoints after training. Did you use different settings for training than the config you provided?

I think the results are similar to Table 1 (Unheard Sounds) of the paper; I asked because I got slightly lower performance with the weights you provided, so I might be doing something wrong. I used `python ss_baselines/savi/run.py --run-type eval --exp-config ss_baselines/savi/config/semantic_audionav/savi.yaml EVAL_CKPT_PATH_DIR data/pretrained_weights/semantic_audionav/savi/best_val.pth EVAL.SPLIT test USE_SYNC_VECENV True RL.DDPPO.pretrained False`. Do you think this is the correct command?
If you don't provide `EVAL_CKPT_PATH_DIR` and just run the eval mode, by default it will always evaluate all checkpoints under that directory.
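For example, a command of this shape (a sketch assembled from the commands above, with `EVAL_CKPT_PATH_DIR` omitted so the whole checkpoint directory is evaluated; `--model-dir data/models/savi` is assumed to be your training output directory):

```
python ss_baselines/savi/run.py --run-type eval --exp-config ss_baselines/savi/config/semantic_audionav/savi.yaml --model-dir data/models/savi EVAL.SPLIT val USE_SYNC_VECENV True
```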
As I mentioned earlier in another post, if you change the number of GPUs, you might also want to change `NUM_UPDATES`, as the default number of GPUs is 32. If you evaluate all the validation checkpoints, you will know whether the model has converged based on the validation performance curve.

The command is correct, and yes, this command is for the unheard-sounds setting. The performance is indeed slightly lower than when I first evaluated the model and uploaded the weights. Maybe some updates broke the consistency in some way. I'll look into that and keep you updated!
Hi @ChanganVR,
When I run `python ss_baselines/savi/run.py --exp-config ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml --model-dir data/models/savi`, I get this error:

Could you please fix this error?