Closed AlexFridman closed 4 years ago
Hey, I've actually had hanging issues too, and I'm not sure what's causing them because they seem to happen randomly (i.e. not every single time). I'm pretty sure double slashes don't affect anything, but to be safe, I've gone and replaced all my path-forming strings with os.path.join, in this repo, record-keeper, and easy-module-attribute-getter. Hopefully the changes didn't break anything!
If it still hangs, you could try using my script_wrapper.sh. It's a hacky solution but works well for me. Basically it checks the folder in which your experiment should be saving stuff. If there have been no updates to that folder or its subfolders in X minutes, then it kills the process, and starts a new one. (The run_bayesian_optimization.py script always resumes from the latest possible iteration). Here's how you can use it if you're interested:
./script_wrapper.sh <name of your script> <experiment_name>
In your case, let's say I pasted your bash command into "bayesian_script.sh". Then I would run:
./script_wrapper.sh bayesian_script.sh experiments_opt
This will kill the process if there have been no changes in "/home/blah/experiments_opt" and its subfolders, in X minutes (where you set X in process_checker.sh).
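The mtime-based staleness check that script_wrapper.sh relies on can be sketched in Python (hypothetical function names; the actual wrapper and process_checker.sh are shell scripts, this just illustrates the logic of "no changes under the folder in X minutes"):

```python
import os
import time

def latest_mtime(root):
    """Most recent modification time of root or anything under it."""
    newest = os.path.getmtime(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            newest = max(newest, os.path.getmtime(os.path.join(dirpath, name)))
    return newest

def is_stale(root, max_idle_minutes):
    """True if nothing under root has changed in the last max_idle_minutes."""
    return time.time() - latest_mtime(root) > max_idle_minutes * 60
```

A watchdog loop would call `is_stale` periodically and kill/restart the training process when it returns True.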
If a lot of hanging occurs, then you'll probably end up with extra experiment folders in experiments_opt. For example, if you run 10 iterations of bayesian optimization, you might end up with, say, 13 experiment folders. This shouldn't affect the final "best_parameters" yaml file, because the script keeps track of the folders that actually finished without hanging.
The only real downside to using the script_wrapper is that you have to use kill -9 to stop it.
Also FYI, with the latest version of easy-module-attribute-getter, you can use the `~OVERRIDE~` flag within nested dictionaries. For example, if you want to change the optimizer for the trunk model only and not the embedder, you can do:
python run.py \
--experiment_name test2 \
--optimizers {trunk_optimizer~OVERRIDE~: {RMSprop: {lr: 0.01}}}
(Previously, I was only using `~OVERRIDE~` at the top level of nested dictionaries, so in the above example you used to have to redefine the embedder optimizer.)
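The merging behavior described above can be illustrated with a toy recursive merge (a hypothetical simplification, not the actual easy-module-attribute-getter code): a key suffixed with `~OVERRIDE~` replaces the existing sub-dictionary wholesale, while other keys are merged recursively.

```python
OVERRIDE = "~OVERRIDE~"

def merge(base, new):
    """Recursively merge `new` into a copy of `base`.
    A key suffixed with ~OVERRIDE~ replaces the base value entirely
    instead of being merged into it."""
    out = dict(base)
    for key, value in new.items():
        if key.endswith(OVERRIDE):
            out[key[: -len(OVERRIDE)]] = value  # wholesale replacement
        elif isinstance(out.get(key), dict) and isinstance(value, dict):
            out[key] = merge(out[key], value)  # merge nested dicts
        else:
            out[key] = value
    return out
```

With this behavior, overriding `trunk_optimizer` in a nested config leaves `embedder_optimizer` untouched, which is why you no longer have to redefine it.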
Hope this helps!
Thank you, @KevinMusgrave, for the quick response! I'll try it shortly.
And thank you once again for the pytorch-metric-learning
package. It's well written and easy to use.
@AlexFridman I may have fixed the hanging issue. I think it was caused by the pytorch dataloader processes not being killed properly, so in the latest commit, I'm manually deleting the tester and trainer objects here:
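For anyone hitting similar hangs in their own training loops, the general pattern (a sketch under assumptions, not the exact commit; `finish_experiment` and the `state` dict are hypothetical) is to drop all references to the objects that hold DataLoader iterators and force a garbage collection, so the worker processes can exit:

```python
import gc

def finish_experiment(state):
    """Drop references to the trainer and tester so their DataLoader
    worker processes can be reaped, then force a collection.
    `state` is a hypothetical dict holding experiment objects."""
    state.pop("trainer", None)
    state.pop("tester", None)
    gc.collect()  # encourage immediate cleanup of dataloader workers
```

If the trainer/tester are the only remaining owners of the dataloaders, this lets their worker processes shut down instead of hanging around.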
Thank you, @KevinMusgrave! I suspected it was related to multiprocessing stuff. It's typical for code to just freeze without any stack trace.
One more question about the benchmarker:
We have a huge dataset and manually define the partition scheme in a dataset class (e.g. only 1% for validation), because the evaluation stage fails when it tries to put the whole dataset into faiss-gpu.
As I see it, there's a call to the `eval_model` method during training without the `splits_to_exclude` parameter being specified. Therefore (as I understand it), it tries to run evaluation on our huge dataset even if we've specified `eval_reference_set: compared_to_self` and `splits_to_eval=val`.
Is it possible to somehow overcome this issue without decreasing the train size?
@AlexFridman Thanks for pointing out this issue. Now in the latest commit, you should be able to do `--splits_to_eval val`, and during the validation step of training, it will compute embeddings and accuracy only for the val split. (If you don't specify splits_to_eval, the default is to compute accuracy for all splits, excluding the test set.)
Thank you very much, @KevinMusgrave!
Hello, @KevinMusgrave! It looks like we have the same issue in the bayes_opt script. Link. It should be `val`, I guess, but I'm not sure. Could you please take a look? Thanks!
Hmm, the assumption in the bayesian optimization script is that you'll do optimization based on the validation set(s), and then test the best parameters on the test set. (The splits_to_eval variable appears in the function "test_best_model", which is called at the very end of the script.)
My colleague told me that when she runs the BO script, it also runs evaluation on the train part of the data. Maybe there's another reason for it.
1st reason: we don't have `test` in our splitting...
2nd reason: in run.py, `splits_to_eval` has a default value of `['val']`; in the BO script it does not have a default value for the trial runs, and that's why it uses all splits for evaluation.
Should we set `splits_to_eval` in the BO script to `['val']` during the HP search?
Re: 1st reason, I assume you're setting "special_split_scheme_name" to "predefined", since you're defining the train/val split yourself. If you're not setting that flag, then train/val/test splits will be created as described here: https://github.com/KevinMusgrave/powerful-benchmarker#split-schemes-and-cross-validation
Re: 2nd reason, actually the current default value for splits_to_eval in run.py is None, which means use all non-test splits. The bayes_opt script uses the same default value as run.py, so you're right, you'll have to set splits_to_eval to val.
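In other words, the default resolution behaves roughly like this sketch (hypothetical function name, based on the behavior described above):

```python
def resolve_splits_to_eval(splits_to_eval, all_split_names):
    """None means: evaluate every split except the test set."""
    if splits_to_eval is None:
        return [s for s in all_split_names if s != "test"]
    return list(splits_to_eval)
```

So with the default of None and a train/val/test scheme, both train and val get evaluated; passing `val` explicitly restricts evaluation to the val split.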
But don't you think this parameter should be set to `val` by default (during the HP search) in the BO script, as well as in the run.py script?
I don't think so. For my own purposes, I like to check accuracy on both the train and val set. In other words, during training, I get to see the accuracy on the train and val set, and then at the very end of bayesian optimization, I see the performance of the best model on the test set. (The best model is chosen based on val set accuracy.)
Got it. Thank you, Kevin!
Hi, @KevinMusgrave!
Two issues found:
1. With the `predefined` scheme, we're getting the error `BaseApiParser does not have attribute meta_record_keeper` here, because here `meta_record_keeper` was not set, since `self.split_manager.split_scheme_names` contains only `predefined`.
2. In a test run (num_trials=3, num_epochs=4, save_interval=1), I see the following saved records (inside meta_eval):
defaultdict(<class 'list'>, {'epoch': [-1, 0], 'NMI_level0': [0.6870503826879762, 0.6299021740129578], 'precision_at_1_level0': [0.9142857142857143, 0.8285714285714286], 'r_precision_level0': [0.7428571428571429, 0.7571428571428571], 'mean_average_r_precision_level0': [0.7136904761904762, 0.7041666666666666], 'best_epoch': [-1, -1], 'best_accuracy': [0.7136904761904762, 0.7136904761904762]})
Could you please explain how `epoch` and `best_epoch` are formed? Why `-1`, and why only 2 records?
Regards, Alex
If you're using "predefined", then cross validation isn't supported. (Sorry, I probably should have mentioned that earlier. Unfortunately I haven't put this functionality in yet.) So for example if your predefined split is train/val/test, then there is only 1 validation set. But all the "meta" stuff is for collecting models from multiple cross-validation folds. Since there is only one validation set when you use "predefined", the "meta" stuff is not applicable, so you should set the "meta_testing_method" flag to null. I think it should work then. You'll still get optimization_plot.html and best_parameters.yaml, and the test set performance will be in
The "epoch" key in that meta log is a misnomer. It should be something like "evaluation iteration". Every time you run meta evaluation, it will just append an incremented value to that list. Most likely, you'll run meta evaluation once, so the list will be [-1, 0], where -1 refers to the untrained model, and 0 refers to the most recent evaluation. (The most recent evaluation always uses the trunk_best and embedder_best models saved in each sub-experiment.) Anyway, at the moment, if you're using predefined, then the meta eval stuff won't be applicable.
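The appending behavior described above can be sketched like this (a hypothetical simplification of what record-keeper does, not its actual code): the first meta evaluation is logged under -1 for the untrained model, and each later one counts up from 0.

```python
from collections import defaultdict

def append_meta_eval(records, accuracies):
    """Append one meta-evaluation result. The first entry is logged
    under 'epoch' -1 (untrained model); later ones count up from 0."""
    next_id = -1 if not records["epoch"] else records["epoch"][-1] + 1
    records["epoch"].append(next_id)
    for name, value in accuracies.items():
        records[name].append(value)

records = defaultdict(list)
append_meta_eval(records, {"precision_at_1_level0": 0.91})  # untrained model
append_meta_eval(records, {"precision_at_1_level0": 0.83})  # first real eval
```

After two evaluations, `records["epoch"]` is `[-1, 0]`, matching the two entries in the log above.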
@AlexFridman Were you able to get it working with "predefined"?
Hi! I ran HP optimization (on my server), and after the 1st trial it freezes. The only thing I changed: I added my own dataset class and changed the corresponding configuration parameter. Previously, I ran run.py without any problems.
The double slash in the log below looks suspicious:
bayesian_optimizer_logs//log00000.json
My run command:
python run_bayesian_optimization.py --bayesian_optimization_n_iter 50 --loss_funcs~OVERRIDE~ {metric_loss: {MultiSimilarityLoss: {alpha~BAYESIAN~: [0.01, 50], beta~BAYESIAN~: [0.01, 50], base~BAYESIAN~: [0, 1]}}} --mining_funcs~OVERRIDE~ {post_gradient_miner: {MultiSimilarityMiner: {epsilon~BAYESIAN~: [0, 1]}}} --experiment_name test5050_multi_similarity_with_ms_miner --root_experiment_folder experiments_opt --pytorch_home models
Could you please help? Thanks!