Closed vincent-antaki closed 2 years ago
I'm interested in picking up this issue.
To make HyperparamsRepo multiprocess safe, I am thinking of a solution where we first compute all the trials and keep them in an in-memory data structure. Once all the processing is complete, we then store them as JSON files.
So let's say something like below
{ "<cpu_core_id>":{"<trial_id>":[<hps_info>]} }
At the end of the processing, we iterate over the above structure and generate the JSON files (the best hyperparameters, the list of trials, etc.).
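A minimal sketch of this proposal, assuming the structure above is a plain nested dict. The function name `flush_trials` and the `<worker>_<trial>.json` file naming are hypothetical and are not part of the existing repository code:

```python
import json
import os
import tempfile


def flush_trials(collected, out_dir):
    """Write one JSON file per trial from {worker_id: {trial_id: hps_info}}."""
    paths = []
    for worker_id, trials in collected.items():
        for trial_id, hps_info in trials.items():
            # Hypothetical naming scheme: the worker id is part of the file
            # name, so two workers never target the same file.
            path = os.path.join(out_dir, f"{worker_id}_{trial_id}.json")
            with open(path, "w") as f:
                json.dump(hps_info, f)
            paths.append(path)
    return sorted(paths)


# Usage: everything is held in memory and written once at the very end.
collected = {
    "0": {"trial_a": [{"learning_rate": 0.1}]},
    "1": {"trial_b": [{"learning_rate": 0.2}]},
}
written = flush_trials(collected, tempfile.mkdtemp())
```

Note that nothing reaches disk until the final flush, so intermediate results are lost if a worker crashes mid-run.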
Hello Rohith,
The suggested solution isn't ideal, mostly because we save the trial JSON multiple times during training (usually at each epoch). I personally advocate for the two following solutions:
Not having the hash depend on the hyperparameters would also save us a headache or two when dealing with issues #496 and #504. Furthermore, it would allow us to solve the problem where the logger currently uses the trial number, instead of the trial hash, to generate the log file path (I thought there was an open issue related to this, but I can't find it).
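As an illustration only, one way to get an identifier that does not depend on the hyperparameters is to draw a random UUID per trial. The helper name `new_trial_id` is hypothetical and is not an existing function of the library:

```python
import uuid


def new_trial_id() -> str:
    # Random 128-bit identifier rendered as 32 hex characters. It is
    # independent of the hyperparameters, so two trials with identical
    # hyperparameters still get distinct ids (and distinct files).
    return uuid.uuid4().hex


# Usage: many ids generated in a row do not repeat in practice.
ids = {new_trial_id() for _ in range(1000)}
```

The same identifier could then be used for both the trial JSON file and the log file path, at the cost of no longer being able to look up a trial from its hyperparameters alone.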
Describe the bug
This bug involves HyperparameterJSONRepo and the get_trial_hash_function. It is very unlikely to happen if your model has real-valued hyperparameters, and more likely if your hyperparameter space is small and discrete and the pipeline computations are quick.
Here is the gist of it:
In case of non-parallel execution, this means that we are overwriting the previous trial with the same hyperparameters. In case of parallel execution, this is a potential race hazard. See example to reproduce.
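The collision can be illustrated with a small sketch. `hash_from_hyperparams` below is a hypothetical stand-in for a hash computed only from the hyperparameters; it is not the library's actual implementation:

```python
import hashlib
import json


def hash_from_hyperparams(hyperparams: dict) -> str:
    # Hypothetical stand-in: serialize the hyperparameters in a canonical
    # key order, then hash the resulting string.
    canonical = json.dumps(hyperparams, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two separate trials that happened to sample the same discrete values.
trial_a = {"learning_rate": 0.1, "n_layers": 2}
trial_b = {"n_layers": 2, "learning_rate": 0.1}

# Same hash means same file name, so the two trials share one JSON file.
same_file = hash_from_hyperparams(trial_a) == hash_from_hyperparams(trial_b)
```

With a small discrete space, repeated samples are common, which is why the problem shows up mostly in that setting.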
To Reproduce
Take the test_logger_automl and the following:
Expected behavior and suggested fix
First off, I'm not convinced that the overwriting behaviour in the non-parallel setting is problematic. If the hyperparameters are the same, we shouldn't see any difference in outputs unless it is due to sampling random values. If that is considered problematic, then users should set a random seed as part of their hyperparameter space.
Second, I'd assume that not crashing when running in parallel is the expected behaviour. To achieve that, development of a multiprocess-safe HyperparamsRepo should be considered.
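One ingredient of a multiprocess-safe repository could be writing each trial file atomically, so that a concurrent reader never sees a half-written JSON file. The sketch below is a generic pattern under that assumption, not the fix adopted in the library, and `atomic_write_json` is a hypothetical helper name:

```python
import json
import os
import tempfile


def atomic_write_json(data, path):
    """Write data to path so that readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so that the final
    # rename stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the file in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Usage: saving the same trial twice leaves one complete file with the last content.
target = os.path.join(tempfile.mkdtemp(), "trial.json")
atomic_write_json({"epoch": 1, "score": 0.5}, target)
atomic_write_json({"epoch": 2, "score": 0.7}, target)
```

This only prevents corrupted files. If two processes write to the same path, the last writer still wins, so it would need to be combined with unique per-trial identifiers or a lock.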