Closed sotte closed 5 years ago
How are you setting the COMET_API_KEY? Can you post your key-redacted command line call?
Thanks for the quick response!
I set the COMET_API_KEY
as environment variable. I don't think the COMET_API_KEY
is the issue, because the first experiment ends up in the web interface and the second one is actually marked as "started". (I actually have an assert checking the COMET_API_KEY
in the very beginning to fail early even before I instantiate the Optimizer
which sometimes takes a few seconds.)
I call the hyper parameter tuner something like this
python tune.py comet_evaluation.mnist_demo.baseline comet/mnist_demo/exp_baseline.pcs 50
where tune.py
is my little helper script that I posted earlier.
Thanks, that is helpful. I'll see if I can replicate. We are also adding more feedback on these types of failures. To get more info now, you can also set these environment variables:
export COMET_LOGGING_FILE="/some/file.log"
export COMET_LOGGING_FILE_LEVEL="DEBUG"
and then look in the file /some/file.log
Thanks! I'll run my experiments with logging enabled and keep you posted if I find something out that looks odd.
I couldn't replicate your error exactly, but I suspect it has something to do with you creating the Experiment over in the non-standard imported function. The Optimizer needs to operate in the context of a live Experiment, and you are creating (and destroying) the Experiment object over in the imported function. Perhaps that is in another thread, or process. Or perhaps the Experiment object is being deleted in garbage collection.
My recommendation would be to create the Experiment at the top if you can, and use it throughout. Or, if you can't do that, create the experiment closer to the Optimizer (and before it is created) and pass it to functions as needed.
Let me know if you have any questions. Also, let us know if you resolve the issue. Thanks for your details in reproducing this!
Interesting. Maybe it's related to the import function. For completes-sake this is the function:
def import_from_module(something: str, module_str: str) -> Any:
"""Import `something` from the given `module_str` and return it."""
return getattr(importlib.import_module(module_str), something)
I don't think the experiment is created in another thread or process and it should only be garbage collected after the experiment is done.
I'll give your recommendations a try and get back to you (probably Monday).
Recreating the Experiment
should be fine, at least in your optimizer example the Experiment
is recreated inside the loop as well. Maybe you can update the docs and also add a hyper parameter tuning example in the example repo.
I'll update the examples to be better. You are right re: the example, and that works.
I think the issue is in a conflict with the manner of import, and the method that we use to find the current live experiment. With your code, I hope to replicate and find a fix.
Quick update: No code changes, but the behavior I described is not completely reproducible. I still get the COMET ERROR: Failed to setup the std logger
after the first run, but this time the following experiments appear in the the web ui (without any output from stdout of course).
Update: and it got stuck on a later experiment. No updates of the UI anymore.
Ok, thanks. We're working on the std logger error. Will update when we have more info.
I'm running a version without any "magic imports" and I still get the COMET ERROR: Failed to setup the std logger
error. The code looks something like this:
import comet_ml
# ...
def fit_fn(exp, params):
# ...
@click.argument("param_file", type=str)
@click.argument("n_trials", type=int, default=10)
def main(param_file, n_trials):
optimizer = comet_ml.Optimizer()
optimizer.set_params_file(param_file)
for i in range(n_trials):
suggestion = optimizer.get_suggestion()
params = {k: suggestion[k] for k in suggestion}
exp = comet_ml.Experiment(project_name=PROJECT_NAME, workspace="MYWORKSPACE")
score_name, score_val = fit_fn(exp, params)
suggestion.report_score(score_name, score_val)
if __name__ == "__main__":
main()
This means the problem was not the "magic import" function.
I was not able to replicate (with our soon-to-be-released version of comet_ml). Here is a complete script:
# file: issue_198-2.py
import comet_ml
import click
import random
def fit_fn(exp, params):
exp.log_parameters(params)
return "test", random.random()
PROJECT_NAME="prj-name"
@click.command()
@click.argument("param_file", type=str)
@click.argument("n_trials", type=int, default=10)
def main(param_file, n_trials):
optimizer = comet_ml.Optimizer()
optimizer.set_params_file(param_file)
for i in range(n_trials):
suggestion = optimizer.get_suggestion()
params = {k: suggestion[k] for k in suggestion}
exp = comet_ml.Experiment(project_name=PROJECT_NAME)
score_name, score_val = fit_fn(exp, params)
suggestion.report_score(score_name, score_val)
if __name__ == "__main__":
main()
File: pf1.pcs
x integer [1, 10] [10]
y real [1, 10] [1.0]
Command line:
python3 issue_198-2.py pf1.pcs 10
No errors were reported. So, we either have it fixed, or I still couldn't replicate. I'll report back when the new version of comet_ml is available.
Thanks @dsblank! I ran your script and it worked. No errors.
In my script, all cometml parts are exactly like in yours, but it still throws the "std logger error". So I did a binary search and removed everything from that script until I located the error.
And now it get weird! It seems that the logger does not work anymore (COMET ERROR: Failed to setup the std logger
) once I move my pytorch model to the GPU. How can that be?
Respectively, when I don't move the model to the GPU I don't get the logger error. Without cometml the console output works properly.
Here is the minimal example:
# file: minimal.py
import comet_ml # noqa
import os
import random
import time
import click
import torchvision.models
PROJECT_NAME = "mnist_demo"
WORKSPACE = "my_workspace_name"
def fit_fn(exp, params):
print("PARAMS")
exp.log_parameters(params)
print("MODEL")
module = torchvision.models.resnet18(pretrained=True)
module.cuda() # <-- this line causes the std logging error
val_acc = random.random()
exp.log_metric("val_acc", val_acc)
time.sleep(4)
return "val_acc", val_acc
@click.command()
@click.argument("param_file", type=str)
@click.argument("n_trials", type=int)
def main(param_file, n_trials):
optimizer = comet_ml.Optimizer()
optimizer.set_params_file(param_file)
for i in range(n_trials):
suggestion = optimizer.get_suggestion()
params = {k: suggestion[k] for k in suggestion}
exp = comet_ml.Experiment(project_name=PROJECT_NAME, workspace=WORKSPACE)
score_name, score_val = fit_fn(exp, params)
suggestion.report_score(score_name, score_val)
if __name__ == "__main__":
main()
The PCS file is like yours. I start the script like this.
python minimal.py pf1.pcs 4
For completeness sake:
$ python
Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 18:33:04)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.0.0'
I ran your example with Python 3.6, torch 1.0.0 and did not get the std logger error. Is English the default language of the machine? I have seen some reference to errors with unicode and logging in the slack channel.
Here is the output from my run (the "Failed to parse" will be removed in the next version of the SDK, FYI):
$ python3 minimal.py pf1.pcs 4
COMET WARNING: Failed to parse argv values. Falling back to naive parsing.
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/2cc78c9e5cab4ca18d3d16ffad58fc88
PARAMS
MODEL
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/2cc78c9e5cab4ca18d3d16ffad58fc88
COMET WARNING: Failed to parse argv values. Falling back to naive parsing.
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/32937167b0d148c48c2158b9282f206d
PARAMS
MODEL
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/32937167b0d148c48c2158b9282f206d
COMET WARNING: Failed to parse argv values. Falling back to naive parsing.
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/027d34c12cb748c391449835229d4a6b
PARAMS
MODEL
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/027d34c12cb748c391449835229d4a6b
COMET WARNING: Failed to parse argv values. Falling back to naive parsing.
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/0187147c9cc645939cf2b2b0a47ba0db
PARAMS
MODEL
COMET INFO: Uploading stats to Comet before program termination (may take several seconds)
COMET INFO: Still uploading
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/0187147c9cc645939cf2b2b0a47ba0db
COMET INFO: Uploading stats to Comet before program termination (may take several seconds)
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/027d34c12cb748c391449835229d4a6b
COMET INFO: Uploading stats to Comet before program termination (may take several seconds)
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/32937167b0d148c48c2158b9282f206d
COMET INFO: Uploading stats to Comet before program termination (may take several seconds)
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/2cc78c9e5cab4ca18d3d16ffad58fc88
Now trying to upgrade to Python 3.7 to see if that makes a difference...
@dsblank thanks for all your work! I appreciate it!
It's a fresh GCE instance with the "Google Deep Learning VM" image version m16 for pytorch. LANG is "en_US.UTF-8". I don't print any weird unicode characters.
The instance was created with a script that looks something like this:
export IMAGE_FAMILY="pytorch-latest-gpu"
export ZONE="..."
export INSTANCE_NAME="..."
export PROJECT="..."
gcloud compute instances create $INSTANCE_NAME \
--project=$PROJECT \
--zone=$ZONE \
--image-family=$IMAGE_FAMILY \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--accelerator="type=nvidia-tesla-p100,count=1" \
--metadata="install-nvidia-driver=True" \
--preemptible
@sotte Thanks for the continued exploring... there is one other thing you can do. Can you set these environment vars, and send me (doug@comet.ml) the resulting output file comet.log
?
export COMET_LOGGING_FILE="comet.log"
export COMET_LOGGING_FILE_LEVEL="DEBUG"
I assume that script worked for you with python 3.7?
I enabled debug logging yesterday (?) but the logs did not show anything suspicious. I'll double check later again.
Hit a snag trying to get a machine with Python 3.7 and a GPU. But not giving up :)
Update: couldn't find an easy way to get Python 3.7 setup with GPUs. I will see if I can try via Google Cloud. In the meantime, you can also try changing the type of logger. Rather than using the default, you can try a simpler version like this:
exp = comet_ml.Experiment(project_name=PROJECT_NAME,
workspace=WORKSPACE,
auto_output_logging="simple")
I've been too busy to try python 3.6 and to recheck the logs, sorry :(
I hope I can try the simple logger.
I'll get back to you when I have any news.
No problem! We'll keep working on this end too.
I believe that all issue related to this problem have been addressed. Please reopen and describe if you have any questions or issues. Thanks for reporting!
I'm testing the hyper parameter tuning and after the first experiment I always get the following errors and the experiments don't continue:
Any ideas what could cause this?
Here is my setup. Because the main hyper parameter loop is always the same I put it into a little helper script. This allows me to optimize arbitrary experiments assuming they have a
main
function that accepts the parameters as argument.The actual module with the experiment looks something like this:
I don't see anything that could be wrong, but I guess there must be something wrong :)