Hyper parameter search fails to setup std logger and stops after first experiment

sotte commented 5 years ago

I'm testing the hyper parameter tuning and after the first experiment I always get the following errors and the experiments don't continue:

COMET ERROR: Failed to setup the std logger
COMET WARNING: Failed to parse argv values. Falling back to naive parsing.

Any ideas what could cause this?

Here is my setup. Because the main hyper parameter loop is always the same I put it into a little helper script. This allows me to optimize arbitrary experiments assuming they have a main function that accepts the parameters as argument.

from comet_ml import Optimizer  # noqa
import click
from somewhere import import_from_module

@click.command()
@click.argument("module", type=str)
@click.argument("param_file", type=str)
@click.argument("n_trials", type=int)
def main(module, param_file, n_trials):
    """Tune the hyper parameters for the given `module` with the `param_file` for `n_trials`."""
    fit_fn = import_from_module("main", module)
    optimizer = Optimizer()
    optimizer.set_params_file(param_file)
    for _ in range(n_trials):
        suggestion = optimizer.get_suggestion()
        params = {k: suggestion[k] for k in suggestion}
        score_name, score_val = fit_fn(**params)
        suggestion.report_score(score_name, score_val)

if __name__ == "__main__":
    main()

The actual module with the experiment looks something like this:

# imports ...
def main(arch, optimizer, lr, batch_size, train_split_percent, n_epochs):
    params = locals()
    exp = comet_ml.Experiment(project_name="PRJ_NAME", workspace="SOME_WORKSPACE")
    exp.log_parameters(params)
    # ...
    return score_name, score_value

I don't see anything that could be wrong, but I guess there must be something wrong :)

dsblank commented 5 years ago

How are you setting the COMET_API_KEY? Can you post your key-redacted command line call?

sotte commented 5 years ago

Thanks for the quick response!

I set the COMET_API_KEYas environment variable. I don't think the COMET_API_KEY is the issue, because the first experiment ends up in the web interface and the second one is actually marked as "started". (I actually have an assert checking the COMET_API_KEY in the very beginning to fail early even before I instantiate the Optimizer which sometimes takes a few seconds.)

I call the hyper parameter tuner something like this

python tune.py comet_evaluation.mnist_demo.baseline comet/mnist_demo/exp_baseline.pcs 50

where tune.py is my little helper script that I posted earlier.

dsblank commented 5 years ago

Thanks, that is helpful. I'll see if I can replicate. We are also adding more feedback on these types of failures. To get more info now, you can also set these environment variables:

export COMET_LOGGING_FILE="/some/file.log" 
export COMET_LOGGING_FILE_LEVEL="DEBUG"

and then look in the file /some/file.log

sotte commented 5 years ago

Thanks! I'll run my experiments with logging enabled and keep you posted if I find something out that looks odd.

dsblank commented 5 years ago

I couldn't replicate your error exactly, but I suspect it has something to do with you creating the Experiment over in the non-standard imported function. The Optimizer needs to operate in the context of a live Experiment, and you are creating (and destroying) the Experiment object over in the imported function. Perhaps that is in another thread, or process. Or perhaps the Experiment object is being deleted in garbage collection.

My recommendation would be to create the Experiment at the top if you can, and use it throughout. Or, if you can't do that, create the experiment closer to the Optimizer (and before it is created) and pass it to functions as needed.

Let me know if you have any questions. Also, let us know if you resolve the issue. Thanks for your details in reproducing this!

sotte commented 5 years ago

Interesting. Maybe it's related to the import function. For completes-sake this is the function:

def import_from_module(something: str, module_str: str) -> Any:
    """Import `something` from the given `module_str` and return it."""
    return getattr(importlib.import_module(module_str), something)

I don't think the experiment is created in another thread or process and it should only be garbage collected after the experiment is done.

I'll give your recommendations a try and get back to you (probably Monday).

Recreating the Experiment should be fine, at least in your optimizer example the Experiment is recreated inside the loop as well. Maybe you can update the docs and also add a hyper parameter tuning example in the example repo.

dsblank commented 5 years ago

I'll update the examples to be better. You are right re: the example, and that works.

I think the issue is in a conflict with the manner of import, and the method that we use to find the current live experiment. With your code, I hope to replicate and find a fix.

sotte commented 5 years ago

Quick update: No code changes, but the behavior I described is not completely reproducible. I still get the COMET ERROR: Failed to setup the std logger after the first run, but this time the following experiments appear in the the web ui (without any output from stdout of course).

Update: and it got stuck on a later experiment. No updates of the UI anymore.

dsblank commented 5 years ago

Ok, thanks. We're working on the std logger error. Will update when we have more info.

sotte commented 5 years ago

I'm running a version without any "magic imports" and I still get the COMET ERROR: Failed to setup the std logger error. The code looks something like this:

import comet_ml
# ...

def fit_fn(exp, params):
    # ...

@click.argument("param_file", type=str)
@click.argument("n_trials", type=int, default=10)
def main(param_file, n_trials):
    optimizer = comet_ml.Optimizer()
    optimizer.set_params_file(param_file)

    for i in range(n_trials):
        suggestion = optimizer.get_suggestion()
        params = {k: suggestion[k] for k in suggestion}
        exp = comet_ml.Experiment(project_name=PROJECT_NAME, workspace="MYWORKSPACE")
        score_name, score_val = fit_fn(exp, params)
        suggestion.report_score(score_name, score_val)

if __name__ == "__main__":
    main()

This means the problem was not the "magic import" function.

dsblank commented 5 years ago

I was not able to replicate (with our soon-to-be-released version of comet_ml). Here is a complete script:

# file: issue_198-2.py
import comet_ml
import click
import random

def fit_fn(exp, params):
    exp.log_parameters(params)
    return "test", random.random()

PROJECT_NAME="prj-name"

@click.command()
@click.argument("param_file", type=str)
@click.argument("n_trials", type=int, default=10)
def main(param_file, n_trials):
    optimizer = comet_ml.Optimizer()
    optimizer.set_params_file(param_file)

    for i in range(n_trials):
        suggestion = optimizer.get_suggestion()
        params = {k: suggestion[k] for k in suggestion}
        exp = comet_ml.Experiment(project_name=PROJECT_NAME)
        score_name, score_val = fit_fn(exp, params)
        suggestion.report_score(score_name, score_val)

if __name__ == "__main__":
    main()

File: pf1.pcs

x integer [1, 10] [10]
y real [1, 10] [1.0]

Command line:

python3 issue_198-2.py pf1.pcs 10

No errors were reported. So, we either have it fixed, or I still couldn't replicate. I'll report back when the new version of comet_ml is available.

sotte commented 5 years ago

Thanks @dsblank! I ran your script and it worked. No errors.

In my script, all cometml parts are exactly like in yours, but it still throws the "std logger error". So I did a binary search and removed everything from that script until I located the error.

And now it get weird! It seems that the logger does not work anymore (COMET ERROR: Failed to setup the std logger) once I move my pytorch model to the GPU. How can that be? Respectively, when I don't move the model to the GPU I don't get the logger error. Without cometml the console output works properly.

Here is the minimal example:

# file: minimal.py
import comet_ml  # noqa

import os
import random
import time
import click
import torchvision.models

PROJECT_NAME = "mnist_demo"
WORKSPACE = "my_workspace_name"

def fit_fn(exp, params):
    print("PARAMS")
    exp.log_parameters(params)

    print("MODEL")
    module = torchvision.models.resnet18(pretrained=True)
    module.cuda()  # <-- this line causes the std logging error

    val_acc = random.random()
    exp.log_metric("val_acc", val_acc)
    time.sleep(4)
    return "val_acc", val_acc

@click.command()
@click.argument("param_file", type=str)
@click.argument("n_trials", type=int)
def main(param_file, n_trials):
    optimizer = comet_ml.Optimizer()
    optimizer.set_params_file(param_file)

    for i in range(n_trials):
        suggestion = optimizer.get_suggestion()
        params = {k: suggestion[k] for k in suggestion}
        exp = comet_ml.Experiment(project_name=PROJECT_NAME, workspace=WORKSPACE)
        score_name, score_val = fit_fn(exp, params)
        suggestion.report_score(score_name, score_val)

if __name__ == "__main__":
    main()

The PCS file is like yours. I start the script like this.

python minimal.py pf1.pcs 4

For completeness sake:

$ python
Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 18:33:04)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.0.0'

dsblank commented 5 years ago

I ran your example with Python 3.6, torch 1.0.0 and did not get the std logger error. Is English the default language of the machine? I have seen some reference to errors with unicode and logging in the slack channel.

Here is the output from my run (the "Failed to parse" will be removed in the next version of the SDK, FYI):

$ python3 minimal.py pf1.pcs 4
COMET WARNING: Failed to parse argv values. Falling back to naive parsing.
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/2cc78c9e5cab4ca18d3d16ffad58fc88

PARAMS
MODEL
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/2cc78c9e5cab4ca18d3d16ffad58fc88

COMET WARNING: Failed to parse argv values. Falling back to naive parsing.
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/32937167b0d148c48c2158b9282f206d

PARAMS
MODEL
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/32937167b0d148c48c2158b9282f206d

COMET WARNING: Failed to parse argv values. Falling back to naive parsing.
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/027d34c12cb748c391449835229d4a6b

PARAMS
MODEL
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/027d34c12cb748c391449835229d4a6b

COMET WARNING: Failed to parse argv values. Falling back to naive parsing.
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/0187147c9cc645939cf2b2b0a47ba0db

PARAMS
MODEL
COMET INFO: Uploading stats to Comet before program termination (may take several seconds)
COMET INFO: Still uploading
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/0187147c9cc645939cf2b2b0a47ba0db

COMET INFO: Uploading stats to Comet before program termination (may take several seconds)
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/027d34c12cb748c391449835229d4a6b

COMET INFO: Uploading stats to Comet before program termination (may take several seconds)
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/32937167b0d148c48c2158b9282f206d

COMET INFO: Uploading stats to Comet before program termination (may take several seconds)
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/dsblank/general/2cc78c9e5cab4ca18d3d16ffad58fc88

Now trying to upgrade to Python 3.7 to see if that makes a difference...

sotte commented 5 years ago

@dsblank thanks for all your work! I appreciate it!

It's a fresh GCE instance with the "Google Deep Learning VM" image version m16 for pytorch. LANG is "en_US.UTF-8". I don't print any weird unicode characters.

The instance was created with a script that looks something like this:

export IMAGE_FAMILY="pytorch-latest-gpu"
export ZONE="..."
export INSTANCE_NAME="..."
export PROJECT="..."

gcloud compute instances create $INSTANCE_NAME \
  --project=$PROJECT \
  --zone=$ZONE \
  --image-family=$IMAGE_FAMILY \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --accelerator="type=nvidia-tesla-p100,count=1" \
  --metadata="install-nvidia-driver=True" \
  --preemptible

dsblank commented 5 years ago

@sotte Thanks for the continued exploring... there is one other thing you can do. Can you set these environment vars, and send me (doug@comet.ml) the resulting output file comet.log?

export COMET_LOGGING_FILE="comet.log" 
export COMET_LOGGING_FILE_LEVEL="DEBUG"

sotte commented 5 years ago

I assume that script worked for you with python 3.7?

I enabled debug logging yesterday (?) but the logs did not show anything suspicious. I'll double check later again.

dsblank commented 5 years ago

Hit a snag trying to get a machine with Python 3.7 and a GPU. But not giving up :)

dsblank commented 5 years ago

Update: couldn't find an easy way to get Python 3.7 setup with GPUs. I will see if I can try via Google Cloud. In the meantime, you can also try changing the type of logger. Rather than using the default, you can try a simpler version like this:

exp = comet_ml.Experiment(project_name=PROJECT_NAME, 
    workspace=WORKSPACE,
    auto_output_logging="simple")

sotte commented 5 years ago

I've been too busy to try python 3.6 and to recheck the logs, sorry :(

I hope I can try the simple logger.

I'll get back to you when I have any news.

dsblank commented 5 years ago

No problem! We'll keep working on this end too.

dsblank commented 5 years ago

I believe that all issue related to this problem have been addressed. Please reopen and describe if you have any questions or issues. Thanks for reporting!

comet-ml / issue-tracking

Hyper parameter search fails to setup std logger and stops after first experiment #198