Megvii-BaseDetection / YOLOX

YOLOX is a high-performance anchor-free YOLO, exceeding YOLOv3–v5, with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/
Apache License 2.0

How to add Wandb Sweeps for hyper-parameter optimization? #1285

Open GunjanPatel1108 opened 2 years ago

GunjanPatel1108 commented 2 years ago

Any ideas on adding sweeps for hyper-parameter optimization on my own custom dataset, for parameters like learning rate, etc., to minimize loss? @wandb

morganmcg1 commented 2 years ago

Hey @GunjanPatel1108 , I work at W&B, thanks for your interest! @manangoel99 might be able to help you out there

manangoel99 commented 2 years ago

Hi @GunjanPatel1108 ! Are you using yolox.tools.train, or do you have your own custom training script?

GunjanPatel1108 commented 2 years ago

Hello @manangoel99, I am using the same training script (yolox.tools.train).

manangoel99 commented 2 years ago

@GunjanPatel1108 apologies for the delay. The way sweeps provide command-line arguments is slightly different from how the YOLOX training script does it, so I came up with a small modification to the yolox/tools/train.py file to integrate it with sweeps.

import argparse
import random
import warnings
from loguru import logger

import torch
import torch.backends.cudnn as cudnn
import wandb

from yolox.core import launch
from yolox.exp import Exp, get_exp
from yolox.utils import configure_module, configure_nccl, configure_omp, get_num_devices

# Sweep configuration: Bayesian search over weight_decay, maximizing val/COCOAP50
sweep_config = {
    "method": "bayes",
    "metric": {
        "name": "val/COCOAP50",
        "goal": "maximize"
    },
    "parameters": {
        "weight_decay": {
            "min": 0.001,
            "max": 0.005
        }
    }
}

def make_parser():
    parser = argparse.ArgumentParser("YOLOX train parser")
    parser.add_argument("-expn", "--experiment-name", type=str, default=None)
    parser.add_argument("-n", "--name", type=str, default=None, help="model name")

    # distributed
    parser.add_argument(
        "--dist-backend", default="nccl", type=str, help="distributed backend"
    )
    parser.add_argument(
        "--dist-url",
        default=None,
        type=str,
        help="url used to set up distributed training",
    )
    parser.add_argument("-b", "--batch-size", type=int, default=64, help="batch size")
    parser.add_argument(
        "-d", "--devices", default=None, type=int, help="device for training"
    )
    parser.add_argument(
        "-f",
        "--exp_file",
        default=None,
        type=str,
        help="plz input your experiment description file",
    )
    parser.add_argument(
        "--resume", default=False, action="store_true", help="resume training"
    )
    parser.add_argument("-c", "--ckpt", default=None, type=str, help="checkpoint file")
    parser.add_argument(
        "-e",
        "--start_epoch",
        default=None,
        type=int,
        help="resume training start epoch",
    )
    parser.add_argument(
        "--num_machines", default=1, type=int, help="num of node for training"
    )
    parser.add_argument(
        "--machine_rank", default=0, type=int, help="node rank for multi-node training"
    )
    parser.add_argument(
        "--fp16",
        dest="fp16",
        default=False,
        action="store_true",
        help="Adopting mix precision training.",
    )
    parser.add_argument(
        "--cache",
        dest="cache",
        default=False,
        action="store_true",
        help="Caching imgs to RAM for fast training.",
    )
    parser.add_argument(
        "-o",
        "--occupy",
        dest="occupy",
        default=False,
        action="store_true",
        help="occupy GPU memory first for training.",
    )
    parser.add_argument(
        "-l",
        "--logger",
        type=str,
        help="Logger to be used for metrics",
        default="tensorboard"
    )
    parser.add_argument(
        "opts",
        help="Modify config options using the command-line",
        default=None,
        nargs=argparse.REMAINDER,
    )
    return parser

@logger.catch
def main(exp: Exp, args):
    if exp.seed is not None:
        random.seed(exp.seed)
        torch.manual_seed(exp.seed)
        cudnn.deterministic = True
        warnings.warn(
            "You have chosen to seed training. This will turn on the CUDNN deterministic setting, "
            "which can slow down your training considerably! You may see unexpected behavior "
            "when restarting from checkpoints."
        )

    # set environment variables for distributed training
    configure_nccl()
    configure_omp()
    cudnn.benchmark = True

    trainer = exp.get_trainer(args)
    trainer.train()

# Register the sweep with W&B ("random" is just an example project name)
sweep_id = wandb.sweep(sweep_config, project="random")

def train():
    with wandb.init():
        config = wandb.config
        configure_module()
        args = make_parser().parse_args()
        exp = get_exp(args.exp_file, args.name)
        exp.merge(args.opts)

        # Apply the hyper-parameters sampled by the sweep to the Exp object
        for k, v in config.items():
            setattr(exp, k, v)

        if not args.experiment_name:
            args.experiment_name = exp.exp_name

        num_gpu = get_num_devices() if args.devices is None else args.devices
        assert num_gpu <= get_num_devices()

        dist_url = "auto" if args.dist_url is None else args.dist_url
        launch(
            main,
            num_gpu,
            args.num_machines,
            args.machine_rank,
            backend=args.dist_backend,
            dist_url=dist_url,
            args=(exp, args),
        )

# Start an agent that samples hyper-parameters and calls train() for each trial
wandb.agent(sweep_id, train)

All your sweep parameters can be added to the sweep_config dict at the top, just the way you would in a YAML file. For instance, a sketch of an extended config (weight_decay is from the example above; basic_lr_per_img and momentum are attributes of the default YOLOX Exp, so verify they exist on yours):
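sweep_config = {
    "method": "bayes",
    "metric": {"name": "val/COCOAP50", "goal": "maximize"},
    "parameters": {
        # the ranges below are illustrative, not tuned recommendations
        "weight_decay": {"min": 0.001, "max": 0.005},
        "basic_lr_per_img": {"min": 0.00001, "max": 0.001},
        "momentum": {"values": [0.9, 0.937]},
    },
}

Each key is applied to your Exp via setattr in train(), so any Exp attribute can be swept. You can run the above script exactly like you would run the YOLOX training script.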

python tools/train.py -f exps/example/custom/nano.py -d 2 -b 64 --fp16 -o -c ../yolox_nano.pth --logger wandb \
    print_interval 1 \
    eval_interval 1 \
    max_epoch 5 \
    wandb-project test \
    wandb-log_checkpoints True

This would start a single run.

python sweep.py -f exps/example/custom/nano.py -d 2 -b 64 --fp16 -o -c ../yolox_nano.pth --logger wandb \
    print_interval 1 \
    eval_interval 1 \
    max_epoch 5 \
    wandb-project test \
    wandb-log_checkpoints True

This would start the appropriate sweep (assuming the modified script above is saved as sweep.py).

Please let me know if this works for you.

GunjanPatel1108 commented 2 years ago

@manangoel99 Thank you, it's working, but I don't want to save all artifacts (i.e., checkpoints in the form of weights) because that stores too much data; I just want to save the weights carrying the best alias, and I also want to clear the cache. What modifications should I make? I tried adding this in logger.py:

import wandb

api = wandb.Api()
artifact_type, artifact_name = "model", f"model-{self.run.id}"
for version in api.artifact_versions(artifact_type, artifact_name):
    # delete checkpoint versions that carry no alias
    if len(version.aliases) == 0:
        version.delete()

Could you please have a look and help me?

manangoel99 commented 2 years ago

I'm assuming you're talking about this file. A quick and easy way could be to modify this snippet in the file

https://github.com/Megvii-BaseDetection/YOLOX/blob/main/yolox/utils/logger.py#L211-L214

and change it to this:

if is_best:
    aliases.append("best")
    self.run.log_artifact(artifact, aliases=aliases)

This would log a checkpoint only when it is the new best model.
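For context, a sketch of how that region of save_checkpoint might read after the change; the artifact construction and the default latest alias are a reconstruction from the linked lines, not verbatim upstream code:

# reconstruction of WandbLogger.save_checkpoint after the edit (sketch)
artifact = self.wandb.Artifact(name=f"model-{self.run.id}", type="model")
artifact.add_file(filename, name="model_ckpt.pth")
aliases = ["latest"]
if is_best:
    aliases.append("best")
    # logging now happens only when this checkpoint is the new best
    self.run.log_artifact(artifact, aliases=aliases)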

GunjanPatel1108 commented 2 years ago

Yes, but it would still log the weights at every epoch, along with the best and latest aliases. I want to keep only the checkpoints that carry the best alias (or any alias at all). What should I do?

manangoel99 commented 2 years ago

Ah okay, so you want to save the best model at the end and not log any others. In that case, remove the wandb-log_checkpoints True flag. The YOLOX trainer saves the best checkpoint under the name best_ckpt.pth, which you can then log at the end of training:

def train():
    with wandb.init():
        config = wandb.config
        configure_module()
        args = make_parser().parse_args()
        exp = get_exp(args.exp_file, args.name)
        exp.merge(args.opts)

        for k, v in config.items():
            setattr(exp, k, v)

        if not args.experiment_name:
            args.experiment_name = exp.exp_name

        num_gpu = get_num_devices() if args.devices is None else args.devices
        assert num_gpu <= get_num_devices()

        dist_url = "auto" if args.dist_url is None else args.dist_url
        launch(
            main,
            num_gpu,
            args.num_machines,
            args.machine_rank,
            backend=args.dist_backend,
            dist_url=dist_url,
            args=(exp, args),
        )
        # After launch() returns, log only the final best checkpoint as an artifact
        artifact = wandb.Artifact("model-name", type="model")
        artifact.add_file("<path_to_dir>/best_ckpt.pth")
        wandb.log_artifact(artifact, aliases=["best"])

For example, for me path_to_dir was YOLOX_outputs/yolox_nano.
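If you'd rather not hard-code the directory, a small sketch that derives it, assuming the default YOLOX layout of exp.output_dir/experiment_name (output_dir defaults to ./YOLOX_outputs):

import os

# the YOLOX trainer writes best_ckpt.pth into the experiment directory
ckpt_path = os.path.join(exp.output_dir, args.experiment_name, "best_ckpt.pth")
artifact = wandb.Artifact("model-name", type="model")
artifact.add_file(ckpt_path)
wandb.log_artifact(artifact, aliases=["best"])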

GunjanPatel1108 commented 2 years ago

Not at the end, during training itself. I just want to log files with aliases like best during training. Since I am running sweeps, my model is very big and its .pth checkpoints are also huge, so logging every checkpoint during training causes space issues on the W&B side. I added the lines above to remove all .pth artifacts without an alias during training, but it's not working, so could you please guide me? For example: if I run for 20 epochs, every epoch is saved as an artifact in wandb. Suppose the 1st epoch is both latest and best; then do nothing. Now suppose at the 5th epoch (v4) I get the best AP; then delete all previously saved checkpoint artifacts.

Is something like this possible?

manangoel99 commented 2 years ago

Ah okay! I get what you mean. I'm not entirely sure that's possible. Let me check and get back to you.

manangoel99 commented 2 years ago

@GunjanPatel1108 I checked on our end. There are two ways to do this. One is programmatically, like what you mentioned earlier; documentation is available here.

And the other is to do it manually: go to the run dashboard and delete the artifact through the UI.
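For the programmatic route, a minimal sketch using the public API; the entity/project/run-id path and version tag are placeholders to fill in:

import wandb

api = wandb.Api()
# fetch one version of the checkpoint artifact and delete it, aliases included
artifact = api.artifact("entity/project/model-abc123xy:v3", type="model")
artifact.delete(delete_aliases=True)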

GunjanPatel1108 commented 2 years ago

@manangoel99 Where can I add these lines, and what would my path be?

manangoel99 commented 2 years ago

I think you can add it here https://github.com/Megvii-BaseDetection/YOLOX/blob/main/yolox/utils/logger.py#L211-L214.

if is_best:
    for logged_artifact in self.run.logged_artifacts():
        if logged_artifact.type == "model":
            logged_artifact.delete(delete_aliases=True)
    aliases.append("best")
    self.run.log_artifact(artifact, aliases=aliases)

GunjanPatel1108 commented 2 years ago

It throws an error saying something like: there is no logged_artifacts.
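A possible cause: logged_artifacts() is a method of the public-API Run object (wandb.Api().run(...)), not of the live run returned by wandb.init(). A sketch of fetching the public-API view inside the logger, assuming self.run is the live run:

import wandb

api = wandb.Api()
# resolve the public-API view of the logger's live run
api_run = api.run(f"{self.run.entity}/{self.run.project}/{self.run.id}")
for logged_artifact in api_run.logged_artifacts():
    if logged_artifact.type == "model":
        logged_artifact.delete(delete_aliases=True)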

tfriedel commented 1 year ago

I'm trying out the proposed solution but run into this error:

Exception in thread SockSrvRdThr:
Traceback (most recent call last):
  File "/home/thomas/conda/envs/icevision/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/thomas/conda/envs/icevision/lib/python3.10/site-packages/wandb/sdk/service/server_sock.py", line 112, in run
    shandler(sreq)
  File "/home/thomas/conda/envs/icevision/lib/python3.10/site-packages/wandb/sdk/service/server_sock.py", line 173, in server_record_publish
    iface = self._mux.get_stream(stream_id).interface
  File "/home/thomas/conda/envs/icevision/lib/python3.10/site-packages/wandb/sdk/service/streams.py", line 199, in get_stream
    stream = self._streams[stream_id]
KeyError: '44q5wvqi'

which looks like it may be the same as this one: https://github.com/wandb/wandb/issues/4384

I'm using multiple GPUs. I tried disabling the wandb service with WANDB_DISABLE_SERVICE=True, but then I got another error:

2022-12-18 15:29:14 | INFO     | yolox.utils.logger:221 - There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()`before instantiating `WandbLogger`.
wandb: WARNING Config item 'weight_decay' was locked by 'sweep' (ignored update).
2022-12-18 15:29:14 | ERROR    | yolox.core.launch:147 - An error has been caught in function '_distributed_worker', process 'ForkProcess-2' (2568223), thread 'Thread-2 (_run_job)' (139907024738048):