facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.
MIT License

How to record mAP for every epoch on the validation dataset #348

Open nprasad2021 opened 5 years ago

nprasad2021 commented 5 years ago

❓ Questions and Help

Hi,

I have two questions relating to storing training/val/test accuracies/losses -

  1. In the log file, where can I find mAP values for the training, validation, and testing datasets?
  2. I have implemented the workflow in issue #171, where I call the inference function in engine.trainer.do_train. Where should I modify the code (in the "inference" function or in the "coco_evaluation" file) to save mAP values and losses at every epoch?
fmassa commented 5 years ago

Hi,

1. We don't compute the mAP during training. You could compute it for the training set, but it would take quite some time.

2. You can modify this function to store the mAP / losses in the format that you want: https://github.com/facebookresearch/maskrcnn-benchmark/blob/d28845e112de36781b2b5f7217a34b2b62de8d2f/maskrcnn_benchmark/data/datasets/evaluation/coco/coco_eval.py#L63
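For illustration, here is a minimal sketch of such a modification (the helper name is hypothetical, and it assumes results.results maps an IoU type such as "bbox" or "segm" to a dict of metric name to value):

import json
import os

# Hypothetical helper, to be called from the COCO evaluation code once the
# results object has been built.
def dump_metrics(results, output_folder, filename="metrics.json"):
    record = {
        iou_type: {metric: float(value) for metric, value in metrics.items()}
        for iou_type, metrics in results.results.items()
    }
    with open(os.path.join(output_folder, filename), "w") as f:
        json.dump(record, f, indent=2)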

Let me know if you have further questions.

nprasad2021 commented 5 years ago

Thanks! I've added the following function "val" to maskrcnn_benchmark.engine.trainer:

# Additional imports needed at the top of trainer.py for this function:
import os

from maskrcnn_benchmark.data import make_data_loader
from maskrcnn_benchmark.engine.inference import inference
from maskrcnn_benchmark.utils.miscellaneous import mkdir


def val(cfg, model, distributed=False):
    if distributed:
        model = model.module
    torch.cuda.empty_cache()  # TODO check if it helps
    iou_types = ("bbox",)
    if cfg.MODEL.MASK_ON:
        iou_types = iou_types + ("segm",)
    output_folders = [None] * (len(cfg.DATASETS.TEST) + len(cfg.DATASETS.TRAIN))
    dataset_names = cfg.DATASETS.TEST + cfg.DATASETS.TRAIN
    if cfg.OUTPUT_DIR:
        for idx, dataset_name in enumerate(dataset_names):
            print(dataset_name)
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference", dataset_name)
            mkdir(output_folder)
            output_folders[idx] = output_folder
    data_loaders_val = make_data_loader(cfg, is_train=False, is_distributed=distributed)
    output_tuple = {}
    for output_folder, dataset_name, data_loader_val in zip(output_folders, dataset_names, data_loaders_val):
        result = inference(
            model,
            data_loader_val,
            dataset_name=dataset_name,
            iou_types=iou_types,
            box_only=cfg.MODEL.RPN_ONLY,
            device=cfg.MODEL.DEVICE,
            expected_results=cfg.TEST.EXPECTED_RESULTS,
            expected_results_sigma_tol=cfg.TEST.EXPECTED_RESULTS_SIGMA_TOL,
            output_folder=output_folder,
        )[0].results['bbox']
        output_tuple[dataset_name] = {}
        output_tuple[dataset_name]['AP'] = result['AP'].item()
        output_tuple[dataset_name]['AP50'] = result['AP50'].item()

    return output_tuple

I also added the following to the function do_train():

if iteration % checkpoint_period == 0:
    checkpointer.save("model_{:07d}".format(iteration), **arguments)
    print("ENTER VALIDATION CALCULATIONS")
    output[iteration] = val(cfg, model, distributed)
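(As an aside, a minimal sketch for persisting these metrics, assuming output is a plain dict keyed by iteration and that the file name is arbitrary:)

import json
import os

# Sketch: dump the accumulated {iteration: {dataset: {"AP": ..., "AP50": ...}}}
# dict to disk after each validation pass so the values are not lost on a crash.
with open(os.path.join(cfg.OUTPUT_DIR, "val_metrics.json"), "w") as f:
    json.dump(output, f, indent=2)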

I am getting the following error:

Traceback (most recent call last):
  File "tools/train_net.py", line 184, in <module>
    main()
  File "tools/train_net.py", line 177, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 77, in train
    distributed
  File "/data/home/nprasad/Documents/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 113, in do_train
    losses = sum(loss for loss in loss_dict.values())
AttributeError: 'list' object has no attribute 'values'

Do you have any idea what might have caused this error? Alternatively, I could create a separate script to evaluate all models saved by the checkpointer, similar to tools/test_net.py. How do I select a specific model path for this script to use in evaluation?

I know that line 61 in test_net.py builds the model with model = build_detection_model(cfg). How do I specify a model saved at an earlier iteration?

fmassa commented 5 years ago

How do I specify a model saved at an earlier iteration?

Just use tools/test_net.py and pass MODEL.WEIGHT as the checkpoint you want. That will be much easier, I believe, and you won't need to handle the multi-processing that comes with distributed training.
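For reference, the override goes through the yacs config that test_net.py builds; a minimal Python equivalent (the checkpoint path below is just a placeholder) would be:

from maskrcnn_benchmark.config import cfg

# Equivalent of appending "MODEL.WEIGHT <path>" to the test_net.py command line;
# the checkpoint path is a placeholder for one of your saved models.
cfg.merge_from_list(["MODEL.WEIGHT", "output/model_0050000.pth"])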

nprasad2021 commented 5 years ago

Thanks - the following script, adapted from test_net.py, performs the appropriate calculations:

from maskrcnn_benchmark.utils.env import setup_environment  # noqa F401 isort:skip

import argparse
import os, pickle, sys
print(sys.path)

from os import listdir
from os.path import isfile, join

import torch
from maskrcnn_benchmark.config import cfg
from maskrcnn_benchmark.data import make_data_loader
from maskrcnn_benchmark.engine.inference import inference
from maskrcnn_benchmark.modeling.detector import build_detection_model
from maskrcnn_benchmark.utils.checkpoint import DetectronCheckpointer
from maskrcnn_benchmark.utils.collect_env import collect_env_info
from maskrcnn_benchmark.utils.comm import synchronize, get_rank
from maskrcnn_benchmark.utils.logger import setup_logger
from maskrcnn_benchmark.utils.miscellaneous import mkdir
from maskrcnn_benchmark.engine.plotMaps import plot

def inf(args, cfg):

    num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
    distributed = num_gpus > 1

    if distributed:
        torch.cuda.set_device(args.local_rank)
        torch.distributed.init_process_group(
            backend="nccl", init_method="env://"
        )

    save_dir = os.path.join(cfg.OUTPUT_DIR, "testInf")
    mkdir(save_dir)
    logger = setup_logger("maskrcnn_benchmark", save_dir, get_rank())
    logger.info("Using {} GPUs".format(num_gpus))
    logger.info(cfg)

    logger.info("Collecting env info (might take some time)")
    logger.info("\n" + collect_env_info())
    print(cfg.MODEL.WEIGHT)
    model = build_detection_model(cfg)
    model.to(cfg.MODEL.DEVICE)

    output_dir = cfg.OUTPUT_DIR
    checkpointer = DetectronCheckpointer(cfg, model, save_dir=output_dir)
    _ = checkpointer.load(cfg.MODEL.WEIGHT)

    iou_types = ("bbox",)
    if cfg.MODEL.MASK_ON:
        iou_types = iou_types + ("segm",)
    output_folders = [None] * len(cfg.DATASETS.TEST) 
    dataset_names = cfg.DATASETS.TEST
    print("Dataset Names", dataset_names)
    if cfg.OUTPUT_DIR:
        for idx, dataset_name in enumerate(dataset_names):
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference", dataset_name)
            mkdir(output_folder)
            output_folders[idx] = output_folder
    data_loaders_val = make_data_loader(cfg, is_train=False, is_distributed=distributed)
    output_tuple = {}
    for output_folder, dataset_name, data_loader_val in zip(output_folders, dataset_names, data_loaders_val):
        r = inference(
            model,
            data_loader_val,
            dataset_name=dataset_name,
            iou_types=iou_types,
            box_only=cfg.MODEL.RPN_ONLY,
            device=cfg.MODEL.DEVICE,
            expected_results=cfg.TEST.EXPECTED_RESULTS,
            expected_results_sigma_tol=cfg.TEST.EXPECTED_RESULTS_SIGMA_TOL,
            output_folder=output_folder,
        )[0].results['bbox']

        output_tuple[dataset_name] = {}
        output_tuple[dataset_name]['AP'] = r['AP'].item()
        output_tuple[dataset_name]['AP50'] = r['AP50'].item()

        synchronize()
    return output_tuple

def recordResults(args, cfg):
    # Evaluate the starting weights plus every checkpoint saved in OUTPUT_DIR,
    # keyed by the training iteration each checkpoint corresponds to.
    homeDir = "/home/nprasad/Documents/github/maskrcnn-benchmark"
    model_paths = [cfg.MODEL.WEIGHT] + get_model_paths(join(homeDir, cfg.OUTPUT_DIR))
    output = {}
    for path in model_paths:
        cfg.MODEL.WEIGHT = path
        if "final" in path:
            # model_final.pth corresponds to the last training iteration
            ite = cfg.SOLVER.MAX_ITER
        elif "no" in path:
            # checkpoint names containing "no" are treated as iteration 0
            ite = 0
        else:
            # parse the iteration index from names like model_0001234.pth
            ite = int(path.split("_")[1].split(".")[0])
        output[ite] = inf(args, cfg)
    plot(output, cfg)

def get_model_paths(directory):
    # All .pth checkpoint files directly inside `directory`.
    onlyfiles = [f for f in listdir(directory) if isfile(join(directory, f))]
    return [join(directory, file) for file in onlyfiles if ".pth" in file]

def main():
    parser = argparse.ArgumentParser(description="PyTorch Object Detection Inference")
    parser.add_argument(
        "--config-file",
        default="/home/nprasad/Documents/github/maskrcnn-benchmark/configs/heads.yaml",
        metavar="FILE",
        help="path to config file",
    )
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument(
        "opts",
        help="Modify config options using the command-line",
        default=None,
        nargs=argparse.REMAINDER,
    )

    args = parser.parse_args()

    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    recordResults(args, cfg)

if __name__ == "__main__":
    main()

In the function recordResults(), MODEL.WEIGHT is modified and the accuracies are then plotted. However, the plotting function shows no change in accuracy over time: training AP50 is 100% from the start, and the validation and test accuracies stay constant throughout. During training, the losses on the training set do converge towards 0. What do you think could be wrong?

nprasad2021 commented 5 years ago

After training the model for different numbers of iterations and then running the script above, the accuracies do turn out to be different:

- Train the model for 100 iterations, then test: accuracy on the training set is 0.95.
- Train the model for 5000 iterations, then test: accuracy on the training set is 1.00.

However, running the script and evaluating accuracy at each checkpoint yields exactly the same accuracy at iterations 50, 100, 150, etc.

Therefore, I think there is a problem with the way the config is initialized - do the model weights actually change after the model has been initialized?

nprasad2021 commented 5 years ago

It seems that changing cfg.MODEL.WEIGHT systematically in the script in the comment above does not change which weights are actually loaded when the model is built. Is this correct?

fmassa commented 5 years ago

@nprasad2021 you are probably setting OUTPUT_DIR to the path where you trained your model. In that case, you'll always be picking up the last trained checkpoint.

It works this way in order to easily support restarting jobs.

So I'd recommend changing the OUTPUT_DIR to a different folder, and passing the MODEL.WEIGHT to be the path to your checkpoints.
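In config terms that amounts to something like the following (both paths are placeholders):

from maskrcnn_benchmark.config import cfg

# Placeholders: point MODEL.WEIGHT at the checkpoint to evaluate and OUTPUT_DIR
# at a fresh folder. If OUTPUT_DIR already contains a "last_checkpoint" file,
# the checkpointer resumes from it and MODEL.WEIGHT is ignored.
cfg.merge_from_list([
    "MODEL.WEIGHT", "training_output/model_0000100.pth",
    "OUTPUT_DIR", "eval_output/model_0000100",
])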

nprasad2021 commented 5 years ago

Thanks, this works!

muzammil360 commented 5 years ago

(Quoting @nprasad2021's earlier comment above, which adds the val() function to the trainer and hits the AttributeError.)
@nprasad2021, were you able to resolve the AttributeError: 'list' object has no attribute 'values' error? I am trying to do the same thing as you and I get the exact same error.

I would prefer not to evaluate all the model files later.

muzammil360 commented 5 years ago

I found the solution to the problem. inference() puts the model into evaluation mode via model.eval(), so once training resumes, the forward pass returns predictions instead of the loss dict. All we have to do is put the model back into training mode with model.train() before resuming training.
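Concretely, a minimal sketch of that fix in do_train(), building on the snippet posted earlier in this thread:

if iteration % checkpoint_period == 0:
    checkpointer.save("model_{:07d}".format(iteration), **arguments)
    output[iteration] = val(cfg, model, distributed)
    # inference() switched the model to eval mode; switch back so the next
    # forward pass returns the loss dict again instead of predictions.
    model.train()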