Open nprasad2021 opened 5 years ago

❓ Questions and Help

Hi,
I have two questions relating to storing training/val/test accuracies/losses -
Hi,
1 - We don't compute the mAP during training. You could compute it for the training set, but it would take quite some time.
2 - You can modify this function to store the mAP / losses in the format that you want: https://github.com/facebookresearch/maskrcnn-benchmark/blob/d28845e112de36781b2b5f7217a34b2b62de8d2f/maskrcnn_benchmark/data/datasets/evaluation/coco/coco_eval.py#L63
Let me know if you have further questions.
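For example, a minimal sketch of such a modification (this assumes the COCOResults object built in that function exposes a results dict keyed by task and metric, which matches how it is indexed later in this thread; save_map_summary is a hypothetical helper, not part of the repo):

import json
import os

def save_map_summary(coco_results, output_folder):
    # coco_results.results is assumed to map task -> {metric: value},
    # e.g. coco_results.results['bbox']['AP50']
    summary = {task: {metric: float(value) for metric, value in metrics.items()}
               for task, metrics in coco_results.results.items()}
    if output_folder:
        with open(os.path.join(output_folder, "map_summary.json"), "w") as f:
            json.dump(summary, f, indent=2)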
Thanks! I've added the following function "val" to maskrcnn_benchmark.engine.trainer:
# Note: this assumes os, make_data_loader, inference and mkdir are imported
# at the top of trainer.py (torch is already imported there).
def val(cfg, model, distributed=False):
    if distributed:
        model = model.module
    torch.cuda.empty_cache()  # TODO check if it helps
    iou_types = ("bbox",)
    if cfg.MODEL.MASK_ON:
        iou_types = iou_types + ("segm",)
    output_folders = [None] * (len(cfg.DATASETS.TEST) + len(cfg.DATASETS.TRAIN))
    dataset_names = cfg.DATASETS.TEST + cfg.DATASETS.TRAIN
    if cfg.OUTPUT_DIR:
        for idx, dataset_name in enumerate(dataset_names):
            print(dataset_name)
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference", dataset_name)
            mkdir(output_folder)
            output_folders[idx] = output_folder
    data_loaders_val = make_data_loader(cfg, is_train=False, is_distributed=distributed)
    output_tuple = {}
    for output_folder, dataset_name, data_loader_val in zip(output_folders, dataset_names, data_loaders_val):
        result = inference(
            model,
            data_loader_val,
            dataset_name=dataset_name,
            iou_types=iou_types,
            box_only=cfg.MODEL.RPN_ONLY,
            device=cfg.MODEL.DEVICE,
            expected_results=cfg.TEST.EXPECTED_RESULTS,
            expected_results_sigma_tol=cfg.TEST.EXPECTED_RESULTS_SIGMA_TOL,
            output_folder=output_folder,
        )[0].results['bbox']
        output_tuple[dataset_name] = {}
        output_tuple[dataset_name]['AP'] = result['AP'].item()
        output_tuple[dataset_name]['AP50'] = result['AP50'].item()
    return output_tuple
I also added the following to the function do_train():
        if iteration % checkpoint_period == 0:
            checkpointer.save("model_{:07d}".format(iteration), **arguments)
            print("ENTER VALIDATION CALCULATIONS")
            output[iteration] = val(cfg, model, distributed)
I am getting the following error:
Traceback (most recent call last):
  File "tools/train_net.py", line 184, in <module>
    main()
  File "tools/train_net.py", line 177, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 77, in train
    distributed
  File "/data/home/nprasad/Documents/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 113, in do_train
    losses = sum(loss for loss in loss_dict.values())
AttributeError: 'list' object has no attribute 'values'
Do you have any idea what might have caused this error? Alternatively, I could create a separate script to evaluate all models saved by the checkpointer, similar to tools/test_net.py. How do I select a specific model path for this script to use in evaluation?
I know that line 61 in test_net.py builds the model:
model = build_detection_model(cfg)
How do I specify a model saved at an earlier iteration?
Just use tools/test_net.py and pass MODEL.WEIGHT to be the checkpoint you want. That will be much easier, I believe, and you'll not need to handle multi-processing due to distributed training.
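For example (the config path and checkpoint name below are placeholders for your own files):

python tools/test_net.py --config-file configs/heads.yaml MODEL.WEIGHT output/model_0005000.pth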
Thanks - the following script, adapted from test_net.py, performs the appropriate calculations:
from maskrcnn_benchmark.utils.env import setup_environment  # noqa F401 isort:skip

import argparse
import os, pickle, sys
print(sys.path)
from os import listdir
from os.path import isfile, join

import torch
from maskrcnn_benchmark.config import cfg
from maskrcnn_benchmark.data import make_data_loader
from maskrcnn_benchmark.engine.inference import inference
from maskrcnn_benchmark.modeling.detector import build_detection_model
from maskrcnn_benchmark.utils.checkpoint import DetectronCheckpointer
from maskrcnn_benchmark.utils.collect_env import collect_env_info
from maskrcnn_benchmark.utils.comm import synchronize, get_rank
from maskrcnn_benchmark.utils.logger import setup_logger
from maskrcnn_benchmark.utils.miscellaneous import mkdir
from maskrcnn_benchmark.engine.plotMaps import plot


def inf(args, cfg):
    num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
    distributed = num_gpus > 1

    if distributed:
        torch.cuda.set_device(args.local_rank)
        torch.distributed.init_process_group(
            backend="nccl", init_method="env://"
        )

    save_dir = os.path.join(cfg.OUTPUT_DIR, "testInf")
    mkdir(save_dir)
    logger = setup_logger("maskrcnn_benchmark", save_dir, get_rank())
    logger.info("Using {} GPUs".format(num_gpus))
    logger.info(cfg)

    logger.info("Collecting env info (might take some time)")
    logger.info("\n" + collect_env_info())

    print(cfg.MODEL.WEIGHT)
    model = build_detection_model(cfg)
    model.to(cfg.MODEL.DEVICE)

    output_dir = cfg.OUTPUT_DIR
    checkpointer = DetectronCheckpointer(cfg, model, save_dir=output_dir)
    _ = checkpointer.load(cfg.MODEL.WEIGHT)

    iou_types = ("bbox",)
    if cfg.MODEL.MASK_ON:
        iou_types = iou_types + ("segm",)
    output_folders = [None] * len(cfg.DATASETS.TEST)
    dataset_names = cfg.DATASETS.TEST
    print("Dataset Names", dataset_names)
    if cfg.OUTPUT_DIR:
        for idx, dataset_name in enumerate(dataset_names):
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference", dataset_name)
            mkdir(output_folder)
            output_folders[idx] = output_folder
    data_loaders_val = make_data_loader(cfg, is_train=False, is_distributed=distributed)
    output_tuple = {}
    for output_folder, dataset_name, data_loader_val in zip(output_folders, dataset_names, data_loaders_val):
        r = inference(
            model,
            data_loader_val,
            dataset_name=dataset_name,
            iou_types=iou_types,
            box_only=cfg.MODEL.RPN_ONLY,
            device=cfg.MODEL.DEVICE,
            expected_results=cfg.TEST.EXPECTED_RESULTS,
            expected_results_sigma_tol=cfg.TEST.EXPECTED_RESULTS_SIGMA_TOL,
            output_folder=output_folder,
        )[0].results['bbox']
        output_tuple[dataset_name] = {}
        output_tuple[dataset_name]['AP'] = r['AP'].item()
        output_tuple[dataset_name]['AP50'] = r['AP50'].item()
        synchronize()
    return output_tuple


def recordResults(args, cfg):
    homeDir = "/home/nprasad/Documents/github/maskrcnn-benchmark"
    model_paths = [cfg.MODEL.WEIGHT] + get_model_paths(join(homeDir, cfg.OUTPUT_DIR))
    output = {}
    for path in model_paths:
        cfg.MODEL.WEIGHT = path
        if "final" in path:
            ite = cfg.SOLVER.MAX_ITER
        elif "no" in path:
            ite = 0
        else:
            ite = int(path.split("_")[1].split(".")[0])
        output[ite] = inf(args, cfg)
    plot(output, cfg)


def get_model_paths(directory):
    onlyfiles = [f for f in listdir(directory) if isfile(join(directory, f))]
    return [join(directory, file) for file in onlyfiles if ".pth" in file]


def main():
    parser = argparse.ArgumentParser(description="PyTorch Object Detection Inference")
    parser.add_argument(
        "--config-file",
        default="/home/nprasad/Documents/github/maskrcnn-benchmark/configs/heads.yaml",
        metavar="FILE",
        help="path to config file",
    )
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument(
        "opts",
        help="Modify config options using the command-line",
        default=None,
        nargs=argparse.REMAINDER,
    )

    args = parser.parse_args()

    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)

    recordResults(args, cfg)


if __name__ == "__main__":
    main()
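If you don't have the custom plotMaps module, a simple alternative is to dump the per-checkpoint results to JSON and plot them separately. A sketch follows; dump_results is a hypothetical helper, and output is assumed to have the shape produced by recordResults above (iteration -> {dataset: {"AP": ..., "AP50": ...}}):

import json
import os

def dump_results(output, cfg):
    # output: {iteration: {dataset_name: {"AP": float, "AP50": float}}}
    path = os.path.join(cfg.OUTPUT_DIR, "checkpoint_map.json")
    with open(path, "w") as f:
        json.dump(output, f, indent=2, sort_keys=True)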
In the function recordResults(), MODEL.WEIGHT is modified and the accuracies are then plotted. However, the plot shows no change in accuracy over time: training AP50 is 100% from the start, and validation and test accuracy stay at a constant value throughout. During training, the losses do converge towards 0 on the training set. What do you think could be wrong?
After training the model for different numbers of iterations and then evaluating, the accuracies do differ:
Train the model for 100 iterations - accuracy on the training set is 0.95.
Train the model for 5000 iterations - accuracy on the training set is 1.00.
However, running the script above and evaluating accuracy at each checkpoint yields exactly the same accuracy for iterations 50, 100, 150, etc.
Therefore, I think there is a problem with the way the config is initialized - do the model weights actually change after the model has been initialized?
It seems that changing cfg.MODEL.WEIGHT systematically in the script in the comment above does not change which weights are actually loaded when the model is built. Is this correct?
@nprasad2021 you are probably setting the OUTPUT_DIR to be the path where you trained your model. In that case, you'll always be picking the last trained checkpoint. It works this way in order to easily support restarting jobs.
So I'd recommend changing the OUTPUT_DIR to a different folder, and passing MODEL.WEIGHT to be the path to your checkpoint.
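For context, a sketch of why this happens, under the assumption that DetectronCheckpointer.load() first checks its save_dir for a last_checkpoint file and, if one exists, loads that file instead of the path you pass in. With the script above, one way to make the explicitly chosen weights take effect is to build the checkpointer with a save_dir that contains no last_checkpoint file:

    # inside inf(): use the separate 'testInf' folder (or any folder without a
    # last_checkpoint file) as save_dir, so that cfg.MODEL.WEIGHT is the file
    # that actually gets loaded rather than the newest training checkpoint
    checkpointer = DetectronCheckpointer(cfg, model, save_dir=save_dir)
    _ = checkpointer.load(cfg.MODEL.WEIGHT)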
Thanks, this works!
@nprasad2021, were you able to resolve the AttributeError: 'list' object has no attribute 'values' error? I am trying to do the same thing as you, but I get the exact same error. I would prefer not to evaluate all the model files after training.
I found the solution to the problem. It seems that inference() puts the model into evaluation mode via model.eval(). So once we resume training, the model returns predictions instead of losses. All we have to do is put the model back with model.train() before resuming the training process.
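A minimal sketch of the corrected checkpoint block in do_train(), following the snippet earlier in this thread (output is assumed to be a dict created before the training loop):

        if iteration % checkpoint_period == 0:
            checkpointer.save("model_{:07d}".format(iteration), **arguments)
            print("ENTER VALIDATION CALCULATIONS")
            output[iteration] = val(cfg, model, distributed)
            # inference() inside val() switched the model to eval mode;
            # switch back so the next iteration returns the loss dict again
            model.train()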