Closed tshead2 closed 4 years ago
The traceback seems to imply that ground truth data is missing,
That's correct, and it's because the default dataloader for the test set does not include ground truth: https://github.com/facebookresearch/detectron2/blob/ee0cbd8c67622ff2753493ce0dcfeb3e3dd9945e/detectron2/data/dataset_mapper.py#L109-L112
You can provide a mapper= argument to create a dataloader that loads test data with ground truth.
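For example, something like this should work (a minimal sketch; it assumes your validation set is registered under a name such as "my_validation_set"):
from detectron2.data import DatasetMapper, build_detection_test_loader

# is_train=True makes the mapper keep the ground-truth annotations
val_loader = build_detection_test_loader(cfg, "my_validation_set",
                                         mapper=DatasetMapper(cfg, is_train=True))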
However, switching to a training loader produces a different error:
That's because you're not calling data.build_detection_train_loader following its API: https://detectron2.readthedocs.io/modules/data.html#detectron2.data.build_detection_train_loader
Ah, copy-and-paste error. It's working now, thanks for the assist.
Cheers, Tim
Hi @tshead2,
after creating the hook class, I performed the following:
valLoss = ValidationLoss(cfg, 'my_validation_set')
hooks = [valLoss]
trainer.register_hooks(hooks)
DefaultTrainer.build_test_loader(cfg, "my_validation_set")
I still get the same error. Do I have to create my own mapper function? Can you provide a template?
Thanks.
Hi,
I have a hacky solution for this, I'll leave it here in case anyone needs it or someone has suggestions on how to improve it.
import torch

from detectron2.engine import HookBase
from detectron2.data import build_detection_train_loader
import detectron2.utils.comm as comm

cfg.DATASETS.VAL = ("voc_2007_val",)

class ValidationLoss(HookBase):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg.clone()
        self.cfg.DATASETS.TRAIN = cfg.DATASETS.VAL
        self._loader = iter(build_detection_train_loader(self.cfg))

    def after_step(self):
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)
            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict
            loss_dict_reduced = {"val_" + k: v.item() for k, v in
                                 comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss for loss in loss_dict_reduced.values())
            if comm.is_main_process():
                self.trainer.storage.put_scalars(total_val_loss=losses_reduced,
                                                 **loss_dict_reduced)
And then
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = Trainer(cfg)
val_loss = ValidationLoss(cfg)
trainer.register_hooks([val_loss])
# swap the order of PeriodicWriter and ValidationLoss
trainer._hooks = trainer._hooks[:-2] + trainer._hooks[-2:][::-1]
trainer.resume_or_load(resume=False)
trainer.train()
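If the index-based swap looks too cryptic, an equivalent way to make sure PeriodicWriter runs last (the same idea is used later in this thread) is to filter the hooks by type; this assumes PeriodicWriter is imported from detectron2.engine.hooks:
from detectron2.engine.hooks import PeriodicWriter

periodic_writer_hooks = [h for h in trainer._hooks if isinstance(h, PeriodicWriter)]
other_hooks = [h for h in trainer._hooks if not isinstance(h, PeriodicWriter)]
trainer._hooks = other_hooks + periodic_writer_hooks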
Ah, copy-and-paste error. It's working now, thanks for the assist.
Cheers, Tim
Hi @tshead2, could you please mention the copy-and-paste error? How did you get it to work using build_detection_train_loader?
Hi, I have written it and commented the code, you can see it here: https://medium.com/@apofeniaco/training-on-detectron2-with-a-validation-set-and-plot-loss-on-it-to-avoid-overfitting-6449418fbf4e or just the gist here: https://gist.github.com/ortegatron/c0dad15e49c2b74de8bb09a5615d9f6b
@ortegatron Aren't you accumulating gradients in your implementation?
@alono88 can you please suggest how that could be happening? At the code level, I'm just doing the sum on each iteration, but maybe I'm missing something at a general level of how gradients behave.
@ortegatron My mistake. It seems in this discussion that using torch.no_grad() only affects the memory and no intermediate tensors are stored.
@ortegatron, first of all thank you for your code, it's very helpful!
I have the same question as @alono88. In your code, shouldn't the model be switched to eval mode (model.eval()) somewhere so that you don't accumulate the gradients? [inference_on_dataset does that by calling inference_context in evaluator.py.] Or did I miss something?
Once again, thank you!
@wesleylp eval mode does not affect gradient accumulation; it adjusts layers such as dropout. In addition, using eval mode will cause the model to output predictions instead of loss values, so you will have nothing to write.
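For what it's worth, a quick way to see the difference between the two modes (a sketch; model and data as in the snippets above):
model.train()
with torch.no_grad():
    loss_dict = model(data)    # dict of losses, e.g. {"loss_cls": ..., "loss_box_reg": ...}

model.eval()
with torch.no_grad():
    outputs = model(data)      # list of dicts with an "instances" field, no losses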
@ortegatron I was trying to run your code using multiple GPUs and it does not work. Have you had experience with such setting or did you run it on a single gpu?
Hi Alono, nice of you to answer, I was about to look into wesleylp's question.
I have only tried it on a single GPU, no idea what changes multiple GPUs would imply, sorry.
Hi I tried your code but after running validation it just hangs and does not run anything else. Please help me. Thank you very much. After a while, an error popped up: RuntimeError: [/opt/conda/conda-bld/pytorch_1587428207430/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete
@dangmanhtruong1995 hi, have you solved it?
Hi Alono, nice of you to answer, I was about to look into wesleylp's question. I have only tried it on a single GPU, no idea what changes multiple GPUs would imply, sorry.
Hi I tried your code but after running validation it just hangs and does not run anything else. Please help me. Thank you very much. After a while, an error popped up: RuntimeError: [/opt/conda/conda-bld/pytorch_1587428207430/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete
@dangmanhtruong1995 hi, have you solved it?
Hi Alono, nice of you to answer, I was about to look into wesleylp's question. I have only tried it on a single GPU, no idea what changes multiple GPUs would imply, sorry.
Hi I tried your code but after running validation it just hangs and does not run anything else. Please help me. Thank you very much. After a while, an error popped up: RuntimeError: [/opt/conda/conda-bld/pytorch_1587428207430/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete
Hi, I have not been able to solve it.
I copied the idea from @mnslarcher and wrote the following two functions for my keypoint detector (resnet50) algorithm.
import torch

import detectron2.utils.comm as comm
from detectron2.data import build_detection_train_loader

def build_valid_loader(cfg):
    _cfg = cfg.clone()
    _cfg.defrost()  # make this cfg mutable.
    _cfg.DATASETS.TRAIN = cfg.DATASETS.TEST
    return build_detection_train_loader(_cfg)

def store_valid_loss(model, data, storage):
    training_mode = model.training
    with torch.no_grad():
        loss_dict = model(data)
        losses = sum(loss_dict.values())
        assert torch.isfinite(losses).all(), loss_dict
        loss_dict_reduced = {k: v.item()
                             for k, v in comm.reduce_dict(loss_dict).items()}
        losses_reduced = sum(loss for loss in loss_dict_reduced.values())
        if comm.is_main_process():
            storage.put_scalars(val_loss=losses_reduced, **loss_dict_reduced)
    model.train(training_mode)
Then, in plain_train_net.py, I am calling them as below:
val_data_loader = build_valid_loader(cfg)
logger.info("Starting training from iteration {}".format(start_iter))
with EventStorage(start_iter) as storage:
for data, val_data, iteration in zip(data_loader, val_data_loader, range(start_iter, max_iter)):
iteration = iteration + 1
..
..
#At the end of the for loop.
# Calculate and log validation loss.
store_valid_loss(model, val_data, storage)
After 1k iterations, loss_keypoint is increasing, but total_loss is the same compared to a run without the store_valid_loss call. What am I missing? Can anyone please help me understand?
Hi,
I have a hacky solution for this, I'll leave it here in case anyone needs it or someone has suggestions on how to improve it.
I just wondered if there was a way to only calculate the validation loss every 500 iterations instead of every 20? I found that your code even works on my multi-GPU setup, but calculating the validation loss every 20 iterations is very costly time-wise.
Hi @bconsolvo-zvelo, it's been a while since I've played with this library, but something like this PROBABLY (I'm not 100% sure) works:
YOUR_MAGIC_NUMBER = 42
class ValidationLoss(HookBase):
def __init__(self, cfg):
super().__init__()
self.cfg = cfg.clone()
self.cfg.DATASETS.TRAIN = cfg.DATASETS.VAL
self._loader = iter(build_detection_train_loader(self.cfg))
self.num_steps = 0
def after_step(self):
self.num_steps += 1
if self.num_steps % YOUR_MAGIC_NUMBER == 0:
data = next(self._loader)
with torch.no_grad():
loss_dict = self.trainer.model(data)
losses = sum(loss_dict.values())
assert torch.isfinite(losses).all(), loss_dict
loss_dict_reduced = {"val_" + k: v.item() for k, v in
comm.reduce_dict(loss_dict).items()}
losses_reduced = sum(loss for loss in loss_dict_reduced.values())
if comm.is_main_process():
self.trainer.storage.put_scalars(total_val_loss=losses_reduced,
**loss_dict_reduced)
else:
pass
Now I'm sure you can do better than this; for example, you probably don't have to re-define a concept like "num_steps", and instead of hard-coding a number you can have something like this:
cfg.VAL_INTERVAL = 42
...
if self.num_steps % self.cfg.VAL_INTERVAL == 0:
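Putting that together, a minimal sketch of a periodic version of the hook (untested; the class name is just illustrative, and it assumes cfg.DATASETS.VAL was added as in the earlier snippet and that the hook can read the current iteration from self.trainer.iter):
import torch
import detectron2.utils.comm as comm
from detectron2.data import build_detection_train_loader
from detectron2.engine import HookBase

class PeriodicValidationLoss(HookBase):
    def __init__(self, cfg, eval_period):
        super().__init__()
        self.cfg = cfg.clone()
        self.cfg.DATASETS.TRAIN = cfg.DATASETS.VAL
        self._period = eval_period
        self._loader = iter(build_detection_train_loader(self.cfg))

    def after_step(self):
        # only compute the validation loss every `eval_period` iterations
        if self._period <= 0 or (self.trainer.iter + 1) % self._period != 0:
            return
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)
            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict
            loss_dict_reduced = {"val_" + k: v.item()
                                 for k, v in comm.reduce_dict(loss_dict).items()}
            if comm.is_main_process():
                self.trainer.storage.put_scalars(total_val_loss=sum(loss_dict_reduced.values()),
                                                 **loss_dict_reduced)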
I didn't test this solution, so sorry if you find out that for some reason it doesn't work. If it does work, or if you find a better solution, please comment here so others can also benefit from it.
Thank you for your comments. For whatever reason, I am finding that calculating the validation loss at different steps produces different validation loss results (drastic orders of magnitude difference). Not sure if it is my setup/data or something inherent in the code. But trying to resolve it.
I have also heard the suggestion of not using hooks, but rather using "run_step" as seen here: https://tshafer.com/blog/2020/06/detectron2-eval-loss
Still investigating. Thank you for your prompt reply.
For some reason tensorboard will not display the validation-loss at all when using mnslarcher's code (only run if self.num_steps % MAGIC_NUM == 0) - the val_losses are computed and shown in the console, but somehow tensorboard does not like them.. Validation losses show fine if it runs on every call...
@mnslarcher
On a 4xGPU setup, if I tell it to calculate the validation loss on the same iteration as I calculate my coco_eval results, it hangs indefinitely, just before finishing the inference calculation. Every other iteration works except on the exact one where it is calculating the coco_eval inference. Just very strange behaviour. It also seems a bit odd that now I have to calculate inference on all of my validation data twice: once for the coco_eval results, and then on another iteration for calculating the validation loss. Both are doing inference and comparing them to ground truth: coco_eval produces AP results, and the other produces just validation losses. Would be nice to combine somehow, and figure out why it is breaking whenever I put the
cfg.TEST.EVAL_PERIOD as the same iteration as where I am telling it to calculate the validation loss.
Some other questions:
- On the iteration where I tell it to calculate the validation loss, is it just not calculating the normal total loss, and only calculating the validation loss?
- Can you elaborate on what trainer._hooks = trainer._hooks[:-2] + trainer._hooks[-2:][::-1] does? I am confused about why you have to index things this way.
- Is there any way I can verify that it is really getting losses from all 4 GPUs and combining them?
- Why do you not use comm.synchronize()? I thought this was necessary for 4 GPUs. Thanks!
Hi @bconsolvo-zvelo,
That said, I'm not sure; I'm not an expert on this library and I haven't used it in a long time, so it's better to open a specific issue so someone more expert than me can answer your questions.
Hi Alono, nice of you to answer, I was about to look into wesleylp's question. I have only tried it on a single GPU, no idea what changes multiple GPUs would imply, sorry.
Hi I tried your code but after running validation it just hangs and does not run anything else. Please help me. Thank you very much. After a while, an error popped up: RuntimeError: [/opt/conda/conda-bld/pytorch_1587428207430/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete
Did you find the solution to this issue?
Hi Alono, nice of you to answer, I was about to look into wesleylp's question. I have only tried it on a single GPU, no idea what changes multiple GPUs would imply, sorry.
Hi I tried your code but after running validation it just hangs and does not run anything else. Please help me. Thank you very much. After a while, an error popped up: RuntimeError: [/opt/conda/conda-bld/pytorch_1587428207430/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete
Did you find the solution to this issue?
Hi, no I have not.
Hi Alono, nice of you to answer, I was about to look into wesleylp's question.
I have only tried it on a single GPU, no idea what changes multiple GPUs would imply, sorry.
Fix for the issue:
def build_hooks(self):
    hooks = super().build_hooks()
    hooks.insert(-1, LossEvalHook(
        self.cfg.TEST.EVAL_PERIOD,
        self.model,
        build_detection_test_loader(
            self.cfg,
            self.cfg.DATASETS.TEST[0],
            DatasetMapper(self.cfg, True)
        )
    ))
    # swap the order of PeriodicWriter and ValidationLoss
    # code hangs when the number of GPUs > 1 if this line is removed
    hooks = hooks[:-2] + hooks[-2:][::-1]
    return hooks
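For completeness, since LossEvalHook is not defined in this comment, here is a condensed sketch of what the hook from the gist linked earlier in this thread does (simplified and untested; the real version also logs progress and ETA):
import torch
import detectron2.utils.comm as comm
from detectron2.engine import HookBase

class LossEvalHook(HookBase):
    def __init__(self, eval_period, model, data_loader):
        self._period = eval_period
        self._model = model
        self._data_loader = data_loader

    def _do_loss_eval(self):
        # average the loss over the whole validation set
        losses = []
        with torch.no_grad():
            for data in self._data_loader:
                loss_dict = self._model(data)
                loss_dict_reduced = comm.reduce_dict(loss_dict)
                losses.append(sum(v.item() for v in loss_dict_reduced.values()))
        mean_loss = sum(losses) / max(len(losses), 1)
        if comm.is_main_process():
            self.trainer.storage.put_scalar("validation_loss", mean_loss)
        comm.synchronize()

    def after_step(self):
        next_iter = self.trainer.iter + 1
        if self._period > 0 and next_iter % self._period == 0:
            self._do_loss_eval()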
I extended the code above to log both the train and val loss in the same graph in TensorBoard. I put it here because I think it could be useful for others ending up here.
This is what your TB log will look like eventually
To do this, first create a custom tensorboard writer:
import os
from torch.utils.tensorboard import SummaryWriter
from detectron2.utils.events import EventWriter, get_event_storage
class CustomTensorboardXWriter(EventWriter):
"""
Writes scalars and images based on storage key to train or val tensorboard file.
"""
def __init__(self, log_dir: str, window_size: int = 20, **kwargs):
"""
Args:
log_dir (str): the base directory to save the output events. This class creates two subdirs in log_dir
window_size (int): the scalars will be median-smoothed by this window size
kwargs: other arguments passed to `torch.utils.tensorboard.SummaryWriter(...)`
"""
self._window_size = window_size
# separate the writers into a train and a val writer
train_writer_path = os.path.join(log_dir,"train")
os.makedirs(train_writer_path, exist_ok=True)
self._writer_train = SummaryWriter(train_writer_path, **kwargs)
val_writer_path = os.path.join(log_dir,"val")
os.makedirs(val_writer_path, exist_ok=True)
self._writer_val = SummaryWriter(val_writer_path, **kwargs)
def write(self):
storage = get_event_storage()
for k, (v, iter) in storage.latest_with_smoothing_hint(self._window_size).items():
if k.startswith("val_"):
k = k.replace("val_","")
self._writer_val.add_scalar(k, v, iter)
else:
self._writer_train.add_scalar(k, v, iter)
        if len(storage._vis_data) >= 1:
            for img_name, img, step_num in storage._vis_data:
                # route on the image name (the scalar key "k" from the loop above is stale here)
                if img_name.startswith("val_"):
                    self._writer_val.add_image(img_name, img, step_num)
                else:
                    self._writer_train.add_image(img_name, img, step_num)
# Storage stores all image data and rely on this writer to clear them.
# As a result it assumes only one writer will use its image data.
# An alternative design is to let storage store limited recent
# data (e.g. only the most recent image) that all writers can access.
# In that case a writer may not see all image data if its period is long.
storage.clear_images()
if len(storage._histograms) >= 1:
for params in storage._histograms:
self._writer_train.add_histogram_raw(**params)
storage.clear_histograms()
    def close(self):
        # check one of the writers we actually created (they don't exist when the code fails at import)
        if hasattr(self, "_writer_train"):
            self._writer_train.close()
            self._writer_val.close()
Then register this writer in your trainer. It will plot train and val metrics in the same graph:
class Trainer(DefaultTrainer):
@classmethod
def build_evaluator(cls, cfg, dataset_name, output_folder=None):
if output_folder is None:
output_folder = os.path.join(cfg.OUTPUT_DIR,"inference")
return COCOEvaluator(dataset_name, cfg, True, output_folder)
def build_writers(self):
"""
Overwrites the default writers to contain our custom tensorboard writer
Returns:
list[EventWriter]: a list of :class:`EventWriter` objects.
"""
return [
CommonMetricPrinter(self.max_iter),
JSONWriter(os.path.join(self.cfg.OUTPUT_DIR, "metrics.json")),
CustomTensorboardXWriter(self.cfg.OUTPUT_DIR),
]
Hi, all! Thanks for this great work. @marijnl, could you also share your train_net and your ValidationHook and how those tie in together? Many thanks!
As the validation loss hook, I use a slightly modified version of the code earlier in this thread.
import torch
from detectron2.data.build import build_detection_test_loader
from detectron2.engine import HookBase
import detectron2.utils.comm as comm
class ValLossHook(HookBase):
def __init__(self, cfg):
super().__init__()
self.cfg = cfg.clone()
self._loader = iter(build_detection_test_loader(self.cfg, "my_dataset_val"))
def after_step(self):
"""
After each step calculates the validation loss and adds it to the train storage
"""
data = next(self._loader)
with torch.no_grad():
loss_dict = self.trainer.model(data)
losses = sum(loss_dict.values())
assert torch.isfinite(losses).all(), loss_dict
loss_dict_reduced = {"val_" + k: v.item() for k, v in comm.reduce_dict(loss_dict).items()}
losses_reduced = sum(loss for loss in loss_dict_reduced.values())
if comm.is_main_process():
self.trainer.storage.put_scalars(val_total_loss=losses_reduced,
**loss_dict_reduced)
Then to tie things together I use:
# setup trainer
trainer = Trainer(cfg)
# creates a hook that after each iter calculates the validation loss on the next batch
# Register the hooks
trainer.register_hooks(
[ValLossHook(cfg)]
)
# The PeriodicWriter needs to be the last hook, otherwise it wont have access to valloss metrics
# Ensure PeriodicWriter is the last called hook
periodic_writer_hook = [hook for hook in trainer._hooks if isinstance(hook, PeriodicWriter)]
all_other_hooks = [hook for hook in trainer._hooks if not isinstance(hook, PeriodicWriter)]
trainer._hooks = all_other_hooks + periodic_writer_hook
trainer.resume_or_load(resume=args.resume)
@marijnl I get the following error when I try to register the ValLossHook you have specified here... any ideas? Thank you so much for this post, there is no other content I can find regarding posting these charts to tensorboard!
File "/workspace/product/science/visual_chunk_detector/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/workspace/product/science/visual_chunk_detector/.venv/lib/python3.7/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 470, in forward
assert gt_instances is not None, "RPN requires gt_instances in training!"
AssertionError: RPN requires gt_instances in training!
Here is the relevant code I'm using:
class CustomTensorboardXWriter(EventWriter):
"""
Writes scalars and images based on storage key to train or val tensorboard file.
Reference: https://github.com/facebookresearch/detectron2/issues/810#issuecomment-933314459
"""
def __init__(self, log_dir: str, window_size: int = 20, **kwargs):
"""
Args:
log_dir (str): the base directory to save the output events. This class creates two subdirs in log_dir
window_size (int): the scalars will be median-smoothed by this window size
kwargs: other arguments passed to `torch.utils.tensorboard.SummaryWriter(...)`
"""
self._window_size = window_size
self.logger = logging.getLogger(__name__)
# separate the writers into a train and a val writer
train_writer_path = os.path.join(log_dir, "train")
os.makedirs(train_writer_path, exist_ok=True)
self._writer_train = SummaryWriter(train_writer_path, **kwargs)
val_writer_path = os.path.join(log_dir, "val")
os.makedirs(val_writer_path, exist_ok=True)
self._writer_val = SummaryWriter(val_writer_path, **kwargs)
def write(self):
storage = get_event_storage()
for k, (v, iter) in storage.latest_with_smoothing_hint(self._window_size).items():
if k.startswith("val_"):
k = k.replace("val_", "")
self._writer_val.add_scalar(k, v, iter)
else:
self._writer_train.add_scalar(k, v, iter)
if len(storage._vis_data) >= 1:
for img_name, img, step_num in storage._vis_data:
self.logger.info(f"processing key {k} with info {img_name}, {img}, {step_num}")
if k.startswith("val_"):
k = k.replace("val_", "")
self._writer_val.add_image(img_name, img, step_num)
else:
self._writer_train.add_image(img_name, img, step_num)
# Storage stores all image data and rely on this writer to clear them.
# As a result it assumes only one writer will use its image data.
# An alternative design is to let storage store limited recent
# data (e.g. only the most recent image) that all writers can access.
# In that case a writer may not see all image data if its period is long.
storage.clear_images()
if len(storage._histograms) >= 1:
for params in storage._histograms:
self._writer_train.add_histogram_raw(**params)
storage.clear_histograms()
def close(self):
if hasattr(self, "_writer"): # doesn't exist when the code fails at import
self._writer_train.close()
self._writer_val.close()
class ValLossHook(HookBase):
def __init__(self, cfg, validation_set_key):
super().__init__()
self.cfg = cfg.clone()
self._loader = iter(build_detection_test_loader(self.cfg, validation_set_key))
def after_step(self):
"""
After each step calculates the validation loss and adds it to the train storage
"""
data = next(self._loader)
with torch.no_grad():
loss_dict = self.trainer.model(data)
losses = sum(loss_dict.values())
assert torch.isfinite(losses).all(), loss_dict
loss_dict_reduced = {"val_" + k: v.item() for k, v in comm.reduce_dict(loss_dict).items()}
losses_reduced = sum(loss for loss in loss_dict_reduced.values())
if comm.is_main_process():
self.trainer.storage.put_scalars(val_total_loss=losses_reduced, **loss_dict_reduced)
class Trainer(DefaultTrainer):
@classmethod
def build_evaluator(cls, cfg, dataset_name, output_folder=None):
if output_folder is None:
output_folder = os.path.join(cfg.OUTPUT_DIR, "eval_output")
return COCOEvaluator(dataset_name, cfg, True, output_folder)
def build_writers(self):
"""
Overwrites the default writers to contain our custom tensorboard writer
Returns:
list[EventWriter]: a list of :class:`EventWriter` objects.
"""
return [
CommonMetricPrinter(self.max_iter),
JSONWriter(os.path.join(self.cfg.OUTPUT_DIR, "metrics.json")),
CustomTensorboardXWriter(self.cfg.OUTPUT_DIR),
]
# load the datasets
DatasetCatalog.clear()
register_coco_instances(self.DATASET_TRAIN_KEY, {}, str(train_annotations_json), str(train_images))
if eval_annotations_json and eval_images:
register_coco_instances(self.DATASET_EVAL_KEY, {}, str(eval_annotations_json), str(eval_images))
dataset_metadata: Metadata = MetadataCatalog.get(self.DATASET_TRAIN_KEY)
catalog = DatasetCatalog.get(
self.DATASET_TRAIN_KEY
) # we need to load the dataset to get the thing_classes to fully load in dataset_metadata
total_files = len(catalog)
thing_classes = dataset_metadata.get("thing_classes")
self._label_map = {}
for i in range(len(thing_classes)):
self._label_map[i] = thing_classes[i]
if not self._cfg:
# we are training a model from scratch, so use the default starter config from the model zoo
self._cfg = get_cfg()
self._cfg.merge_from_file(model_zoo.get_config_file(self.config_path))
# Let weights initialize from model zoo
self._cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(self.config_path)
if self._cfg.is_frozen():
raise Exception(
"This model is frozen, since it has been used for tagging. Please load a fresh config that has not been used for tagging, if you want to fine tune."
)
self.logger.info(f"Training {model_output_path} for {epochs} epochs with config {self.config_path}")
# Customize the config
self._cfg.MODEL.MASK_ON = False # we are only doing bounding boxes, no masks
# Calculate iterations (which are NOT the same as epochs) to get to the specified number of epochs
self._cfg.SOLVER.IMS_PER_BATCH = 2
one_epoch = int(total_files / self._cfg.SOLVER.IMS_PER_BATCH)
self._cfg.SOLVER.MAX_ITER = int(one_epoch * epochs)
self._cfg.SOLVER.BASE_LR = 0.001
self._cfg.SOLVER.STEPS = [] # do not decay learning rate
self._cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512
self._cfg.MODEL.ROI_HEADS.NUM_CLASSES = len(thing_classes)
self._cfg.OUTPUT_DIR = str(model_output_path.absolute())
os.makedirs(self._cfg.OUTPUT_DIR, exist_ok=True)
self._cfg.DATALOADER.NUM_WORKERS = 2
self._cfg.DATASETS.TRAIN = (self.DATASET_TRAIN_KEY,)
if eval_annotations_json and eval_images:
self._cfg.DATASETS.TEST = (self.DATASET_EVAL_KEY,)
self._cfg.TEST.EVAL_PERIOD = one_epoch
self.logger.info(
f"Calculating validation loss every {self._cfg.TEST.EVAL_PERIOD} iterations against given eval data"
)
else:
self._cfg.DATASETS.TEST = ()
# Pick the correct training hardware
if torch.cuda.is_available():
self._cfg.MODEL.DEVICE = "cuda"
else:
self._cfg.MODEL.DEVICE = "cpu"
# Set up trainer
trainer = Trainer(self._cfg)
# creates a hook that after each iter calculates the validation loss on the next batch
trainer.register_hooks(
[ValLossHook(self._cfg, self.DATASET_EVAL_KEY)]
)
# The PeriodicWriter needs to be the last hook, otherwise it wont have access to valloss metrics
# Ensure PeriodicWriter is the last called hook
periodic_writer_hook = [hook for hook in trainer._hooks if isinstance(hook, PeriodicWriter)]
all_other_hooks = [hook for hook in trainer._hooks if not isinstance(hook, PeriodicWriter)]
trainer._hooks = all_other_hooks + periodic_writer_hook
# Start training
device = str(self._cfg.MODEL.DEVICE)
self.logger.info(f"Training running on {device}")
# We always want to use the updated config, and only load pretrained weights, hence set resume=False in all cases
# ref: https://detectron2.readthedocs.io/en/latest/modules/engine.html#detectron2.engine.defaults.DefaultTrainer.resume_or_load
trainer.resume_or_load(resume=False)
trainer.train()
# Evaluate, if eval data specified
if eval_annotations_json and eval_images:
self.logger.info(self.eval(eval_annotations_json, eval_images))
# Save output artifacts
final_model_path = Path(self._cfg.OUTPUT_DIR) / "model_final.pth"
final_config_path = Path(self._cfg.OUTPUT_DIR) / "config.yaml"
final_label_map_path = Path(self._cfg.OUTPUT_DIR) / "label_map.json"
You need a custom data mapper, something like this:
import copy

import torch
from detectron2.data import detection_utils as utils
from detectron2.data.build import build_detection_test_loader

def custom_test_mapper(dataset_dict):
    # it will be modified by code below
    dataset_dict = copy.deepcopy(dataset_dict)
    image = utils.read_image(dataset_dict["file_name"], format="BGR")
    dataset_dict["image"] = torch.as_tensor(image.transpose(2, 0, 1).astype("float32"))
    # keep the annotations so the model can compute losses on them
    annos = dataset_dict.pop("annotations")
    instances = utils.annotations_to_instances(annos, image.shape[:2])
    dataset_dict["instances"] = utils.filter_empty_instances(instances)
    return dataset_dict

# to be used as a classmethod override on your Trainer subclass
def build_test_loader(cls, cfg, dataset_name="my_dataset_val"):
    return build_detection_test_loader(cfg, dataset_name, mapper=custom_test_mapper)
Hi,
I have a hacky solution for this, I'll leave it here in case anyone needs it or someone has suggestions on how to improve it.
Hi,
How do I write code to implement early stopping based on the validation loss? It doesn't work for me.
As the validation loss hook, I use a slightly modified version of the code earlier in this thread.
@marijnl Hi, I tried implementing your custom tensorboard writer along with the Validation Loss hook and training settings you provided. However, I get the following error: 'NameError: name 'PeriodicWriter' is not defined' Any idea what the solution is to this?
Another solution for this problem, which avoids creating a new custom dataset loader, is to tell DatasetMapper() to load the ground truth as well.
I prefer this approach because I don't want to manipulate the config, which matters if you, like me, freeze your config node.
import torch

import detectron2.utils.comm as comm
from detectron2.data import DatasetMapper, build_detection_test_loader
from detectron2.engine import HookBase

class ValLossHook(HookBase):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg.clone()
        # a test loader with a train-mode mapper keeps the ground-truth annotations
        self._loader = iter(build_detection_test_loader(
            self.cfg, self.cfg.DATASETS.TEST, mapper=DatasetMapper(self.cfg, is_train=True)))

    def after_step(self):
        """
        After each step calculates the validation loss and adds it to the train storage
        """
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)
            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict
            loss_dict_reduced = {"validation_" + k: v.item() for k, v in comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss for loss in loss_dict_reduced.values())
            if comm.is_main_process():
                self.trainer.storage.put_scalars(validation_total_loss=losses_reduced, **loss_dict_reduced)
@mnslarcher don't you have to call model.eval() in your after_step method to notify the batchnorm and dropout layers to work in eval mode? Otherwise you get inconsistent results in different runs...
Regarding the "NameError: name 'PeriodicWriter' is not defined" error above: PeriodicWriter is a hook (a child class of HookBase) defined in https://github.com/facebookresearch/detectron2/blob/main/detectron2/engine/hooks.py, so it has to be imported before you can filter on it.
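For the NameError, importing it is enough, e.g.:
from detectron2.engine.hooks import PeriodicWriter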
With DatasetMapper and is_train=True set, the code is throwing a StopIteration exception.
Please see the ValLossHook below:
class ValLossHook(HookBase):
def __init__(self, cfg, validation_set_key):
super().__init__()
self.cfg = cfg.clone()
self._loader = iter(build_detection_test_loader(self.cfg, validation_set_key,
mapper=DatasetMapper(self.cfg, is_train=True),
num_workers=1))
def after_step(self):
"""
After each step calculates the validation loss and adds it to the train storage
"""
print(type(self._loader), len(self._loader)) # just for debugging
data = next(self._loader)
with torch.no_grad():
loss_dict = self.trainer.model(data)
losses = sum(loss_dict.values())
assert torch.isfinite(losses).all(), loss_dict
loss_dict_reduced = {"val_" + k: v.item() for k, v in comm.reduce_dict(loss_dict).items()}
losses_reduced = sum(loss for loss in loss_dict_reduced.values())
if comm.is_main_process():
self.trainer.storage.put_scalars(val_total_loss=losses_reduced,
**loss_dict_reduced)
Below is the error trace:
<class 'torch.utils.data.dataloader._MultiProcessingDataLoaderIter'> 5
<class 'torch.utils.data.dataloader._MultiProcessingDataLoaderIter'> 5
<class 'torch.utils.data.dataloader._MultiProcessingDataLoaderIter'> 5
<class 'torch.utils.data.dataloader._MultiProcessingDataLoaderIter'> 5
<class 'torch.utils.data.dataloader._MultiProcessingDataLoaderIter'> 5
<class 'torch.utils.data.dataloader._MultiProcessingDataLoaderIter'> 5
ERROR [04/20 18:17:12 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 150, in train
self.after_step()
File "/home/ravi/.local/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 180, in after_step
h.after_step()
File "/home/ravi/detectron2_examples/train/val_loss_hook.py", line 37, in after_step
data = next(self._loader)
File "/home/ravi/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/ravi/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1176, in _next_data
raise StopIteration
StopIteration
What is wrong here?
Thanks a lot
I think it is because you are using build_detection_test_loader() here, which returns a torchdata.DataLoader that goes through the validation set exactly once, so data = next(self._loader) fails with StopIteration once the validation set is used up. You can confirm this by making a validation set that contains only one picture. My solution is to replace it with build_detection_train_loader(), which produces batched data indefinitely; that works in my project.
Please refer to build.py
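An alternative, if you want to keep build_detection_test_loader (so the loader still carries ground truth via the train-mode mapper), is to restart the iterator once the validation set is exhausted. A small untested sketch of that variant:
import torch
import detectron2.utils.comm as comm
from detectron2.data import DatasetMapper, build_detection_test_loader
from detectron2.engine import HookBase

class ValLossHook(HookBase):
    def __init__(self, cfg, validation_set_key):
        super().__init__()
        self.cfg = cfg.clone()
        # keep the loader object so the iterator can be re-created when it runs out
        self._data_loader = build_detection_test_loader(
            self.cfg, validation_set_key, mapper=DatasetMapper(self.cfg, is_train=True))
        self._loader = iter(self._data_loader)

    def _next_batch(self):
        try:
            return next(self._loader)
        except StopIteration:
            # the test loader goes through the validation set exactly once,
            # so restart it instead of crashing the training loop
            self._loader = iter(self._data_loader)
            return next(self._loader)

    def after_step(self):
        data = self._next_batch()
        with torch.no_grad():
            loss_dict = self.trainer.model(data)
            loss_dict_reduced = {"val_" + k: v.item()
                                 for k, v in comm.reduce_dict(loss_dict).items()}
            if comm.is_main_process():
                self.trainer.storage.put_scalars(
                    val_total_loss=sum(loss_dict_reduced.values()), **loss_dict_reduced)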
@Lihewin
Thanks a lot. build_detection_train_loader works like a charm.
BTW, do you have any suggestions/comments on https://github.com/facebookresearch/detectron2/issues/4922
I appreciate your time!
@Lihewin do you know how, with your code, I could run the validation evaluation not after each iteration but, for example, every 1000 iterations?
@geotsl
One idea is to keep a counter inside ValLossHook and use a conditional statement inside after_step to enable validation only periodically. However, TensorBoard may not be happy with it. A workaround is to write the validation loss to the terminal.
@ppwwyyxx I'm sorry to drag you back to this closed issue, but the continued activity in this thread suggests users are still having difficulty with this problem. Computing the validation loss is such a common use case that it would be very useful to have a canonical response in the documentation. Would you be receptive to a pull request formalising one of the solutions above for inclusion in the docs?
@ppwwyyxx I would like to second this request!
@mnslarcher On a 4xGPU setup, if I tell it to calculate the validation loss on the same iteration as I calculate my coco_eval results, it hangs indefinitely, just before finishing the inference calculation. Every other iteration works except on the exact one where it is calculating the coco_eval inference. Just very strange behaviour. It also seems a bit odd that now I have to calculate inference on all of my validation data twice: once for the coco_eval results, and then on another iteration for calculating the validation loss. Both are doing inference and comparing them to ground truth: coco_eval produces AP results, and the other produces just validation losses. Would be nice to combine somehow, and figure out why it is breaking whenever I put the
cfg.TEST.EVAL_PERIOD as the same iteration as where I am telling it to calculate the validation loss.
Some other questions:
- On the iteration where I tell it to calculate the validation loss, is it just not calculating the normal total loss, and only calculating validation loss?
- Can you elaborate on what this does below? I am confused by why you have to index things this way?
trainer._hooks = trainer._hooks[:-2] + trainer._hooks[-2:][::-1]
- Is there any way I can verify that it is really getting losses from all 4 GPUs and combining them?
- Why do you not use comm.synchronize()? I thought this was necessary for 4 GPUs. Thanks!
Hi, have you solved the problem? Hope you are doing well!
@xxxming730
What problem exactly are you trying to solve? Based on the info in this post and others, I was able to compute the validation loss. I used only a single GPU, however.
Working code is available here: ravijo/detectron2_tutorial
First of all, thank you for your reply. I want to train on multiple graphics cards and output the validation loss and evaluation results, but after using the above code, the console prints the value of validation_loss while there is no numerical record of the validation loss in metrics.json. So there is no validation loss curve in the plot after running PlotTogether.py. I was very troubled; it seemed that my output was different from everyone else's, because I did not see anyone else with the same question. Thank you again!
@xxxming730
I see. I can't say about the training on multiple graphics cards, but maybe you want to try on a single GPU first and then scale it to multiple GPUs. Using ravijo/detectron2_tutorial, I was able to get the plots on Tensorboard. My training was enough for a single GPU, so I did not explore multiple GPUs.
Hope it helps
Thank you very much, I will try as you said! Hope you are doing well every day!
@ravijo Hi, I am using the method below:
Hi, I have written it and commented the code, you can see it here: https://medium.com/@apofeniaco/training-on-detectron2-with-a-validation-set-and-plot-loss-on-it-to-avoid-overfitting-6449418fbf4e or just the gist here: https://gist.github.com/ortegatron/c0dad15e49c2b74de8bb09a5615d9f6b
A single GPU is fine: I can output validation_loss and the evaluation results and log them, but when training on multiple GPUs the output records are wrong. I think I just need to try to modify and look into the part of the code that handles output storage. Thanks again for helping me!
How do I compute validation loss during training?
I'm trying to compute the loss on a validation dataset for each iteration during training. To do so, I've created my own hook:
... which I register with a DefaultTrainer. The hook code is called during training, but fails with the following:
The traceback seems to imply that ground truth data is missing, which made me think that the data loader was the problem. However, switching to a training loader produces a different error:
As a sanity check, inference works just fine:
... but that isn't what I want, of course. Any thoughts?
Thanks in advance, Tim