facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0
30.55k stars 7.49k forks

Train_net.py code error again and again import _C error #3163

Closed leesangjoon1 closed 3 years ago

leesangjoon1 commented 3 years ago

Instructions To Reproduce the 🐛 Bug:

  1. Full runnable code or full changes you made:
    
    If making changes to the project itself, please use output of the following command:
    git rev-parse HEAD; git diff
```
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
"""
Detection Training Script.

This scripts reads a given config file and runs the training or evaluation.
It is an entry point that is made to train standard models in detectron2.

In order to let one script support training of many models,
this script contains logic that are specific to these built-in models and therefore
may not be suitable for your own project.
For example, your research project perhaps only needs a single "evaluator".

Therefore, we recommend you to use detectron2 as an library and take
this file as an example of how to use the library.
You may want to write your own script with your datasets and other customizations.
"""

import logging
import os
from collections import OrderedDict
import torch

import detectron2.utils.comm as comm
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultTrainer, default_argument_parser, default_setup, hooks, launch
from detectron2.evaluation import (
    CityscapesInstanceEvaluator,
    CityscapesSemSegEvaluator,
    COCOEvaluator,
    COCOPanopticEvaluator,
    DatasetEvaluators,
    LVISEvaluator,
    PascalVOCDetectionEvaluator,
    SemSegEvaluator,
    verify_results,
)
from detectron2.modeling import GeneralizedRCNNWithTTA
from detectron2.modeling import GeneralizedRCNNWithTTA
from detectron2.data.datasets import register_coco_instances

register_coco_instances(
    "data_train",
    {},
    "/home/sangjoon/detectron2/sangjoon/for_newthing_0331/white_train.json",
    "/home/sangjoon/detectron2/sangjoon/white_train2020",
)
register_coco_instances(
    "data_val",
    {},
    "/home/sangjoon/detectron2/sangjoon/for_newthing_0331/white_test.json",
    "/home/sangjoon/detectron2/sangjoon/white_test2020",
)


class Trainer(DefaultTrainer):
    """
    We use the "DefaultTrainer" which contains pre-defined default logic for
    standard training workflow. They may not work for you, especially if you
    are working on a new research project. In that case you can write your
    own training loop. You can use "tools/plain_train_net.py" as an example.
    """

    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        """
        Create evaluator(s) for a given dataset.
        This uses the special metadata "evaluator_type" associated with each builtin dataset.
        For your own dataset, you can simply create an evaluator manually in your
        script and do not have to worry about the hacky if-else logic here.
        """
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
        evaluator_list = []
        evaluator_type = MetadataCatalog.get(dataset_name).evaluator_type
        if evaluator_type in ["sem_seg", "coco_panoptic_seg"]:
            evaluator_list.append(
                SemSegEvaluator(
                    dataset_name,
                    distributed=True,
                    num_classes=cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES,
                    ignore_label=cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE,
                    output_dir=output_folder,
                )
            )
        if evaluator_type in ["coco", "coco_panoptic_seg"]:
            evaluator_list.append(COCOEvaluator(dataset_name, cfg, True, output_folder))
        if evaluator_type == "coco_panoptic_seg":
            evaluator_list.append(COCOPanopticEvaluator(dataset_name, output_folder))
        if evaluator_type == "cityscapes_instance":
            assert (
                torch.cuda.device_count() >= comm.get_rank()
            ), "CityscapesEvaluator currently do not work with multiple machines."
            return CityscapesInstanceEvaluator(dataset_name)
        if evaluator_type == "cityscapes_sem_seg":
            assert (
                torch.cuda.device_count() >= comm.get_rank()
            ), "CityscapesEvaluator currently do not work with multiple machines."
            return CityscapesSemSegEvaluator(dataset_name)
        elif evaluator_type == "pascal_voc":
            return PascalVOCDetectionEvaluator(dataset_name)
        elif evaluator_type == "lvis":
            return LVISEvaluator(dataset_name, cfg, True, output_folder)
        if len(evaluator_list) == 0:
            raise NotImplementedError(
                "no Evaluator for the dataset {} with the type {}".format(
                    dataset_name, evaluator_type
                )
            )
        elif len(evaluator_list) == 1:
            return evaluator_list[0]
        return DatasetEvaluators(evaluator_list)

    @classmethod
    def test_with_TTA(cls, cfg, model):
        logger = logging.getLogger("detectron2.trainer")
        # In the end of training, run an evaluation with TTA
        # Only support some R-CNN models.
        logger.info("Running inference with test-time augmentation ...")
        model = GeneralizedRCNNWithTTA(cfg, model)
        evaluators = [
            cls.build_evaluator(
                cfg, name, output_folder=os.path.join(cfg.OUTPUT_DIR, "inference_TTA")
            )
            for name in cfg.DATASETS.TEST
        ]
        res = cls.test(cfg, model, evaluators)
        res = OrderedDict({k + "_TTA": v for k, v in res.items()})
        return res


def setup(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    default_setup(cfg, args)
    return cfg


def main(args):
    cfg = setup(args)

    if args.eval_only:
        model = Trainer.build_model(cfg)
        DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
            cfg.MODEL.WEIGHTS, resume=args.resume
        )
        res = Trainer.test(cfg, model)
        if cfg.TEST.AUG.ENABLED:
            res.update(Trainer.test_with_TTA(cfg, model))
        if comm.is_main_process():
            verify_results(cfg, res)
        return res

    """
    If you'd like to do anything fancier than the standard training logic,
    consider writing your own training loop (see plain_train_net.py) or
    subclassing the trainer.
    """
    trainer = Trainer(cfg)
    trainer.resume_or_load(resume=args.resume)
    if cfg.TEST.AUG.ENABLED:
        trainer.register_hooks(
            [hooks.EvalHook(0, lambda: trainer.test_with_TTA(cfg, trainer.model))]
        )
    return trainer.train()


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    print("Command Line Args:", args)
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
```

2. What exact command you run:

```
python train_net.py --num-gpus 4 --config-file /home/sangjoon/detectron2/configs/sangjoon.yaml
```

3. __Full logs__ or other relevant observations:

```
Traceback (most recent call last):
  File "train_net.py", line 27, in <module>
    from detectron2.data import MetadataCatalog
  File "/home/sangjoon/detectron2/detectron2/data/__init__.py", line 4, in <module>
    from .build import (
  File "/home/sangjoon/detectron2/detectron2/data/build.py", line 12, in <module>
    from detectron2.structures import BoxMode
  File "/home/sangjoon/detectron2/detectron2/structures/__init__.py", line 6, in <module>
    from .keypoints import Keypoints, heatmaps_to_keypoints
  File "/home/sangjoon/detectron2/detectron2/structures/keypoints.py", line 6, in <module>
    from detectron2.layers import interpolate
  File "/home/sangjoon/detectron2/detectron2/layers/__init__.py", line 3, in <module>
    from .deform_conv import DeformConv, ModulatedDeformConv
  File "/home/sangjoon/detectron2/detectron2/layers/deform_conv.py", line 10, in <module>
    from detectron2 import _C
ImportError: /home/sangjoon/detectron2/detectron2/_C.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe28TypeMeta21_typeMetaDataInstanceIdEEPKNS_6detail12TypeMetaDataEv
```

4. please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

## Expected behavior:

If there are no obvious error in "full logs" provided above, please tell us the expected behavior.

## Environment:

Provide your environment information using the following command:

```
wget -nc -q https://github.com/facebookresearch/detectron2/raw/master/detectron2/utils/collect_env.py && python collect_env.py
torch version is 1.8.0
cuda version is 10.2
I don't know whether cuDNN is installed, but training was working before
```

If your issue looks like an installation issue / environment issue, please first try to solve it yourself with the instructions in https://detectron2.readthedocs.io/tutorials/install.html#common-installation-issues

I was running deep learning training with detectron2, but after I set up a YOLO environment it no longer works. I don't know whether that is the cause. I would like to solve this problem; please help. Thank you.

ppwwyyxx commented 3 years ago

https://detectron2.readthedocs.io/en/latest/tutorials/install.html#common-installation-issues has answers:

> **Undefined symbols that contain TH, aten, torch, caffe2; missing torch dynamic libraries; segmentation fault immediately when using detectron2.**
>
> This usually happens when detectron2 or torchvision is not compiled with the version of PyTorch you're running.
>
> - If the error comes from a pre-built torchvision, uninstall torchvision and pytorch and reinstall them following pytorch.org, so that the versions match.
> - If the error comes from a pre-built detectron2, check the release notes to see the corresponding pytorch version required for each pre-built detectron2, or uninstall and reinstall the correct pre-built detectron2.
> - If the error comes from detectron2 or torchvision that you built manually from source, remove the files you built (`build/`, `**/*.so`) and rebuild so it can pick up the version of pytorch currently in your environment.
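
For anyone hitting the same error, here is a rough sketch of how the two checks in that answer could be scripted. It is not part of detectron2: the helper names `looks_like_torch_mismatch` and `find_stale_build_artifacts` are made up for illustration, and the second one only lists `build/` and compiled `*.so` files under a source checkout, it does not delete anything.

```python
from pathlib import Path

# Substrings the install guide names as signs of a PyTorch version mismatch.
MISMATCH_HINTS = ("TH", "aten", "torch", "caffe2")


def looks_like_torch_mismatch(error_message: str) -> bool:
    """Hypothetical helper: True if an 'undefined symbol' error names a PyTorch-related symbol."""
    marker = "undefined symbol:"
    if marker not in error_message:
        return False
    # Everything after the marker is the (mangled) symbol name.
    symbol = error_message.split(marker, 1)[1].strip()
    return any(hint in symbol for hint in MISMATCH_HINTS)


def find_stale_build_artifacts(repo_root: str) -> list:
    """Hypothetical helper: list build/ and compiled *.so files under repo_root without deleting them."""
    root = Path(repo_root)
    stale = []
    build_dir = root / "build"
    if build_dir.is_dir():
        stale.append(build_dir)
    # Recursive search, the same idea as the **/*.so pattern in the install guide.
    stale.extend(p for p in root.rglob("*.so") if p.is_file())
    return sorted(stale)


if __name__ == "__main__":
    # The error message reported in this issue.
    message = (
        "ImportError: /home/sangjoon/detectron2/detectron2/"
        "_C.cpython-38-x86_64-linux-gnu.so: undefined symbol: "
        "_ZN6caffe28TypeMeta21_typeMetaDataInstanceIdEEPKNS_6detail12TypeMetaDataEv"
    )
    print("Looks like a PyTorch mismatch:", looks_like_torch_mismatch(message))
```

Calling `find_stale_build_artifacts` on the detectron2 checkout would show which files to remove before rebuilding against the PyTorch that is currently installed. Review the list before deleting anything, since it also picks up any unrelated `.so` files stored under the same directory.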