Error - Githubissues

Aaron-huangyangyang commented 3 months ago

Traceback (most recent call last):
  File "train_net.py", line 281, in <module>
    launch(
  File "/home/mrp/Aaron/Project/hyy/MEPU/detectron2/detectron2/engine/launch.py", line 69, in launch
    mp.start_processes(
  File "/home/mrp/miniconda3/envs/orth/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/mrp/miniconda3/envs/orth/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/mrp/miniconda3/envs/orth/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/mrp/Aaron/Project/hyy/MEPU/detectron2/detectron2/engine/launch.py", line 123, in _distributed_worker
    main_func(*args)
  File "/home/mrp/HYY/OrthogonalDet/train_net.py", line 272, in main
    trainer.resume_or_load(resume=args.resume)
  File "/home/mrp/Aaron/Project/hyy/MEPU/detectron2/detectron2/engine/defaults.py", line 414, in resume_or_load
    self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=resume)
  File "/home/mrp/miniconda3/envs/orth/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 225, in resume_or_load
    return self.load(path)
  File "/home/mrp/Aaron/Project/hyy/MEPU/detectron2/detectron2/checkpoint/detection_checkpoint.py", line 62, in load
    ret = super().load(path, *args, **kwargs)
  File "/home/mrp/miniconda3/envs/orth/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 166, in load
    obj.load_state_dict(checkpoint.pop(key))
  File "/home/mrp/Aaron/Project/hyy/MEPU/detectron2/detectron2/engine/defaults.py", line 505, in load_state_dict
    self._trainer.load_state_dict(state_dict["_trainer"])
  File "/home/mrp/Aaron/Project/hyy/MEPU/detectron2/detectron2/engine/train_loop.py", line 430, in load_state_dict
    self.optimizer.load_state_dict(state_dict["optimizer"])
  File "/home/mrp/miniconda3/envs/orth/lib/python3.8/site-packages/torch/optim/optimizer.py", line 141, in load_state_dict
    raise ValueError("loaded state dict has a different number of "
ValueError: loaded state dict has a different number of parameter groups

Could you tell me what is causing this problem?

feifeiobama commented 3 months ago

This error seems very similar to https://github.com/feifeiobama/OrthogonalDet/issues/2. Please check out this reply: https://github.com/feifeiobama/OrthogonalDet/issues/2#issuecomment-2181978357.
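
For context, the ValueError at the bottom of the traceback comes from a length check in `torch.optim.Optimizer.load_state_dict`: the number of parameter groups in the newly built optimizer is compared with the number stored in the checkpoint. The sketch below mirrors that comparison on plain dictionaries so it runs without PyTorch. The helper name `describe_group_mismatch` and the toy group contents are hypothetical and are not part of OrthogonalDet or detectron2.

```python
from typing import Optional


def describe_group_mismatch(current_groups: list, saved_state: dict) -> Optional[str]:
    """Return a message if the parameter-group counts differ, else None.

    Mirrors the length comparison that raises the ValueError in the traceback.
    """
    saved_groups = saved_state["param_groups"]
    if len(current_groups) != len(saved_groups):
        return (
            f"optimizer has {len(current_groups)} parameter group(s), "
            f"checkpoint has {len(saved_groups)}"
        )
    return None


# Toy data (hypothetical): the new run builds two groups, the checkpoint stored one.
current = [{"lr": 1e-4, "params": [0, 1]}, {"lr": 1e-5, "params": [2]}]
saved = {"state": {}, "param_groups": [{"lr": 1e-4, "params": [0, 1, 2]}]}

print(describe_group_mismatch(current, saved))
```

With real objects, the same comparison would presumably be made between `optimizer.param_groups` and the optimizer state stored in the checkpoint being resumed. A mismatch usually means the checkpoint was written by a run whose optimizer was built differently.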

Aaron-huangyangyang commented 3 months ago

Thank you for your reply. I will follow up on this issue under #2.

Aaron-huangyangyang commented 3 months ago

Hi, thank you very much for your reply. I tried the method from #2, but the results on Task 2 are very poor, as shown in the screenshot above. Task 1 is normal. May I ask what I might be doing wrong?

feifeiobama commented 3 months ago

Your results for Task 2 are close to the second row of Table 4b in the paper, which removes the calibration design. Could you use git diff to check if the code or configuration related to the calibration design is changed?

If the code is fine, I think you may be auto-resuming from a wrong checkpoint. I suggest deleting the current checkpoint folder and re-running the experiments for Task 1 and Task 2. Hope to hear from you soon!
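
As a rough way to see what an auto-resume would pick up before deleting anything: to my knowledge, the fvcore checkpointer used by detectron2 records the most recent checkpoint name in a small text file called `last_checkpoint` inside the output folder. The sketch below only reads that file. The function name `find_auto_resume_target` and the example folder path are assumptions for illustration.

```python
import os
from typing import Optional


def find_auto_resume_target(output_dir: str) -> Optional[str]:
    """Return the checkpoint path named in `last_checkpoint`, or None if absent."""
    marker = os.path.join(output_dir, "last_checkpoint")
    if not os.path.isfile(marker):
        return None  # nothing to resume from; training would start from MODEL.WEIGHTS
    with open(marker) as f:
        name = f.read().strip()
    return os.path.join(output_dir, name)


if __name__ == "__main__":
    # "output/task1" is a hypothetical folder name; substitute your own OUTPUT_DIR.
    print(find_auto_resume_target("output/task1"))
```

If the printed path belongs to a different task or an older run than expected, that would be consistent with resuming from the wrong checkpoint.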

Aaron-huangyangyang commented 3 months ago

Thank you, I will try it.

GitHubAaronhuang commented 2 months ago

Dear Author, in Task 1 I attempted to replicate the results of your paper using two 3090 GPUs. The mAP for known classes is 56.3, and the U-Recall for unknown classes is 24.6. My experimental setup uses a batch size of 6, a learning rate of 0.0000125, and 40,000 iterations. However, I was unable to reach the results reported in the paper. Could you please advise on why this might be? Below is my evaluation log:

[08/16 14:58:20 d2.evaluation.evaluator]: Inference done 5111/5123. Dataloading: 0.0010 s/iter. Inference: 0.3472 s/iter. Eval: 0.0003 s/iter. Total: 0.3484 s/iter. ETA=0:00:04 [08/16 14:58:25 d2.evaluation.evaluator]: Total inference time: 0:29:44.255176 (0.348624 s / iter per device, on 2 devices) [08/16 14:58:25 d2.evaluation.evaluator]: Total inference pure compute time: 0:29:36 (0.347161 s / iter per device, on 2 devices) [08/16 15:00:22 core.pascal_voc_evaluation]: Evaluating my_val using 2012 metric. Note that results do not use the official Matlab API. [08/16 15:00:22 core.pascal_voc_evaluation]: aeroplane has 1211 predictions. [08/16 15:00:22 core.pascal_voc_evaluation]: bicycle has 2479 predictions. [08/16 15:00:22 core.pascal_voc_evaluation]: bird has 2867 predictions. [08/16 15:00:22 core.pascal_voc_evaluation]: boat has 2746 predictions. [08/16 15:00:22 core.pascal_voc_evaluation]: bottle has 9826 predictions. [08/16 15:00:23 core.pascal_voc_evaluation]: bus has 976 predictions. [08/16 15:00:23 core.pascal_voc_evaluation]: car has 13807 predictions. [08/16 15:00:23 core.pascal_voc_evaluation]: cat has 1136 predictions. [08/16 15:00:23 core.pascal_voc_evaluation]: chair has 14496 predictions. [08/16 15:00:24 core.pascal_voc_evaluation]: cow has 1936 predictions. [08/16 15:00:24 core.pascal_voc_evaluation]: diningtable has 2962 predictions. [08/16 15:00:24 core.pascal_voc_evaluation]: dog has 1660 predictions. [08/16 15:00:24 core.pascal_voc_evaluation]: horse has 1310 predictions. [08/16 15:00:25 core.pascal_voc_evaluation]: motorbike has 1615 predictions. [08/16 15:00:25 core.pascal_voc_evaluation]: person has 89446 predictions. [08/16 15:00:28 core.pascal_voc_evaluation]: pottedplant has 6267 predictions. [08/16 15:00:28 core.pascal_voc_evaluation]: sheep has 1833 predictions. [08/16 15:00:28 core.pascal_voc_evaluation]: sofa has 1953 predictions. [08/16 15:00:29 core.pascal_voc_evaluation]: train has 947 predictions. 
[08/16 15:00:29 core.pascal_voc_evaluation]: tvmonitor has 2545 predictions. [08/16 15:00:29 core.pascal_voc_evaluation]: truck has 1 predictions. [08/16 15:00:29 core.pascal_voc_evaluation]: traffic light has 1 predictions. [08/16 15:00:29 core.pascal_voc_evaluation]: fire hydrant has 1 predictions. [08/16 15:00:29 core.pascal_voc_evaluation]: stop sign has 1 predictions. [08/16 15:00:29 core.pascal_voc_evaluation]: parking meter has 1 predictions. [08/16 15:00:29 core.pascal_voc_evaluation]: bench has 1 predictions. [08/16 15:00:30 core.pascal_voc_evaluation]: elephant has 1 predictions. [08/16 15:00:30 core.pascal_voc_evaluation]: bear has 1 predictions. [08/16 15:00:30 core.pascal_voc_evaluation]: zebra has 1 predictions. [08/16 15:00:30 core.pascal_voc_evaluation]: giraffe has 1 predictions. [08/16 15:00:30 core.pascal_voc_evaluation]: backpack has 1 predictions. [08/16 15:00:30 core.pascal_voc_evaluation]: umbrella has 1 predictions. [08/16 15:00:30 core.pascal_voc_evaluation]: handbag has 1 predictions. [08/16 15:00:30 core.pascal_voc_evaluation]: tie has 1 predictions. [08/16 15:00:31 core.pascal_voc_evaluation]: suitcase has 1 predictions. [08/16 15:00:31 core.pascal_voc_evaluation]: microwave has 1 predictions. [08/16 15:00:31 core.pascal_voc_evaluation]: oven has 1 predictions. [08/16 15:00:31 core.pascal_voc_evaluation]: toaster has 1 predictions. [08/16 15:00:31 core.pascal_voc_evaluation]: sink has 1 predictions. [08/16 15:00:31 core.pascal_voc_evaluation]: refrigerator has 1 predictions. [08/16 15:00:31 core.pascal_voc_evaluation]: frisbee has 1 predictions. [08/16 15:00:31 core.pascal_voc_evaluation]: skis has 1 predictions. [08/16 15:00:31 core.pascal_voc_evaluation]: snowboard has 1 predictions. [08/16 15:00:32 core.pascal_voc_evaluation]: sports ball has 1 predictions. [08/16 15:00:32 core.pascal_voc_evaluation]: kite has 1 predictions. [08/16 15:00:32 core.pascal_voc_evaluation]: baseball bat has 1 predictions. 
[08/16 15:00:32 core.pascal_voc_evaluation]: baseball glove has 1 predictions. [08/16 15:00:32 core.pascal_voc_evaluation]: skateboard has 1 predictions. [08/16 15:00:32 core.pascal_voc_evaluation]: surfboard has 1 predictions. [08/16 15:00:32 core.pascal_voc_evaluation]: tennis racket has 1 predictions. [08/16 15:00:32 core.pascal_voc_evaluation]: banana has 1 predictions. [08/16 15:00:33 core.pascal_voc_evaluation]: apple has 1 predictions. [08/16 15:00:33 core.pascal_voc_evaluation]: sandwich has 1 predictions. [08/16 15:00:33 core.pascal_voc_evaluation]: orange has 1 predictions. [08/16 15:00:33 core.pascal_voc_evaluation]: broccoli has 1 predictions. [08/16 15:00:33 core.pascal_voc_evaluation]: carrot has 1 predictions. [08/16 15:00:33 core.pascal_voc_evaluation]: hot dog has 1 predictions. [08/16 15:00:33 core.pascal_voc_evaluation]: pizza has 1 predictions. [08/16 15:00:33 core.pascal_voc_evaluation]: donut has 1 predictions. [08/16 15:00:33 core.pascal_voc_evaluation]: cake has 1 predictions. [08/16 15:00:34 core.pascal_voc_evaluation]: bed has 1 predictions. [08/16 15:00:34 core.pascal_voc_evaluation]: toilet has 1 predictions. [08/16 15:00:34 core.pascal_voc_evaluation]: laptop has 1 predictions. [08/16 15:00:34 core.pascal_voc_evaluation]: mouse has 1 predictions. [08/16 15:00:34 core.pascal_voc_evaluation]: remote has 1 predictions. [08/16 15:00:34 core.pascal_voc_evaluation]: keyboard has 1 predictions. [08/16 15:00:34 core.pascal_voc_evaluation]: cell phone has 1 predictions. [08/16 15:00:34 core.pascal_voc_evaluation]: book has 1 predictions. [08/16 15:00:35 core.pascal_voc_evaluation]: clock has 1 predictions. [08/16 15:00:35 core.pascal_voc_evaluation]: vase has 1 predictions. [08/16 15:00:35 core.pascal_voc_evaluation]: scissors has 1 predictions. [08/16 15:00:35 core.pascal_voc_evaluation]: teddy bear has 1 predictions. [08/16 15:00:35 core.pascal_voc_evaluation]: hair drier has 1 predictions. 
[08/16 15:00:35 core.pascal_voc_evaluation]: toothbrush has 1 predictions. [08/16 15:00:35 core.pascal_voc_evaluation]: wine glass has 1 predictions. [08/16 15:00:35 core.pascal_voc_evaluation]: cup has 1 predictions. [08/16 15:00:35 core.pascal_voc_evaluation]: fork has 1 predictions. [08/16 15:00:36 core.pascal_voc_evaluation]: knife has 1 predictions. [08/16 15:00:36 core.pascal_voc_evaluation]: spoon has 1 predictions. [08/16 15:00:36 core.pascal_voc_evaluation]: bowl has 1 predictions. [08/16 15:00:36 core.pascal_voc_evaluation]: unknown has 125604 predictions. [08/16 15:00:39 core.pascal_voc_evaluation]: Wilderness Impact: {0.1: {50: 0.016709147550269046}, 0.2: {50: 0.025452637103122535}, 0.3: {50: 0.033592128526404794}, 0.4: {50: 0.03633155370554701}, 0.5: {50: 0.03828654154387053}, 0.6: {50: 0.037859079083813925}, 0.7: {50: 0.03297283644462813}, 0.8: {50: 0.024514630016146184}, 0.9: {50: 0.02672367717797969}} [08/16 15:00:39 core.pascal_voc_evaluation]: avg_precision: {0.1: {50: 0.09522357103641799}, 0.2: {50: 0.047256374538186666}, 0.3: {50: 0.030528999362651372}, 0.4: {50: 0.030528999362651372}, 0.5: {50: 0.030528999362651372}, 0.6: {50: 0.030528999362651372}, 0.7: {50: 0.030528999362651372}, 0.8: {50: 0.030528999362651372}, 0.9: {50: 0.030528999362651372}} [08/16 15:00:39 core.pascal_voc_evaluation]: Absolute OSE (total_num_unk_det_as_known): {50: 4322.0} [08/16 15:00:39 core.pascal_voc_evaluation]: total_num_unk 15606 [08/16 15:00:39 core.pascal_voc_evaluation]: ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'truck', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 
'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'bed', 'toilet', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'unknown'] [08/16 15:00:39 core.pascal_voc_evaluation]: AP50: ['78.4', '58.2', '61.9', '39.6', '31.7', '66.7', '54.7', '82.3', '22.0', '63.9', '19.8', '77.8', '75.7', '64.5', '52.5', '23.7', '68.2', '46.9', '78.3', '59.6', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '2.2'] [08/16 15:00:39 core.pascal_voc_evaluation]: Precisions50: ['25.0', '17.0', '17.9', '11.3', '8.6', '30.9', '14.8', '40.4', '8.5', '13.9', '15.4', '35.3', '28.3', '25.8', '12.6', '8.9', '15.0', '18.8', '33.8', '18.2', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '3.1'] [08/16 15:00:39 core.pascal_voc_evaluation]: Recall50: ['89.3', '72.3', '75.6', '66.0', '58.3', '79.0', '70.4', '93.3', '48.0', '83.6', '50.3', '90.7', '88.4', '78.4', '78.6', '69.2', '86.1', '73.1', '88.7', '81.8', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', 
'0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '24.6'] [08/16 15:00:39 core.pascal_voc_evaluation]: Current class AP50: 56.320744998302885 [08/16 15:00:39 core.pascal_voc_evaluation]: Current class Precisions50: 20.026987413894506 [08/16 15:00:39 core.pascal_voc_evaluation]: Current class Recall50: 76.05670949281635 [08/16 15:00:39 core.pascal_voc_evaluation]: Known AP50: 56.320744998302885 [08/16 15:00:39 core.pascal_voc_evaluation]: Known Precisions50: 20.026987413894506 [08/16 15:00:39 core.pascal_voc_evaluation]: Known Recall50: 76.05670949281635 [08/16 15:00:39 core.pascal_voc_evaluation]: Unknown AP50: 2.189276604752284 [08/16 15:00:39 core.pascal_voc_evaluation]: Unknown Precisions50: 3.0508582529218815 [08/16 15:00:39 core.pascal_voc_evaluation]: Unknown Recall50: 24.554658464693066 [08/16 15:00:39 d2.engine.defaults]: Evaluation results for my_val in csv format: [08/16 15:00:39 d2.evaluation.testing]: copypaste: Task: bbox [08/16 15:00:39 d2.evaluation.testing]: copypaste: AP,AP50 [08/16 15:00:39 d2.evaluation.testing]: copypaste: 13.9334,13.9334
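
A note on reading the log: the final `copypaste` AP50 of 13.9334 looks far lower than the Known AP50 of 56.3, but the two figures are consistent with averaging the same per-class list over different subsets. The sketch below recomputes both from the rounded AP50 values printed in the log. The 20/60/1 split of known, not-yet-introduced, and unknown entries is read off the class list, and the variable names are my own.

```python
# Per-class AP50 for the 20 Task 1 (known) classes, copied from the log above.
known_ap50 = [
    78.4, 58.2, 61.9, 39.6, 31.7, 66.7, 54.7, 82.3, 22.0, 63.9,
    19.8, 77.8, 75.7, 64.5, 52.5, 23.7, 68.2, 46.9, 78.3, 59.6,
]
future_ap50 = [0.0] * 60   # classes not yet introduced in Task 1
unknown_ap50 = [2.2]       # the single "unknown" entry

all_ap50 = known_ap50 + future_ap50 + unknown_ap50

known_mean = sum(known_ap50) / len(known_ap50)   # compare with "Known AP50"
overall_mean = sum(all_ap50) / len(all_ap50)     # compare with "copypaste" AP50

print(f"mean over {len(known_ap50)} known classes: {known_mean:.2f}")
print(f"mean over all {len(all_ap50)} entries:   {overall_mean:.2f}")
```

Under this reading, the figure to compare against the paper is the Known AP50 (together with the Unknown Recall50), not the `copypaste` line.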

feifeiobama commented 2 months ago

This result is likely due to the small batch size, which reduces the effectiveness of the "prediction orthogonalization" design in Section 3.4. Please see a similar issue in https://github.com/feifeiobama/OrthogonalDet/issues/2#issuecomment-2191792254 and the subsequent replies.

GitHubAaronhuang commented 2 months ago

Will removing prediction orthogonalization result in further performance degradation? I initially trained with a batch size of 12, but the results were still U-Recall of 22.1 and mAP of 61.8.

feifeiobama commented 2 months ago

I think your results with a batch size of 12 are pretty close to the results in our paper. Please keep this batch size and the prediction orthogonalization loss if you have sufficient resources.

GitHubAaronhuang commented 2 months ago

Thank you very much for your response. I only have two 3090 GPUs available, so I will try increasing the batch size and see how it goes.

GitHubAaronhuang commented 2 months ago

Dear author, is it possible to modify the code so that it does not depend on a large batch size? When I ran Task 2 again, I hit another error, which may be caused by the large batch size, but a smaller batch size does not produce good results. The following is my device information and the error message:

RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 1800000ms for send operation to complete

/home/mrp/anaconda3/envs/mepu/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 40 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

feifeiobama commented 2 months ago

You can try to modify the code to support FP16 training, which halves the memory cost. It should look like this:

import torch

with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):
    ...  # forward pass and loss computation go here; call backward() outside this block
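
To make the "halves the memory cost" arithmetic concrete, here is a minimal sketch of per-tensor storage in FP32 versus FP16, using only the standard library. The activation shape is a made-up example, and `tensor_bytes` is a hypothetical helper. Note also that autocast leaves the model parameters in FP32 and only runs eligible operations in FP16, so the saving for a whole training step is typically smaller than the per-tensor ratio shown here.

```python
import struct
from math import prod


def tensor_bytes(shape, fmt):
    """Bytes needed to store a dense tensor of `shape` with element format `fmt`."""
    return prod(shape) * struct.calcsize(fmt)


# Hypothetical activation shape: (batch, channels, height, width).
shape = (6, 256, 200, 304)

fp32 = tensor_bytes(shape, "<f")  # "<f" = IEEE 754 single precision, 4 bytes
fp16 = tensor_bytes(shape, "<e")  # "<e" = IEEE 754 half precision, 2 bytes

print(f"FP32: {fp32 / 2**20:.1f} MiB, FP16: {fp16 / 2**20:.1f} MiB, ratio: {fp32 / fp16:.1f}")
```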

GitHubAaronhuang commented 2 months ago

Thank you, I will try.

feifeiobama / OrthogonalDet

Error #5