CanPeng123 / Faster-ILOD


Cannot run the code successfully #1

Closed wassryan closed 3 years ago

wassryan commented 3 years ago

Hi, guys. It seems your code has a problem; did you check it before releasing it? For the first step, to obtain the teacher model, I ran the command python tools/train_first_step.py --config-file ./configs/e2e_faster_rcnn_R_50_C4_1x_Source_model.yaml after changing WEIGHT from xx/incremental_learning_ResNet50_C4/RPN_first_10_classes_40k_steps/model_final.pth to catalog://ImageNetPretrained/MSRA/R-50 (which I think is needed to run the code at all). I then hit this error:

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [16,0,0] Assertion `t >= 0 && t < n_classes` failed.

I dug into the code and traced the problem to /xx/Faster-ILOD/maskrcnn_benchmark/modeling/roi_heads/box_head/loss.py, line 147: classification_loss = F.cross_entropy(class_logits, labels)

I printed the labels and found that some values are larger than class_logits.size(1) (= 11); some are as high as 15, which is outside the valid index range of class_logits.

Can you guys help me clarify the problem? Thanks!
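The assertion `t >= 0 && t < n_classes` fails when a target label is not a valid column index of the logits. A minimal sketch of a pre-check that could be run on the labels before the loss call is given here; the helper name `find_out_of_range_labels` and the sample batch are hypothetical, and a tensor of labels would first need converting with `labels.tolist()`.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_labels(labels: Iterable[int], num_classes: int) -> List[Tuple[int, int]]:
    """Return (position, label) pairs whose label is not in [0, num_classes)."""
    return [
        (i, int(t))
        for i, t in enumerate(labels)
        if not 0 <= int(t) < num_classes
    ]


# Hypothetical batch: 11 logits per RoI (10 classes + background), one label of 15.
bad = find_out_of_range_labels([0, 0, 3, 15, 0, 10], num_classes=11)
print(bad)  # [(3, 15)]
```

A non-empty result points at a mismatch between the class indices produced by the dataset and NUM_CLASSES in the config, rather than at the loss function itself.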

CanPeng123 commented 3 years ago

Hi

For the first step, you also need to modify the following settings in the config file: WEIGHT, NUM_CLASSES, NAME_OLD_CLASSES, NAME_NEW_CLASSES, NAME_EXCLUDED_CLASSES

Normally I use e2e_faster_rcnn_R_50_C4_1x.yaml for the first step and e2e_faster_rcnn_R_50_C4_1x_Source_model.yaml to load the source model in the subsequent incremental steps.

You can use e2e_faster_rcnn_R_50_C4_1x.yaml as an example for the first step.

wassryan commented 3 years ago

Hi, thanks for your reply. I followed the steps you mentioned:

python tools/train_first_step.py --config-file ./configs/e2e_faster_rcnn_R_50_C4_1x.yaml

and I encountered another error:

Traceback (most recent call last):
  File "tools/train_first_step.py", line 233, in <module>
    main()
  File "tools/train_first_step.py", line 225, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_first_step.py", line 105, in train
    arguments,
  File "/home/xx/Faster-ILOD/maskrcnn_benchmark/engine/trainer.py", line 72, in do_train
    losses = sum(loss for loss in loss_dict.values())
AttributeError: 'tuple' object has no attribute 'values'

What's wrong with the code? Have you encountered the same error before?

P.S. I compiled the maskrcnn_benchmark in your repo using the setup.py from the official maskrcnn_benchmark.

Here is my environment:

pytorch 1.1.0 py3.7_cuda9.0.176_cudnn7.5.1_0
pytorch-nightly 1.0.0
torch 1.1.0
torchvision 0.3.0

wassryan commented 3 years ago

Hi,

I've tracked down the problem: it seems you changed the return value in generalized_rcnn.py to a tuple, so in do_train I need to take the first element of loss_dict to compute the loss.
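The workaround described here can be sketched as follows, assuming (as reported in this thread) that the first element of the returned tuple is the loss dictionary. The helper name `sum_losses` and the example loss keys are illustrative only; what the remaining tuple elements contain is not stated in the thread.

```python
from typing import Any


def sum_losses(model_output: Any) -> Any:
    """Sum the loss values whether the model returns a dict or a tuple.

    Assumption: when a tuple is returned, its first element is the dict
    mapping loss names to loss values.
    """
    loss_dict = model_output[0] if isinstance(model_output, tuple) else model_output
    return sum(loss for loss in loss_dict.values())


# Illustrative values; in training these would be tensors rather than floats.
losses = {"loss_classifier": 0.5, "loss_box_reg": 0.25}
print(sum_losses(losses))            # 0.75
print(sum_losses((losses, "extra")))  # 0.75
```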

Another question: after running python tools/train_first_step.py --config-file ./configs/e2e_faster_rcnn_R_50_C4_1x.yaml, I hit another problem during training: some labels are 20, which is out of range for the second dimension of class_logits (size 20). How does this happen? Can you check the code you pushed to make sure it runs successfully on your server? Thanks!

-> classification_loss = F.cross_entropy(class_logits, labels)
>>> class_logits.shape
torch.Size([512, 20])
>>> labels
tensor([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 20,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0, 20,  0, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 20,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0, 20,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 20,  0,
         0,  0,  0,  0, 12,  0,  0, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0, 20, 12], device='cuda:0')

CanPeng123 commented 3 years ago

Have you modified the e2e_faster_rcnn_R_50_C4_1x.yaml properly?

wassryan commented 3 years ago

Of course, this is my yaml:

MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "catalog://ImageNetPretrained/MSRA/R-50"
  BACKBONE:
    CONV_BODY: "R-50-C4"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 1024
  RPN:
    USE_FPN: False
    ANCHOR_STRIDE: (16,)
    PRE_NMS_TOP_N_TRAIN: 12000
    PRE_NMS_TOP_N_TEST: 6000
    POST_NMS_TOP_N_TRAIN: 2000
    POST_NMS_TOP_N_TEST: 1000
    EXTERNAL_PROPOSAL: False
  ROI_HEADS:
    USE_FPN: False
  ROI_BOX_HEAD:
    POOLER_RESOLUTION: 7
    POOLER_SCALES: (0.0625,)
    POOLER_SAMPLING_RATIO: 2
    FEATURE_EXTRACTOR: "ResNet50Conv5ROIFeatureExtractor"
    PREDICTOR: "FastRCNNPredictor"
    NUM_CLASSES: 20 # total classes : 19 + 1
    NAME_OLD_CLASSES: []
    NAME_NEW_CLASSES: ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog",
                       "horse", "motorbike", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
    NAME_EXCLUDED_CLASSES: ["person"]
DATASETS:
  TRAIN: ("voc_2007_train", "voc_2007_val")
  TEST: ("voc_2007_test",)
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.001 # start learning rate
  WEIGHT_DECAY: 0.0001
  GAMMA: 0.1  # learning rate decay
  STEPS: (30000,)
  MAX_ITER: 40000 # number of iteration
  CHECKPOINT_PERIOD: 2500 # number of iteration to generate check point
  IMS_PER_BATCH: 1 # number of images per batch
  MOMENTUM: 0.9
TEST: # testing strategy
  IMS_PER_BATCH: 1 # number of images per batch
OUTPUT_DIR: "/home/xx/Faster-ILOD/incremental_learning_ResNet50_C4/RPN_19_classes_40k_steps_no_person" # path to store the result
TENSORBOARD_DIR: "/home/xx/Faster-ILOD/incremental_learning_ResNet50_C4/RPN_19_classes_40k_steps_no_person/tensorboard" # path to store tensorboard info

Anything wrong?

CanPeng123 commented 3 years ago

If you want to run experiments with a non-alphabetical class order, you need to modify the order of CLASSES in the dataset files. "tvmonitor" is the 20th category in alphabetical order. Other people have run the alphabetical-order experiments successfully.
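A minimal sketch of why the label 20 appears, assuming the dataset assigns each class its position in a CLASSES tuple with background at index 0. The tuple below lists the 20 Pascal VOC classes alphabetically; `class_to_index` and `move_to_end` are hypothetical helpers, and moving the excluded class to the end is only one reading of "modify the order of CLASSES", not necessarily what the repository expects.

```python
VOC_ALPHABETICAL = (
    "__background__",
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
)


def class_to_index(classes):
    """Map each class name to its position in the sequence."""
    return {name: i for i, name in enumerate(classes)}


def move_to_end(classes, excluded):
    """Keep the original order but place the excluded classes last."""
    kept = tuple(c for c in classes if c not in excluded)
    dropped = tuple(c for c in classes if c in excluded)
    return kept + dropped


num_classes = 20  # 19 foreground classes + background, as in the posted yaml

alphabetical = class_to_index(VOC_ALPHABETICAL)
print(alphabetical["dog"], alphabetical["person"], alphabetical["tvmonitor"])  # 12 15 20
print(alphabetical["tvmonitor"] < num_classes)  # False

reordered = class_to_index(move_to_end(VOC_ALPHABETICAL, {"person"}))
print(reordered["tvmonitor"], reordered["person"])  # 19 20
```

Under this assumption, excluding "person" in the config does not renumber the later classes, so "tvmonitor" keeps index 20 while a 20-way classifier only accepts indices 0 to 19. The same reasoning would explain the label 15 ("person") seen with 11 logits in the first post.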