Closed wassryan closed 3 years ago
Hi
For the first step, you also need to set the following fields in the config file: WEIGHT, NUM_CLASSES, NAME_OLD_CLASSES, NAME_NEW_CLASSES, NAME_EXCLUDED_CLASSES.
I normally use e2e_faster_rcnn_R_50_C4_1x.yaml for the first step and e2e_faster_rcnn_R_50_C4_1x_Source_model.yaml to load the source model in the subsequent incremental steps.
You can use e2e_faster_rcnn_R_50_C4_1x.yaml as a starting point for the first step.
Hi, thanks for your reply. I followed the step you mentioned:
python tools/train_first_step.py --config-file ./configs/e2e_faster_rcnn_R_50_C4_1x.yaml
and I encountered another error:
Traceback (most recent call last):
  File "tools/train_first_step.py", line 233, in <module>
    main()
  File "tools/train_first_step.py", line 225, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_first_step.py", line 105, in train
    arguments,
  File "/home/xx/Faster-ILOD/maskrcnn_benchmark/engine/trainer.py", line 72, in do_train
    losses = sum(loss for loss in loss_dict.values())
AttributeError: 'tuple' object has no attribute 'values'
What's wrong with the code? Have you encountered the same error before?
P.S. I compiled the maskrcnn_benchmark in your repo using the setup.py from the official maskrcnn_benchmark.
Here is my environment:
pytorch 1.1.0 py3.7_cuda9.0.176_cudnn7.5.1_0
pytorch-nightly 1.0.0
torch 1.1.0
torchvision 0.3.0
Hi,
I've located the problem: it seems you changed the return value in generalized_rcnn.py to a tuple, so in do_train I need to take the first element of loss_dict before computing the loss.
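For anyone hitting the same AttributeError, here is a minimal sketch of the workaround described above. It assumes that when the model returns a tuple, the loss dict is its first element (that is what I observed, not something confirmed by the authors). The helper name total_loss is made up for illustration, and plain floats stand in for the loss tensors:

```python
def total_loss(model_output):
    """Sum all loss terms from a model output that may be a dict or a tuple.

    Assumption: when the output is a tuple, its first element is the loss dict.
    """
    loss_dict = model_output[0] if isinstance(model_output, tuple) else model_output
    return sum(loss for loss in loss_dict.values())


# Plain floats stand in for loss tensors; the second tuple element is a placeholder.
example_output = ({"loss_classifier": 0.5, "loss_box_reg": 0.25}, None)
print(total_loss(example_output))
```

The same expression works on tensors, since sum() only relies on addition.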
Another question:
After running the command
python tools/train_first_step.py --config-file ./configs/e2e_faster_rcnn_R_50_C4_1x.yaml
I hit another problem during training: some labels are 20, which exceeds the second dimension of class_logits. How can this happen? Can you check the code you pushed to make sure that it runs successfully on your server? Thanks!
-> classification_loss = F.cross_entropy(class_logits, labels)
>>> class_logits.shape
torch.Size([512, 20])
>>> labels
tensor([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 20, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 20, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0,
0, 0, 0, 0, 12, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 20, 12], device='cuda:0')
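For reference, F.cross_entropy expects every class-index target to lie in [0, C-1], where C is the second dimension of the logits, so with 20 logits a label of 20 is out of range. A small sketch for spotting such labels before the loss call (the helper name out_of_range_labels is hypothetical, and a plain list stands in for the tensor; with a real tensor you could pass labels.tolist()):

```python
def out_of_range_labels(labels, num_classes):
    """Return the sorted distinct labels that fall outside [0, num_classes - 1]."""
    return sorted({int(x) for x in labels if not 0 <= int(x) < num_classes})


# A short stand-in for the labels tensor printed above.
sample_labels = [0, 0, 20, 0, 12, 20]
print(out_of_range_labels(sample_labels, 20))
```

Here 12 is a valid index for 20 logits, while 20 is not.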
Have you modified e2e_faster_rcnn_R_50_C4_1x.yaml properly?
Of course, this is my yaml:
MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "catalog://ImageNetPretrained/MSRA/R-50"
  BACKBONE:
    CONV_BODY: "R-50-C4"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 1024
  RPN:
    USE_FPN: False
    ANCHOR_STRIDE: (16,)
    PRE_NMS_TOP_N_TRAIN: 12000
    PRE_NMS_TOP_N_TEST: 6000
    POST_NMS_TOP_N_TRAIN: 2000
    POST_NMS_TOP_N_TEST: 1000
    EXTERNAL_PROPOSAL: False
  ROI_HEADS:
    USE_FPN: False
  ROI_BOX_HEAD:
    POOLER_RESOLUTION: 7
    POOLER_SCALES: (0.0625,)
    POOLER_SAMPLING_RATIO: 2
    FEATURE_EXTRACTOR: "ResNet50Conv5ROIFeatureExtractor"
    PREDICTOR: "FastRCNNPredictor"
    NUM_CLASSES: 20 # total classes : 19 + 1
    NAME_OLD_CLASSES: []
    NAME_NEW_CLASSES: ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog",
                       "horse", "motorbike", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
    NAME_EXCLUDED_CLASSES: ["person"]
DATASETS:
  TRAIN: ("voc_2007_train", "voc_2007_val")
  TEST: ("voc_2007_test",)
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.001 # start learning rate
  WEIGHT_DECAY: 0.0001
  GAMMA: 0.1 # learning rate decay
  STEPS: (30000,)
  MAX_ITER: 40000 # number of iteration
  CHECKPOINT_PERIOD: 2500 # number of iteration to generate check point
  IMS_PER_BATCH: 1 # number of images per batch
  MOMENTUM: 0.9
TEST: # testing strategy
  IMS_PER_BATCH: 1 # number of images per batch
OUTPUT_DIR: "/home/xx/Faster-ILOD/incremental_learning_ResNet50_C4/RPN_19_classes_40k_steps_no_person" # path to store the result
TENSORBOARD_DIR: "/home/xx/Faster-ILOD/incremental_learning_ResNet50_C4/RPN_19_classes_40k_steps_no_person/tensorboard" # path to store tensorboard info
Is anything wrong with it?
If you want to run experiments in a non-alphabetical class order, you need to change the order of CLASSES in the dataset files. In alphabetical order, "tvmonitor" is the 20th category, so it still gets label 20 when "person" is excluded, and that is out of range for NUM_CLASSES: 20. Other people have run the alphabetical-order experiments successfully.
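To illustrate the point about class order, here is a rough sketch. It assumes labels are assigned by position in a CLASSES tuple with background at index 0; the variable names are made up for illustration and are not taken from the repo's dataset files:

```python
# Alphabetical VOC order, background first (assumed layout of CLASSES).
VOC_CLASSES = (
    "__background__", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor",
)
excluded = ["person"]
new_classes = [c for c in VOC_CLASSES[1:] if c not in excluded]
num_classes = len(new_classes) + 1  # 19 classes + background = 20

# Labels follow the position in CLASSES, so excluding a class does not shift them.
class_to_ind = {name: i for i, name in enumerate(VOC_CLASSES)}
too_large = {c: class_to_ind[c] for c in new_classes if class_to_ind[c] >= num_classes}
print(too_large)

# Moving the excluded classes to the end keeps every trained class below num_classes.
reordered = ("__background__",) + tuple(new_classes) + tuple(excluded)
print(max(reordered.index(c) for c in new_classes))
```

With the alphabetical order, only "tvmonitor" ends up with an index that the 20-way classifier cannot represent; after reordering, the largest index among the trained classes is 19.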
Hi, guys. There seems to be a problem with your code; did you check it before releasing it? For the first step, to obtain the teacher model, I ran the command
python tools/train_first_step.py --config-file ./configs/e2e_faster_rcnn_R_50_C4_1x_Source_model.yaml
after changing WEIGHT from
xx/incremental_learning_ResNet50_C4/RPN_first_10_classes_40k_steps/model_final.pth
to
catalog://ImageNetPretrained/MSRA/R-50
(I think that is correct if I want to run the code), and then I ran into a problem. I dug into the code and traced it to /xx/Faster-ILOD/maskrcnn_benchmark/modeling/roi_heads/box_head/loss.py, line 147:
classification_loss = F.cross_entropy(class_logits, labels)
I printed the labels and found some values larger than class_logits.size(1) (= 11); one even equals 15, which is out of range as a class index.
Can you help me clarify the problem? Thanks!