Closed: XiongweiWu closed this issue 2 years ago
Hi, when I run your demo script,
./tools/dist_test.sh configs/lvis/cascade_mask_rcnn_r50_fpn_sample1e-3_mstrain_20e_lvis_v1_pretrain_ens.py data/models/epoch_20.pth 8 --eval bbox segm --cfg-options model.roi_head.prompt_path=data/prompt/iou_neg5_ens.pth model.roi_head.load_feature=False
the following error is reported:
FileNotFoundError: [Errno 2] No such file or directory: 'data/lvis_v1/proposals/rpn_r101_fpn_lvis_val.pkl'
I checked the files you provided, and it seems the precomputed proposals for the LVIS validation set are not included. Can you help check the files, or check whether the script is correct?
Sorry, I forgot to upload it to Google Drive, though it is included in the Baidu Yun link. I will update it later.
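Once the file is uploaded, a quick sanity check (illustrative only; mmcv.load handles .pkl files) that the precomputed proposals are in place and readable:

# Illustrative check, assuming the default path from the error message above.
import mmcv

proposals = mmcv.load('data/lvis_v1/proposals/rpn_r101_fpn_lvis_val.pkl')
print(type(proposals), len(proposals))  # expect one proposal array per LVIS v1 validation image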
@dyabel Thanks for the quick reply. It seems that your 4-character Baidu extraction code is also incorrect.
It is correct; I have just tried it. Please check again, or get back to me if you still have a problem.
@dyabel I just tried the Baidu Pan link again and it works now. It's a bit strange that I could not access it yesterday... Anyway, thanks for your update!
@dyabel A quick update on the issues I hit when running your demo scripts.
./tools/dist_test.sh configs/lvis/cascade_mask_rcnn_r50_fpn_sample1e-3_mstrain_20e_lvis_v1_pretrain_ens.py data/models/epoch_20.pth 8 --eval bbox segm --cfg-options model.roi_head.prompt_path=./data/prompt/iou_neg5_ens.pth model.roi_head.load_feature=False
the following error is reported:
File "./tools/test.py", line 212, in
main()
File "./tools/test.py", line 189, in main
args.gpu_collect)
File "/opt/xwwu/wkplace/xwwu-vtdet/detpro/mmdet/apis/test.py", line 95, in multi_gpu_test
for i, data in enumerate(data_loader):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 355, in iter
return self._get_iterator()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 301, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 914, in init
w.start()
File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle _thread.RLock objects
When I run the script:
bash prepare.sh
the following error is reported:
OSError: current_mmdetection_Head.pth is not a checkpoint file
It seems the settings in the config file and the inference scheme are inconsistent.
I have not read your code comprehensively; I just ran your provided scripts as-is. Could you check these issues?
My environment: Torch 1.8.1, mmcv-full 1.2.5, CUDA 10.2, GPU: 32 GB V-100, Python 3.6.9
The first problem is due to Python 3.6; I have met the same issue before. There is no problem with Python 3.8, which I have tested. For the second problem, you should download the corresponding file and put it under the project root directory. I have provided the download link; please check the README.
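For anyone who has to stay on Python 3.6, a possible workaround (untested here, and not a substitute for the Python 3.8 fix above) is to disable DataLoader worker processes so nothing needs to be pickled when workers are spawned, e.g. by setting data.workers_per_gpu to 0 in the test config or via --cfg-options data.workers_per_gpu=0. An illustrative mmdet-style config fragment:

# Assumed config override, for illustration only: with 0 workers, data loading
# stays in the main process and no worker subprocess has to be spawned/pickled.
data = dict(
    samples_per_gpu=1,   # images per GPU at test time (value here is illustrative)
    workers_per_gpu=0,   # 0 avoids the "can't pickle _thread.RLock objects" error on Python 3.6
)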
@dyabel Thanks for the quick reply. I will double-check it based on your feedback.
Sorry for the inconvenience. There are still many potential problems; I will run the whole process myself again in the next one or two weeks.
@dyabel Personally speaking, your codebase can really benefit researchers in this area (since there are few open-vocabulary detection projects based on PyTorch). Maybe you can update the code whenever you make progress.
Hi, I have downloaded the prompt, the checkpoint, and the SoCo pretrained weights and placed them in the right paths, but I get very low inference results. The logs are below; could you help find what is wrong?
The bash command:
MASTER_PORT=71999 GPUS=8 GPUS_PER_NODE=8 ./tools/slurm_test.sh partition_V100 my_task configs/lvis/detpro_ens_20e.py /mnt/lustre/hemengzhe/pretrained_model/epoch_20.pth --eval bbox segm --cfg-options model.roi_head.prompt_path=/mnt/lustre/hemengzhe/prompt/iou_neg5_ens.pth model.roi_head.load_feature=False
/mnt/cache/hemengzhe/CLIP/clip/clip.py:24: UserWarning: PyTorch version 1.7.1 or higher is recommended
warnings.warn("PyTorch version 1.7.1 or higher is recommended")
num_classes: 1203
load_feature False
use_clip_inference False
prompt 2
fixed_lambda None
prompt path /mnt/lustre/hemengzhe/prompt/iou_neg5_ens.pth
ensemble:True
load: /mnt/lustre/hemengzhe/prompt/iou_neg5_ens.pth
text embedding finished, 0.07587862014770508 passed
torch.Size([1203, 512])
(the block above is printed once by each of the 8 GPU processes; the repeated copies are omitted)
The model and loaded state dict do not match exactly
unexpected key in source state_dict: roi_head.clip_model.input_resolution, roi_head.clip_model.context_length, roi_head.clip_model.vocab_size
[>>>>>>>>>>>>>>>>>>>>>>>>        ] 160/200, 3.7 task/s, elapsed: 43s, ETA: 11s
Evaluating bbox...
Average Precision (AP) @[ IoU=0.50:0.95 | area=all | maxDets=300 catIds=all] = 0.004
Average Precision (AP) @[ IoU=0.50      | area=all | maxDets=300 catIds=all] = 0.005
Average Precision (AP) @[ IoU=0.75      | area=all | maxDets=300 catIds=all] = 0.004
Average Precision (AP) @[ IoU=0.50:0.95 | area=  s | maxDets=300 catIds=all] = 0.004
Average Precision (AP) @[ IoU=0.50:0.95 | area=  m | maxDets=300 catIds=all] = 0.004
Average Precision (AP) @[ IoU=0.50:0.95 | area=  l | maxDets=300 catIds=all] = 0.003
Average Precision (AP) @[ IoU=0.50:0.95 | area=all | maxDets=300 catIds=  r] = 0.003
Average Precision (AP) @[ IoU=0.50:0.95 | area=all | maxDets=300 catIds=  c] = 0.003
Average Precision (AP) @[ IoU=0.50:0.95 | area=all | maxDets=300 catIds=  f] = 0.005
Average Recall    (AR) @[ IoU=0.50:0.95 | area=all | maxDets=300 catIds=all] = 0.003
Average Recall    (AR) @[ IoU=0.50:0.95 | area=  s | maxDets=300 catIds=all] = 0.003
Average Recall    (AR) @[ IoU=0.50:0.95 | area=  m | maxDets=300 catIds=all] = 0.004
Average Recall    (AR) @[ IoU=0.50:0.95 | area=  l | maxDets=300 catIds=all] = 0.003
Just comment out L754 in mmdet/datasets/lvis.py and it will be OK. The number of validation images is supposed to be 19,809; 200 is just for debugging.
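For context, the debug-only truncation being referred to is of this general form (illustrative; the exact code at that line in mmdet/datasets/lvis.py may differ):

# Hypothetical sketch of the debug cap in the LVIS dataset class:
# self.img_ids = self.img_ids[:200]   # evaluate only the first 200 images (debug)
# Commenting it out evaluates the full 19,809-image LVIS v1 validation set.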
@dyabel Hi, I just started the DetPro training (bash detpro.sh), and I found a bug to report.
In your dataset preparation step, only the positive samples with IoU >= 0.5 against the ground truth are selected (per the LVIS dataset the background label is -1, and I also checked the train_data.pth generated by prepare.sh; its minimum IoU is 0.5):
https://github.com/dyabel/detpro/blob/0f7486d0ddeb22798e276f41128abf6c0d9cf817/configs/_base_/models/mask_rcnn_r50_fpn.py#L101
https://github.com/dyabel/detpro/blob/dd1508f31ffa8d359f6400bf91c9df50123d5281/mmdet/models/roi_heads/standard_roi_head_collect.py#L435
However, in your DetPro training step, negative samples with IoU < 0.5 are required to train the model. Since no negative samples were prepared, this leads to a runtime error:
https://github.com/dyabel/detpro/blob/dd1508f31ffa8d359f6400bf91c9df50123d5281/prompt/run.py#L165
Can you check and fix it?
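To make the mismatch concrete, a minimal sketch (illustrative, not the repository's code; the 0.5 threshold follows the config linked above) of the positive set that prepare.sh currently stores versus the negative set that prompt training additionally expects:

import torch

# ious: hypothetical (num_proposals, num_gt) IoU matrix between class-agnostic
# proposals and ground-truth boxes for one image
ious = torch.rand(100, 5)
max_iou, _ = ious.max(dim=1)
pos_idx = (max_iou >= 0.5).nonzero(as_tuple=True)[0]  # positives: what prepare.sh keeps
neg_idx = (max_iou < 0.5).nonzero(as_tuple=True)[0]   # negatives: also needed by prompt/run.py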
Thank you for pointing it out; I have just updated it, please check. I got my code mixed up: I previously ran experiments on the MSRA server, and that code is a little different from my local copy.
@dyabel Thanks for the quick reply; I will retrain the model. BTW, as pointed out by other researchers, the config file of your reimplemented ViLD is the same as DetPro's. How can we obtain the pure ViLD baseline from your codebase?
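While waiting for an answer, one plausible route (an assumption on my part, not confirmed by the authors) is that the detector itself is shared and only the class text embeddings differ, so pointing model.roi_head.prompt_path at embeddings built from hand-crafted ViLD-style prompt templates, instead of the learned DetPro prompts, would approximate the ViLD baseline. A sketch of building such embeddings with the frozen CLIP text encoder (class names, templates, and the output filename are placeholders, and whether the repo expects exactly this tensor format is an assumption):

import torch
import clip

model, _ = clip.load("ViT-B/32", device="cuda")
templates = ["a photo of a {}.", "a photo of a small {}.", "a photo of a large {}."]
class_names = ["person", "bicycle"]  # placeholder; use the 1203 LVIS v1 category names

@torch.no_grad()
def build_text_embeddings(names):
    embs = []
    for name in names:
        tokens = clip.tokenize([t.format(name) for t in templates]).to("cuda")
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # normalize each template embedding
        embs.append(feats.mean(dim=0))                    # prompt ensembling: average over templates
    embs = torch.stack(embs)
    return embs / embs.norm(dim=-1, keepdim=True)         # renormalize the per-class embeddings

torch.save(build_text_embeddings(class_names).cpu(), "handcrafted_vild_prompt.pth")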
@dyabel
I have just finished the whole training pipeline, and I hit several serious problems that need your help.
- I fail to reproduce the results.
I follow the most up-to-date scripts and train the model (env: torch 1.7.0, CUDA 10.2, Python 3.8.5), but fail to reproduce the results. The reported and reproduced numbers are listed below:
Report: AP:0.284 APr:0.208 APc:0.278 APf:0.324; Reproduce: AP:0.192 APr:0.110 APc:0.185 APf:0.235
bbox_mAP_copypaste: AP:0.192 AP50:0.312 AP75:0.197 APs:0.136 APm:0.258 APl:0.309 APr:0.110 APc:0.185 APf:0.235, segm_mAP_copypaste: AP:0.178 AP50:0.282 AP75:0.184 APs:0.122 APm:0.246 APl:0.295 APr:0.108 APc:0.175 APf:0.213
The only thing I changed is in your DetPro training stage:
https://github.com/dyabel/detpro/blob/0f7486d0ddeb22798e276f41128abf6c0d9cf817/prompt/run.py#L202
I use the base classes as val_db2, since no novel classes are prepared here and this otherwise leads to a runtime error. val_db2 only serves as the evaluation set, so I think this modification should not affect the final training results.
- The training cost is extremely high, and to my surprise, the speed is almost the same on a V-100 (0.802 s per iteration) and an A-100 (0.732 s per iteration) during the ViLD training stage, even though the A-100 should be much more powerful than the V-100. The whole training time on my side is roughly: prepare.sh: 48 hours, detpro.sh: 12 hours, vild.sh: 48 hours + 30 hours.
- I do not fully understand the motivation for extracting the CLIP image embeddings before the ViLD training:
https://github.com/dyabel/detpro/blob/0f7486d0ddeb22798e276f41128abf6c0d9cf817/vild_detpro.sh#L2
In the preparation stage we have already extracted the proposal features (though only proposals with IoU > 0.1 against the GT are kept). This step is the most time-consuming part, and I am a bit confused here.
Maybe you can try running with the text embeddings I provide first. The training cost is mainly attributed to the CLIP forward pass; that is why we extract the CLIP image embeddings before training and load them during training (so the CLIP forward pass only needs to be done once per image). The whole extraction process takes about one day, and the training process takes less than two days with 8 V100s and a batch size of 16. Btw, the zip process may also take several hours.
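For reference, a minimal sketch (not the repository's actual code) of the caching idea described above: run the frozen CLIP image encoder once per proposal crop and save the embeddings, so training only reads them from disk. The box format, paths, and function name are assumptions:

import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cuda")
model.eval()

@torch.no_grad()
def cache_proposal_embeddings(image_path, boxes, out_path):
    # boxes: list of (x1, y1, x2, y2) proposal coordinates in pixels (assumed format)
    img = Image.open(image_path).convert("RGB")
    crops = torch.stack([preprocess(img.crop(tuple(b))) for b in boxes]).cuda()
    feats = model.encode_image(crops)                 # (N, 512) for ViT-B/32
    feats = feats / feats.norm(dim=-1, keepdim=True)  # CLIP features are used L2-normalized
    torch.save(feats.cpu(), out_path)                 # loaded during training instead of re-running CLIP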
@dyabel I think your current implementation is a bit different from what is intended: in prepare.sh, the proposals are first fed into the CLIP image encoder (stored in data/lvis_clip_image_proposal_embedding), while in vild_detpro.sh the proposals are fed through it again (stored in data/lvis_clip_image_embedding and then zipped). I have checked the code and this step seems duplicated.
BTW, can you tell me how to run with your text embeddings? Or maybe you can provide the script in your README.
I have listed the training time on my side with a DGX-A100; other researchers may also report their training times here:
prepare.sh: 40 hours (9.6 s per iteration, extracting proposal features)
detpro.sh: 12 hours (tuning DetPro; this part is fast)
vild_detpro.sh: 40 hours + 30 hours (9.6 s per iteration for proposal extraction, 0.75 s per iteration for ViLD training, 20 epochs)
In total, about 122 hours.
Yes, it is duplicated; I will merge these processes later. Just for reference, you can run with
./tools/dist_train.sh configs/lvis/detpro_ens_20e.py 8 --work-dir workdirs/vild_ens_20e_detpro --cfg-options model.roi_head.prompt_path=iou_neg5_ens.pth model.roi_head.load_feature=True
Please get lvis_clip_image_embedding.zip first and put it under data/. Make sure that current_mmdetection_Head.pth is also under data/; I have changed the loading path. The link to iou_neg5_ens.pth is provided in the README.
@dyabel Thanks for the reply. Can you also provide lvis_clip_image_embedding.zip in the README? This step is extremely time-consuming, and it should be independent of the model training (it just uses the fixed CLIP model to encode the precomputed proposals).
Sorry, I have tried but failed: the file is too large (~170 GB), and I cannot find a way to upload it.
@dyabel Maybe you can try Baidu Pan (I remember Baidu Pan offers 10 TB of storage) and use a split (multi-volume) archive to divide the original zip file into about 10 parts.
The results can now be reproduced (though the code can still be further optimized).
@XiongweiWu @dyabel Can you provide the log of detpro.sh? The loss did not drop (it stayed around 4.8-6) when I was training detpro.sh (fg_bg_5_5_6_end), and the validation results were also poor.
fg_bg_5_5_6_end epoch 1 result:
train acc: top1=0.10497259697366079 top5=0.20694614266118638 total=5343023176689148
test acc: top1=0.1861130838568697 top5=0.3810579987253027 total=274575
avg_score: 0.08943805910270418 avg_var: 2.8050976742554544e-05 entropy: 5.762784303013748
test acc: top1=0.17486752460257382 top5=0.3822861468584406 total=1321
avg_score: 0.09404259103432829 avg_var: 3.453896837104549e-05 entropy: 5.562931727857683
test neg(thr=0.5): pos=13813 total=1052140
avg_score: 0.07717330125981334 avg_var: 2.1422017422132475e-05 entropy: 5.895488243009486
test neg(thr=0.9): pos=508 total=1052140
avg_score: 0.07717327898378543 avg_var: 2.1422008357993516e-05 entropy: 5.895485866899842
@felixfuu I have not recorded a log file for the current version of the implementation, but I remember the average accuracy was much higher than this number (7.7%), and you should see the accuracy improve during training.
epoch 1 or final results? 7.7% is my epoch 1 result.
@felixfuu I just ran a quick experiment: the initial score is ~7.1%, and after one epoch of training the accuracy reaches 11.3%. I used the old code, which can reproduce the results on my side, so the log format may be slightly different.
Detailed information:
Init
train info: 2335610 0 2311910
val info: 1249969 0 4519547
test acc: top1=0.15116134880145027 top5=0.31875590514644764 total=1249969
avg_score: 0.0709989607742272 avg_var: 1.8582240319540393e-05 entropy: 6.228120057377423
test acc: top1=0.15116134880145027 top5=0.31875590514644764 total=1249969
avg_score: 0.07099902952593225 avg_var: 1.858224947504089e-05 entropy: 6.228124857496466
epoch1
train mode : soft
embbeding shape : torch.Size([866, 512])
train acc: top1=0.14238776809997591 top5=0.2667306434399422 total=4647520 time=15.421600341796875
test acc: top1=0.24820055537377328 top5=0.4790166796136544 total=1249969
avg_score: 0.11310364247033326 avg_var: 3.6485908512640485e-05 entropy: 5.579813179366848
test acc: top1=0.24820055537377328 top5=0.4790166796136544 total=1249969
avg_score: 0.11310375497312333 avg_var: 3.64858963053065e-05 entropy: 5.579811579327167
thx~
@XiongweiWu Could you please provide the final result(epoch 6)?