AruniRC / detectron-self-train

A PyTorch Detectron codebase for domain adaptation of object detectors.
MIT License

initial weight download #1

Closed liyunsheng13 closed 5 years ago

liyunsheng13 commented 5 years ago

When I try to train the model, I cannot find the file "/mnt/nfs/scratch1/pchakrabarty/bdd_recs/ped_models/bdd_peds.pth" that is used to initialize the model. Could you let me know what it is and where to download it?

PCJohn commented 5 years ago

This is the path to the baseline model. You can download it from this location: http://maxwell.cs.umass.edu/self-train/models/bdd_ped_models/bdd_baseline/bdd_peds.pth

The link is in the table in the "Models" section of the README.

liyunsheng13 commented 5 years ago

Do you mean the baseline model is used as the initialization for training the other models? For the baseline model itself, though, I cannot find any initialization model in the train script. Is that an issue, or is it handled somewhere else in the source code? I would expect the baseline to be initialized at least from a pretrained ResNet, but I cannot find where that happens. I trained the baseline model with your code for 70000 iterations and only got 10 mIoU, which is worse than the reported result (~15). Do you think this is caused by random initialization?

PCJohn commented 5 years ago

All other models are trained by fine-tuning the baseline model (starting from the bdd_peds.pth checkpoint). The baseline model is trained "from scratch", but it does use the pretrained ResNet initialization. You'll have to download this pretrained backbone model.

See the section "Download Pretrained Backbone Model" in INSTALL.md: https://github.com/AruniRC/detectron-self-train/blob/master/INSTALL.md

AruniRC commented 5 years ago

@liyunsheng13 it may be a good first step to make sure you have installed everything correctly and the inference demo is working: https://github.com/AruniRC/detectron-self-train#inference-demo

If the demo is working and giving you the expected detection output, then the training scripts should work properly. If there is any further confusion please let us know.

BTW, the line that loads the ImageNet-pretrained ResNet weights for training the BDD baseline is in the config YAML: https://github.com/AruniRC/detectron-self-train/blob/master/configs/baselines/bdd100k.yaml#L7

liyunsheng13 commented 5 years ago

Hi, when I use the train script "bdd_source_and_HP18k.sh", I find that NUM_GPU is 1. Is there a typo here? I thought it would be 4 or 8. If I use 1, I get an assertion error. Could you let me know how many GPUs you used, and the batch size per GPU, for both "bdd_source_and_HP18k.sh" and the baseline results? It seems that you used 8 GPUs with batch size = 1 for the baseline results, which left me a little confused.

AruniRC commented 5 years ago

Hi @liyunsheng13,

Detectron's train_net_step scales the learning rate and other settings based on (a) the number of GPUs available and (b) the NUM_GPU specified in the training config YAML.
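
As a rough illustration of this kind of scaling, here is a minimal sketch of the common linear scaling rule. It is not the repository's actual code: the helper `scale_schedule`, its parameter names, and the example values (base learning rate 0.01, 8 GPUs in the config, 1 image per GPU) are all assumptions made for illustration.

```python
def scale_schedule(base_lr, max_iter, config_num_gpus, ims_per_gpu, visible_gpus):
    """Hypothetical helper: linearly rescale LR and iteration count.

    This mirrors the general linear scaling rule, not train_net_step itself.
    """
    if visible_gpus < 1 or config_num_gpus < 1:
        raise ValueError("GPU counts must be at least 1")
    # Batch size the YAML settings were tuned for.
    original_batch = config_num_gpus * ims_per_gpu
    # Batch size actually used with the GPUs visible at run time.
    effective_batch = visible_gpus * ims_per_gpu
    ratio = effective_batch / original_batch
    # Smaller effective batch: lower learning rate, proportionally more iterations.
    scaled_lr = base_lr * ratio
    scaled_max_iter = int(round(max_iter / ratio))
    return scaled_lr, scaled_max_iter


# Assumed example: config written for 8 GPUs, run with 1 visible GPU.
lr, iters = scale_schedule(
    base_lr=0.01, max_iter=70000, config_num_gpus=8, ims_per_gpu=1, visible_gpus=1
)
print(lr, iters)
```

Under these assumed values, the learning rate shrinks by a factor of 8 and the iteration count grows by the same factor, which is why leaving the YAML untouched and changing only the visible GPUs can still give a consistent schedule.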

When we trained, we kept the YAML unchanged and set the number of GPUs at run time (this ensures that the learning rate scaling is handled correctly inside the code). On a cluster this is set with the Slurm option --gres=gpu:1, which makes 1 GPU visible. Similarly, on a local machine, we had to set CUDA_VISIBLE_DEVICES so that a single GPU is visible. If this solves your assertion error, let us know, and we will update the README accordingly.
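
For a local machine, restricting the visible GPUs at run time could look roughly like the sketch below. The helper `single_gpu_env` is hypothetical, and exposing device index 0 is an assumption; note that CUDA_VISIBLE_DEVICES takes device indices, not a GPU count.

```python
import os


def single_gpu_env(device_index=0, base_env=None):
    """Hypothetical helper: copy an environment and expose exactly one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # The variable lists device indices; "0" exposes only the first GPU.
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


env = single_gpu_env(device_index=0)
print(env["CUDA_VISIBLE_DEVICES"])

# The environment could then be handed to a launcher, for example:
# subprocess.run(["bash", "<path to training script>"], env=env, check=True)
```

The variable has to be set before the training process initializes CUDA, which is why the sketch builds the environment first and passes it to the launcher instead of changing it inside the training code.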

Also, the baseline BDD detector used a standard training pipeline, and we used 4 to 8 GPUs. For all other models (HP, HP-cons, etc.) we used a single GPU. Note: the config YAMLs are unchanged; only the run-time settings are changed.

I'll tag @PCJohn for any additional comments, and to confirm that 1 GPU was used when calling the training script on BDD.