facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Low performance of Supervised training #563

Open sidgan opened 1 year ago

sidgan commented 1 year ago

Expected behavior:


Problem statement: We are using pretrained ImageNet model weights to perform supervised training on our own dataset, which consists of ~60,000 training images and ~14,000 test images across a total of 1139 classes. I changed the MLP head in the YAML file to output 1139 classes.

Expected: stable training.

What's happening? Training accuracy increases too quickly, reaching almost 90% in ~170 epochs, but test accuracy does not improve at all and stays close to 0 for most of the run (see the attached log (3).txt). When we perform supervised training on the same data in plain PyTorch, we reach 70% accuracy.

Any insights on why this might be happening? Suggestions on how to use the VISSL pipelines effectively would be appreciated.
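Both splits are read with `disk_folder` data and label sources, so one thing that may be worth ruling out when test accuracy sits near 0 is a class-index mismatch between the train and test folders. The sketch below assumes an ImageFolder-style layout (one subdirectory per class, with integer labels assigned in sorted order of the class names, as torchvision's `ImageFolder` does); that VISSL's `disk_folder` source assigns labels exactly this way is an assumption here. The helpers `class_names` and `compare_splits` are hypothetical.

```python
import os
import tempfile


def class_names(root):
    """Return the sorted subdirectory names under root (the ImageFolder-style class list)."""
    return sorted(entry.name for entry in os.scandir(root) if entry.is_dir())


def compare_splits(train_root, test_root):
    """Compare the class folders of two splits.

    Returns (same_mapping, only_in_train, only_in_test). same_mapping is True
    only when both splits expose identical sorted class lists, i.e. every
    class name would get the same integer label in both splits.
    """
    train = class_names(train_root)
    test = class_names(test_root)
    only_in_train = sorted(set(train) - set(test))
    only_in_test = sorted(set(test) - set(train))
    return train == test, only_in_train, only_in_test


# Hypothetical usage with the paths from the command:
# same, only_train, only_test = compare_splits("/home/images/train", "/home/images/test")
```

If a class folder exists in only one split, every class name sorted after it receives a shifted label in the other split, so predictions can be correct "by name" yet scored as wrong.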

Command:

python3 tools/run_distributed_engines.py \
    hydra.verbose=true \
    config=benchmark/fulltune/imagenet1k/train.yaml \
    config.DATA.TRAIN.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
    config.DATA.TRAIN.LABEL_SOURCES=[disk_folder] \
    config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=16 \
    config.DATA.TRAIN.DATA_PATHS=["/home/images/train"] \
    config.DATA.TEST.DATA_SOURCES=[disk_folder] \
    config.DATA.TEST.LABEL_SOURCES=[disk_folder] \
    config.DATA.TEST.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TEST.BATCHSIZE_PER_REPLICA=16 \
    config.DATA.TEST.DATA_PATHS=["/home/images/test"] \
    config.OPTIMIZER.num_epochs=250 \
    config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001] \
    config.OPTIMIZER.param_schedulers.lr.milestones=[1] \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
    config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true \
    config.HOOKS.MEMORY_SUMMARY.PRINT_MEMORY_SUMMARY=false \
    config.CHECKPOINT.DIR="/home/new_exp/checkpoint_supervised_2" \
    config.MODEL.WEIGHTS_INIT.PARAMS_FILE="/home/resnet50-19c8e357.pth" \
    config.MODEL.WEIGHTS_INIT.APPEND_PREFIX="trunk._feature_blocks." \
    config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME=""
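The two learning-rate overrides in the command are easiest to read as a step function of the epoch. A minimal sketch, assuming multistep semantics in which `values[i]` applies until the epoch reaches `milestones[i]` (milestones counted in epochs) and the last value applies afterwards; this mirrors the intent of the `param_schedulers.lr` keys but is not VISSL's scheduler code, and `multistep_lr` is a hypothetical helper.

```python
import bisect


def multistep_lr(epoch, values, milestones):
    """Learning rate for a 0-indexed epoch under a multistep schedule.

    values[i] is used until the epoch reaches milestones[i]; after the last
    milestone the final value is used, so len(values) == len(milestones) + 1.
    """
    if len(values) != len(milestones) + 1:
        raise ValueError("expected len(values) == len(milestones) + 1")
    # bisect_right counts how many milestones have already been reached.
    return values[bisect.bisect_right(milestones, epoch)]


# Values taken from the command above.
values = [0.01, 0.001]
milestones = [1]
num_epochs = 250
schedule = [multistep_lr(e, values, milestones) for e in range(num_epochs)]
print(schedule[0], schedule[1], schedule[-1])  # 0.01 0.001 0.001
```

Under this reading, 0.01 is used for only the first epoch and 0.001 for the remaining 249 epochs.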

Environment:

Environment information, collected with the following command:

wget -nc -q https://github.com/facebookresearch/vissl/raw/main/vissl/utils/collect_env.py && python collect_env.py

sys.platform: linux
Python: 3.6.9 (default, Jun 29 2022, 11:45:57) [GCC 8.4.0]
numpy: 1.19.5
Pillow: 8.4.0
vissl: 0.1.6 @/home/vissl/vissl
GPU available: True
GPU 0: Quadro GV100
CUDA_HOME: /usr
torchvision: 0.9.0+cu101 @/home/.local/lib/python3.6/site-packages/torchvision
hydra: 1.0.7 @/home/.local/lib/python3.6/site-packages/hydra
classy_vision: 0.7.0.dev @/home/.local/lib/python3.6/site-packages/classy_vision
tensorboard: 2.9.1
apex: 0.1 @/home/.local/lib/python3.6/site-packages/apex
cv2: 4.6.0
PyTorch: 1.8.0+cu101 @/home/.local/lib/python3.6/site-packages/torch
PyTorch debug build: False


PyTorch built with:

CPU info:


Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
Stepping: 4
CPU MHz: 3999.959
CPU max MHz: 4000.0000
CPU min MHz: 1200.0000
BogoMIPS: 6999.82
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 8448K
NUMA node0 CPU(s): 0-11