C-SJK opened this issue 3 years ago
I got the same error here. Running on an NVIDIA container (nvcr.io/nvidia/mxnet:20.12-py3), CUDA 11.1, RTX3090, MXNet 1.8.0 rc:
Command:
CUDA_VISIBLE_DEVICES='0' python -u train.py --network r100 --loss arcface --dataset emore
Config info / error messages:
Called with argument: Namespace(batch_size=128, ckpt=3, ctx_num=1, dataset='emore', frequent=20, image_channel=3, kvstore='device', loss='arcface', lr=0.1, lr_steps='100000,160000,220000', models_root='./models', mom=0.9, network='r100', per_batch_size=128, pretrained='', pretrained_epoch=1, rescale_threshold=0, verbose=2000, wd=0.0005)
{'bn_mom': 0.9, 'workspace': 256, 'emb_size': 512, 'ckpt_embedding': True, 'net_se': 0, 'net_act': 'prelu', 'net_unit': 3, 'net_input': 1, 'net_blocks': [1, 4, 6, 2], 'net_output': 'E', 'net_multiplier': 1.0, 'val_targets': ['lfw', 'cfp_fp', 'agedb_30'], 'ce_loss': True, 'fc7_lr_mult': 1.0, 'fc7_wd_mult': 1.0, 'fc7_no_bias': False, 'max_steps': 0, 'data_rand_mirror': True, 'data_cutoff': False, 'data_color': 0, 'data_images_filter': 0, 'count_flops': True, 'memonger': False, 'loss_name': 'margin_softmax', 'loss_s': 64.0, 'loss_m1': 1.0, 'loss_m2': 0.5, 'loss_m3': 0.0, 'net_name': 'fresnet', 'num_layers': 100, 'dataset': 'emore', 'dataset_path': '../datasets/faces_emore', 'num_classes': 85742, 'image_shape': [112, 112, 3], 'loss': 'arcface', 'network': 'r100', 'num_workers': 1, 'batch_size': 128, 'per_batch_size': 128}
0 1 E 3 prelu False
Network FLOPs: 24.2G
INFO:root:loading recordio ../datasets/faces_emore/train.rec...
header0 label [5822654. 5908396.]
id2range 85742
5822653
rand_mirror True
[14:34:50] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver lfw
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
(14000, 3, 112, 112)
ver cfp_fp
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver agedb_30
lr_steps [100000, 160000, 220000]
call reset()
[14:35:20] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
/opt/mxnet/python/mxnet/module/base_module.py:504: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
self.init_optimizer(kvstore=kvstore, optimizer=optimizer,
Traceback (most recent call last):
File "train.py", line 483, in
EDIT: Got the same error with another version of the NVIDIA Docker image (20.08), with MXNet 1.6.0 and CUDA 11.0.
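For what it's worth, the rescale_grad UserWarning above is separate from the crash: Module.fit just notices that the optimizer train.py builds by hand does not use the 1.0/batch_size/num_workers value it would normally set (1/128 = 0.0078125 here). A minimal sketch of an optimizer created with that expected scaling, assuming the lr/momentum/wd values from the Namespace above:

import mxnet as mx

batch_size = 128   # per_batch_size * ctx_num from the Namespace above
num_workers = 1    # single machine, 'device' kvstore

# SGD with gradients rescaled by 1/batch_size/num_workers, the value the
# Module warning compares against; train.py creates its optimizer manually,
# so the two values can legitimately differ.
opt = mx.optimizer.SGD(learning_rate=0.1, momentum=0.9, wd=0.0005,
                       rescale_grad=1.0 / batch_size / num_workers)  # 0.0078125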
The same error here:
/opt/mxnet/python/mxnet/module/base_module.py:505: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.25 vs. 0.001953125). Is this intended?
optimizer_params=optimizer_params)
[12:45:41] ../src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:120: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Traceback (most recent call last):
File "train.py", line 484, in
Has anyone been able to resolve this?
When I was training, I encountered the following error. The MXNet version is 1.6.0. This is my config info:
Called with argument: Namespace(batch_size=128, ckpt=3, ctx_num=1, dataset='emore', frequent=20, image_channel=3, kvstore='device', loss='softmax', lr=0.1, lr_steps='100000,160000,220000', models_root='./models', mom=0.9, network='m1', per_batch_size=128, pretrained='', pretrained_epoch=1, rescale_threshold=0, verbose=2000, wd=0.0005) {'bn_mom': 0.9, 'workspace': 256, 'emb_size': 256, 'ckpt_embedding': True, 'net_se': 0, 'net_act': 'prelu', 'net_unit': 3, 'net_input': 1, 'net_blocks': [1, 4, 6, 2], 'net_output': 'GDC', 'net_multiplier': 1.0, 'val_targets': ['lfw'], 'ce_loss': True, 'fc7_lr_mult': 1.0, 'fc7_wd_mult': 1.0, 'fc7_no_bias': False, 'max_steps': 0, 'data_rand_mirror': True, 'data_cutoff': False, 'data_color': 0, 'data_images_filter': 0, 'count_flops': True, 'memonger': False, 'loss_name': 'softmax', 'net_name': 'fmobilenet', 'dataset': 'emore', 'dataset_path': '../datasets/faces_emore', 'num_classes': 85742, 'image_shape': [112, 112, 3], 'loss': 'softmax', 'network': 'm1', 'num_workers': 1, 'batch_size': 128, 'per_batch_size': 128}
The error info:
Traceback (most recent call last):
File "train.py", line 483, in <module>
main()
File "train.py", line 479, in main
train_net(args)
File "train.py", line 473, in train_net
epoch_end_callback=epoch_cb)
File "/opt/mxnet/python/mxnet/module/base_module.py", line 536, in fit
self.update_metric(eval_metric, data_batch.label, False, data_batch.pad)
File "/opt/mxnet/python/mxnet/module/module.py", line 777, in update_metric
self._exec_group.update_metric(eval_metric, labels, pre_sliced, label_pads)
File "/opt/mxnet/python/mxnet/module/executor_group.py", line 661, in update_metric
out.shape[0], islice_batch_size)
AssertionError: output length 1 not a multiple of slice batch_size 128
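For context, that assertion is raised by update_metric in executor_group.py, which expects every output of the symbol to have a leading batch dimension it can slice per sample. With ce_loss enabled (it is True in both configs above), the training symbol most likely exposes an extra scalar-shaped loss output of length 1, which cannot be split into 128-sample slices. A rough sketch of the check under that assumption (running it reproduces the same message):

import numpy as np

batch_size = 128                     # slice batch size on the single GPU
outputs = [np.zeros((128, 85742)),   # fc7 logits: one row per sample, slices fine
           np.zeros((1,))]           # scalar-style loss output (e.g. the ce_loss value)

for out in outputs:
    # Rough equivalent of the check near executor_group.py line 661: every
    # output length must be a multiple of the slice batch size, so the
    # length-1 loss output trips the AssertionError quoted above.
    assert out.shape[0] % batch_size == 0, \
        'output length %d not a multiple of slice batch_size %d' % (out.shape[0], batch_size)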