Open nuist-xinyu opened 5 years ago
@Duankaiwen Thank you, bro.
Can I see your full log?
loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
No cache file found...
loading annotations into memory...
Done (t=25.19s)
creating index...
index created!
118287it [01:18, 1509.02it/s]
loading annotations into memory...
Done (t=20.50s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=18.08s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=20.29s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=23.57s)
creating index...
index created!
loading from cache file: cache/coco_minival2014.pkl
No cache file found...
loading annotations into memory...
Done (t=1.26s)
creating index...
index created!
5000it [00:03, 1478.28it/s]
loading annotations into memory...
Done (t=0.61s)
creating index...
index created!
system config...
{'batch_size': 48,
'cache_dir': 'cache',
'chunk_sizes': [6, 6, 6, 6, 6, 6, 6, 6],
'config_dir': 'config',
'data_dir': './data',
'data_rng': <mtrand.RandomState object at 0x7fac46fc3870>,
'dataset': 'MSCOCO',
'decay_rate': 10,
'display': 5,
'learning_rate': 0.00025,
'max_iter': 480000,
'nnet_rng': <mtrand.RandomState object at 0x7fac46fc38b8>,
'opt_algo': 'adam',
'prefetch_size': 6,
'pretrain': None,
'result_dir': 'results',
'sampling_function': 'kp_detection',
'snapshot': 5000,
'snapshot_name': 'CenterNet-104',
'stepsize': 450000,
'test_split': 'testdev',
'train_split': 'trainval',
'val_iter': 500,
'val_split': 'minival',
'weight_decay': False,
'weight_decay_rate': 1e-05,
'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5,
'border': 128,
'categories': 80,
'data_aug': True,
'gaussian_bump': True,
'gaussian_iou': 0.7,
'gaussian_radius': -1,
'input_size': [511, 511],
'kp_categories': 1,
'lighting': True,
'max_per_image': 100,
'merge_bbox': False,
'nms_algorithm': 'exp_soft_nms',
'nms_kernel': 3,
'nms_threshold': 0.5,
'output_sizes': [[128, 128]],
'rand_color': True,
'rand_crop': True,
'rand_pushes': False,
'rand_samples': False,
'rand_scale_max': 1.4,
'rand_scale_min': 0.6,
'rand_scale_step': 0.1,
'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
'special_crop': False,
'test_scales': [1],
'top_k': 70,
'weight_exp': 8}
len of db: 118287
start prefetching data...
start prefetching data...
shuffling indices...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-104
start prefetching data...
shuffling indices...
total parameters: 210062960
setting learning rate to: 0.00025
training start...
0%| | 0/480000 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 203, in
How many GPUs do you have?
I have put the val set into the training set as you said, but this error occurred. Thank you for helping me, @Duankaiwen.
16G
How many GPUs, not GPU memory
Sorry, sorry: 8 GB.
Only one, a 2070.
Modify 'batch_size' to 3 and 'chunk_sizes' to [3] in config/CenterNet-104.json. If out of memory, then modify 'batch_size' to 2 and 'chunk_sizes' to [2]
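The advice above can be turned into a quick sanity check before starting a run. This is a minimal sketch, not part of the repository: `check_chunk_config` is a hypothetical helper, and it assumes that each entry of 'chunk_sizes' is sent to one GPU and that the entries should add up to 'batch_size', which is how the suggested values read.

```python
def check_chunk_config(batch_size, chunk_sizes, num_gpus):
    """Return a list of problems with a batch_size / chunk_sizes pair.

    Hypothetical helper. Assumes one chunk per GPU and that the chunks
    add up to the batch size.
    """
    problems = []
    if len(chunk_sizes) > num_gpus:
        problems.append(
            f"chunk_sizes has {len(chunk_sizes)} entries "
            f"but only {num_gpus} GPU(s) are available"
        )
    if sum(chunk_sizes) != batch_size:
        problems.append(
            f"chunk_sizes sums to {sum(chunk_sizes)} "
            f"but batch_size is {batch_size}"
        )
    return problems


# Default 8-GPU config from the log, checked against a single-GPU machine.
print(check_chunk_config(48, [6] * 8, num_gpus=1))
# Suggested single-GPU setting.
print(check_chunk_config(3, [3], num_gpus=1))
```

An empty list means the pair is consistent under these assumptions; it says nothing about whether the batch fits in GPU memory.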
Thank you.
Best wishes to you. I have tried it.
Traceback (most recent call last):
File "train.py", line 203, in
Ah, I am going crazy. It still doesn't work. God, help me.
Can I see your config/CenterNet-104.json?
{
"system": {
"dataset": "MSCOCO",
"batch_size": 48,
"sampling_function": "kp_detection",
"train_split": "trainval",
"val_split": "minival",
"learning_rate": 0.00025,
"decay_rate": 10,
"val_iter": 500,
"opt_algo": "adam",
"prefetch_size": 6,
"max_iter": 480000,
"stepsize": 450000,
"snapshot": 5000,
"chunk_sizes": [6,6,6,6,6,6,6,6],
"data_dir": "./data"
},
"db": {
"rand_scale_min": 0.6,
"rand_scale_max": 1.4,
"rand_scale_step": 0.1,
"rand_scales": null,
"rand_crop": true,
"rand_color": true,
"border": 128,
"gaussian_bump": true,
"input_size": [511, 511],
"output_sizes": [[128, 128]],
"test_scales": [1],
"top_k": 70,
"categories": 80,
"kp_categories": 1,
"ae_threshold": 0.5,
"nms_threshold": 0.5,
"max_per_image": 100
}
}
What I showed you is the original file. I also modified it, but it still doesn't work. Do you know Chinese?
Can I see your own config/CenterNet-104.json?
I don't have my own config; this is the one I downloaded.
This is what I used for training.
You said you have modified config/CenterNet-104.json, and I want to know what the modified file looks like. The log shows that there are some errors in config/CenterNet-104.json. I need to see the details of config/CenterNet-104.json to help you.
{
"system": {
"dataset": "MSCOCO",
"batch_size": 2,
"sampling_function": "kp_detection",
"train_split": "trainval",
"val_split": "minival",
"learning_rate": 0.00025,
"decay_rate": 10,
"val_iter": 500,
"opt_algo": "adam",
"prefetch_size": 6,
"max_iter": 480000,
"stepsize": 450000,
"snapshot": 5000,
"chunk_sizes": [2,2,2,2,2,2,2,2],
"data_dir": "./data"
},
"db": {
"rand_scale_min": 0.6,
"rand_scale_max": 1.4,
"rand_scale_step": 0.1,
"rand_scales": null,
"rand_crop": true,
"rand_color": true,
"border": 128,
"gaussian_bump": true,
"input_size": [511, 511],
"output_sizes": [[128, 128]],
"test_scales": [1],
"top_k": 70,
"categories": 80,
"kp_categories": 1,
"ae_threshold": 0.5,
"nms_threshold": 0.5,
"max_per_image": 100
}
}
This is the file after my changes.
Modify 'chunk_sizes' to [2], not [2,2,2,2,2,2,2,2]
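For anyone who prefers to script this change, here is a small sketch using only the standard `json` module. `set_single_gpu_batch` is a hypothetical helper, not something the repository provides, and it assumes the config keeps 'batch_size' and 'chunk_sizes' under the "system" key as in the files pasted above.

```python
import json


def set_single_gpu_batch(config_text, batch_size):
    """Return config JSON text with a single chunk equal to batch_size.

    Hypothetical helper; assumes the "system" layout shown in this thread.
    """
    config = json.loads(config_text)
    config["system"]["batch_size"] = batch_size
    config["system"]["chunk_sizes"] = [batch_size]  # one entry for one GPU
    return json.dumps(config, indent=4)


# A trimmed-down stand-in for config/CenterNet-104.json.
example = '{"system": {"batch_size": 2, "chunk_sizes": [2, 2, 2, 2, 2, 2, 2, 2]}, "db": {}}'
patched = json.loads(set_single_gpu_batch(example, 2))
print(patched["system"])
```

To apply it to the real file, read the file's text, pass it through the helper, and write the result back.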
OK, OK, thank you!
You are really enthusiastic, thank you.
loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=24.58s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=19.16s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=18.12s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=23.61s)
creating index...
index created!
loading from cache file: cache/coco_minival2014.pkl
loading annotations into memory...
Done (t=0.78s)
creating index...
index created!
system config...
{'batch_size': 2, 'cache_dir': 'cache', 'chunk_sizes': [2], 'config_dir': 'config', 'data_dir': './data', 'data_rng': <mtrand.RandomState object at 0x7f75971ee900>, 'dataset': 'MSCOCO', 'decay_rate': 10, 'display': 5, 'learning_rate': 0.00025, 'max_iter': 480000, 'nnet_rng': <mtrand.RandomState object at 0x7f75971ee948>, 'opt_algo': 'adam', 'prefetch_size': 6, 'pretrain': None, 'result_dir': 'results', 'sampling_function': 'kp_detection', 'snapshot': 5000, 'snapshot_name': 'CenterNet-104', 'stepsize': 450000, 'test_split': 'testdev', 'train_split': 'trainval', 'val_iter': 500, 'val_split': 'minival', 'weight_decay': False, 'weight_decay_rate': 1e-05, 'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5, 'border': 128, 'categories': 80, 'data_aug': True, 'gaussian_bump': True, 'gaussian_iou': 0.7, 'gaussian_radius': -1, 'input_size': [511, 511], 'kp_categories': 1, 'lighting': True, 'max_per_image': 100, 'merge_bbox': False, 'nms_algorithm': 'exp_soft_nms', 'nms_kernel': 3, 'nms_threshold': 0.5, 'output_sizes': [[128, 128]], 'rand_color': True, 'rand_crop': True, 'rand_pushes': False, 'rand_samples': False, 'rand_scale_max': 1.4, 'rand_scale_min': 0.6, 'rand_scale_step': 0.1, 'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]), 'special_crop': False, 'test_scales': [1], 'top_k': 70, 'weight_exp': 8}
len of db: 118287
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-104
start prefetching data...
shuffling indices...
total parameters: 210062960
setting learning rate to: 0.00025
training start...
0%| | 0/480000 [00:00<?, ?it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532502421238/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
File "train.py", line 203, in
Hello, I followed what you said, but it still doesn't work.
What's your CUDA version?
CUDA 10
The version may be too high; try CUDA 8.0 or CUDA 9.0. See this: https://github.com/sangwoomo/instagan/issues/4
Hello author, sorry to bother you again. I really appreciate your help yesterday. I changed CUDA to 9, but I still can't train, although I can run the test.
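When comparing versions like this, it can help to print what the installed PyTorch build itself reports, since the CUDA version PyTorch was built against can differ from the toolkit installed on the system. The sketch below is only an illustration; `describe_torch_environment` is a hypothetical helper, and it falls back to empty values if `torch` cannot be imported.

```python
def describe_torch_environment():
    """Collect the PyTorch version, its CUDA build version, and the GPU count.

    Hypothetical helper for pasting into an issue report.
    """
    try:
        import torch
    except ImportError:
        # torch is not installed in this interpreter.
        return {"torch": None, "cuda_build": None, "visible_gpus": 0}
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "visible_gpus": torch.cuda.device_count(),
    }


info = describe_torch_environment()
print(info)
```

The "visible_gpus" value is also a useful upper bound on how many entries 'chunk_sizes' should have.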
loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=22.31s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=19.15s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=18.04s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=23.53s)
creating index...
index created!
loading from cache file: cache/coco_minival2014.pkl
loading annotations into memory...
Done (t=0.78s)
creating index...
index created!
system config...
{'batch_size': 1, 'cache_dir': 'cache', 'chunk_sizes': [1], 'config_dir': 'config', 'data_dir': './data', 'data_rng': <mtrand.RandomState object at 0x7fd20efd0870>, 'dataset': 'MSCOCO', 'decay_rate': 10, 'display': 5, 'learning_rate': 0.00025, 'max_iter': 480000, 'nnet_rng': <mtrand.RandomState object at 0x7fd20efd08b8>, 'opt_algo': 'adam', 'prefetch_size': 6, 'pretrain': None, 'result_dir': 'results', 'sampling_function': 'kp_detection', 'snapshot': 5000, 'snapshot_name': 'CenterNet-104', 'stepsize': 450000, 'test_split': 'testdev', 'train_split': 'trainval', 'val_iter': 500, 'val_split': 'minival', 'weight_decay': False, 'weight_decay_rate': 1e-05, 'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5, 'border': 128, 'categories': 80, 'data_aug': True, 'gaussian_bump': True, 'gaussian_iou': 0.7, 'gaussian_radius': -1, 'input_size': [511, 511], 'kp_categories': 1, 'lighting': True, 'max_per_image': 100, 'merge_bbox': False, 'nms_algorithm': 'exp_soft_nms', 'nms_kernel': 3, 'nms_threshold': 0.5, 'output_sizes': [[128, 128]], 'rand_color': True, 'rand_crop': True, 'rand_pushes': False, 'rand_samples': False, 'rand_scale_max': 1.4, 'rand_scale_min': 0.6, 'rand_scale_step': 0.1, 'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]), 'special_crop': False, 'test_scales': [1], 'top_k': 70, 'weight_exp': 8}
len of db: 118287
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-104
start prefetching data...
shuffling indices...
total parameters: 210062960
setting learning rate to: 0.00025
training start...
0%| | 0/480000 [00:00<?, ?it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532502421238/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
File "train.py", line 203, in
loading parameters at iteration: 480000
building neural network...
module_file: models.CenterNet-104
total parameters: 210062960
loading parameters...
loading model from cache/nnet/CenterNet-104/CenterNet-104_480000.pkl
locating kps: 0%| | 0/5000 [00:00<?, ?it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532502421238/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
/home/xinyu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/upsampling.py:122: UserWarning: nn.Upsampling is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.Upsampling is deprecated. Use nn.functional.interpolate instead.")
locating kps: 72%|██████████████████ | 3624/5000 [34:05<14:17, 1.60it/
This is the test run.
How about now?
RuntimeError: Expected object of type CUDAByteType but found type CUDAFloatType for argument #0 'result' (checked_cast_tensor at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/ATen/Utils.h:30)
frame #0:
This error occurs and it still cannot run.
Hello, I have the same issue, and I have two GPUs. Do I need to change the config as above? If so, please tell me how. I have been stuck on this for 2 weeks, and I would really appreciate any help fixing it.
please show your log
Thank you for replying so fast, but something suddenly went wrong with my environment. I'll post it later. Thanks again.
loading all datasets...
using 4 threads
loading from cache file: cache/coco_hepaticvessel_001.pkl
No cache file found...
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
49it [00:00, 36524.06it/s]
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading from cache file: cache/coco_hepaticvessel_001.pkl
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading from cache file: cache/coco_hepaticvessel_001.pkl
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading from cache file: cache/coco_hepaticvessel_001.pkl
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading from cache file: cache/coco_hepaticvessel_001.pkl
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
system config...
{'batch_size': 48, 'cache_dir': 'cache', 'chunk_sizes': [6, 6, 6, 6, 6, 6, 6, 6], 'config_dir': 'config', 'data_dir': './data', 'data_rng': <mtrand.RandomState object at 0x7f5fd2cc5ab0>, 'dataset': 'MSCOCO', 'decay_rate': 10, 'display': 5, 'learning_rate': 0.00025, 'max_iter': 480000, 'nnet_rng': <mtrand.RandomState object at 0x7f5fd2cc5af8>, 'opt_algo': 'adam', 'prefetch_size': 6, 'pretrain': None, 'result_dir': 'results', 'sampling_function': 'kp_detection', 'snapshot': 5000, 'snapshot_name': 'CenterNet-104', 'stepsize': 450000, 'test_split': 'testdev', 'train_split': 'trainval', 'val_iter': 500, 'val_split': 'minival', 'weight_decay': False, 'weight_decay_rate': 1e-05, 'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5, 'border': 128, 'categories': 80, 'data_aug': True, 'gaussian_bump': True, 'gaussian_iou': 0.7, 'gaussian_radius': -1, 'input_size': [512, 512], 'kp_categories': 1, 'lighting': True, 'max_per_image': 100, 'merge_bbox': False, 'nms_algorithm': 'exp_soft_nms', 'nms_kernel': 3, 'nms_threshold': 0.5, 'output_sizes': [[128, 128]], 'rand_color': True, 'rand_crop': True, 'rand_pushes': False, 'rand_samples': False, 'rand_scale_max': 1.4, 'rand_scale_min': 0.6, 'rand_scale_step': 0.1, 'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]), 'special_crop': False, 'test_scales': [1], 'top_k': 70, 'weight_exp': 8}
len of db: 49
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-104
start prefetching data...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
shuffling indices...
total parameters: 210062960
setting learning rate to: 0.00025
training start...
0%| | 0/480000 [00:00<?, ?it/s]shuffling indices...
shuffling indices...
Traceback (most recent call last):
File "train.py", line 203, in
Modify 'batch_size' to 3 and 'chunk_sizes' to [3] in config/CenterNet-104.json. If out of memory, then modify 'batch_size' to 2 and 'chunk_sizes' to [2]
I'll try it, thank you. I've also tried 'batch_size' 8 with 'chunk_sizes' [4,4], and it worked. I wonder if most of these issues come from how 'batch_size' and 'chunk_sizes' are set?
Yes
Thank you for your answer and patience
No problem
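The values that worked in this exchange ('batch_size' 8 with 'chunk_sizes' [4,4] on two GPUs) follow a simple pattern: one chunk per GPU, with the chunks adding up to the batch size. The sketch below generates such a split; `split_batch` is a hypothetical helper and the even split is an assumption made for illustration, not a rule stated by the repository.

```python
def split_batch(batch_size, num_gpus):
    """Split batch_size into num_gpus chunks that are as even as possible.

    Hypothetical helper; any remainder goes to the first chunks.
    """
    if num_gpus < 1 or batch_size < num_gpus:
        raise ValueError("need at least one sample per GPU")
    base, extra = divmod(batch_size, num_gpus)
    return [base + 1 if i < extra else base for i in range(num_gpus)]


print(split_batch(8, 2))   # the two-GPU setting reported to work above
print(split_batch(48, 8))  # the default eight-GPU setting from the logs
print(split_batch(3, 1))   # the suggested single-GPU setting
```

The returned list can be used as 'chunk_sizes' for the matching 'batch_size'; whether a given chunk fits in memory still depends on the GPU.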
loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=2406.72s)
creating index...
index created!
Traceback (most recent call last):
File "train.py", line 193, in
I'm sorry to disturb you again. When I run this code on a 1080 (CUDA 9 and torch 0.4.1), this happens. How can I solve it?
Try this:
cd
Thank you for answering my question late at night. I did what you said, but this happened.
python setup.py build_ext --inplace
running build_ext
skipping 'pycocotools/_mask.c' Cython extension (up-to-date)
building 'pycocotools._mask' extension
creating build
creating build/common
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/pycocotools
gcc -pthread -B /home/zq/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/zq/anaconda3/lib/python3.6/site-packages/numpy/core/include -I../common -I/home/zq/anaconda3/include/python3.6m -c ../common/maskApi.c -o build/temp.linux-x86_64-3.6/../common/maskApi.o -Wno-cpp -Wno-unused-function -std=c99
../common/maskApi.c: In function ‘rleToBbox’:
../common/maskApi.c:141:31: warning: ‘xp’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if(j%2==0) xp=x; else if(xp<x) { ys=0; ye=h-1; }
^
gcc -pthread -B /home/zq/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/zq/anaconda3/lib/python3.6/site-packages/numpy/core/include -I../common -I/home/zq/anaconda3/include/python3.6m -c pycocotools/_mask.c -o build/temp.linux-x86_64-3.6/pycocotools/_mask.o -Wno-cpp -Wno-unused-function -std=c99
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/pycocotools
gcc -pthread -shared -B /home/zq/anaconda3/compiler_compat -L/home/zq/anaconda3/lib -Wl,-rpath=/home/zq/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.6/../common/maskApi.o build/temp.linux-x86_64-3.6/pycocotools/_mask.o -o build/lib.linux-x86_64-3.6/pycocotools/_mask.cpython-36m-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-3.6/pycocotools/_mask.cpython-36m-x86_64-linux-gnu.so -> pycocotools
rm -rf build
This error occurred when I ran the program.
zq@zq-G1-SNIPER-B7:~/辛宇/CenterNet-master$ python train.py CornerNet
Traceback (most recent call last):
File "train.py", line 18, in
What's your torch version?
Traceback (most recent call last):
File "train.py", line 203, in
train(training_dbs, validation_db, args.start_iter)
File "train.py", line 138, in train
training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(*training)
File "/home/xinyu/CenterNet-master/nnet/py_factory.py", line 82, in train
loss_kp = self.network(xs, ys)
File "/home/xinyu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/xinyu/CenterNet-master/models/py_utils/data_parallel.py", line 66, in forward
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids, self.chunk_sizes)
File "/home/xinyu/CenterNet-master/models/py_utils/data_parallel.py", line 77, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim, chunk_sizes=self.chunk_sizes)
File "/home/xinyu/CenterNet-master/models/py_utils/scatter_gather.py", line 30, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim, chunk_sizes) if inputs else []
File "/home/xinyu/CenterNet-master/models/py_utils/scatter_gather.py", line 25, in scatter
return scatter_map(inputs)
File "/home/xinyu/CenterNet-master/models/py_utils/scatter_gather.py", line 18, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/xinyu/CenterNet-master/models/py_utils/scatter_gather.py", line 20, in scatter_map
return list(map(list, zip(*map(scatter_map, obj))))
File "/home/xinyu/CenterNet-master/models/py_utils/scatter_gather.py", line 15, in scatter_map
return Scatter.apply(target_gpus, chunk_sizes, dim, obj)
File "/home/xinyu/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 87, in forward
outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
File "/home/xinyu/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 142, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error (10): invalid device ordinal (check_status at /opt/conda/conda-bld/pytorch_1532502421238/work/aten/src/ATen/cuda/detail/CUDAHooks.cpp:36)
frame #0: torch::cuda::scatter(at::Tensor const&, at::ArrayRef, at::optional<std::vector<long, std::allocator > > const&, long, at::optional<std::vector<CUDAStreamInternals , std::allocator<CUDAStreamInternals*> > > const&) + 0x4e1 (0x7fac77038871 in /home/xinyu/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #1: + 0xc42a0b (0x7fac77040a0b in /home/xinyu/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: + 0x38a5cb (0x7fac767885cb in /home/xinyu/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)