Closed: haoran1062 closed this issue 4 years ago
@haoran1062 It seems you have modified the config file. Can you run the ABCNet demo successfully (without changing anything)?
I ran the CTW1500 training and it worked. But when I train on my own dataset, a CUDA error always occurs. I only modified the config like this:
```yaml
_BASE_: "../Base-BAText.yaml"
MODEL:
  BATEXT:
    POOLER_RESOLUTION: (8,128)
    NUM_CHARS: 6900
  FCOS:
    INFERENCE_TH_TEST: 0.6
DATASETS:
  TRAIN: ("my_train",)
  TEST: ("my_test",)
INPUT:
  MIN_SIZE_TEST: 1024
  MAX_SIZE_TEST: 2240
```
and I printed out my data; it looks all right...
[{'file_name': 'datasets/my/images/1104.jpg', 'height': 4029, 'width': 3021, 'image_id': 1104, 'image': tensor([[[184., 184., 185., ..., 204., 201., 200.],
[182., 184., 186., ..., 204., 200., 200.],
[185., 185., 185., ..., 202., 202., 202.],
...,
[182., 183., 185., ..., 188., 189., 189.],
[181., 183., 186., ..., 188., 189., 190.],
[179., 182., 184., ..., 190., 190., 190.]],
[[191., 191., 190., ..., 219., 217., 217.],
[189., 191., 191., ..., 219., 217., 217.],
[190., 191., 190., ..., 217., 219., 219.],
...,
[181., 181., 181., ..., 197., 200., 200.],
[180., 180., 182., ..., 197., 199., 200.],
[178., 181., 181., ..., 199., 199., 199.]],
[[200., 200., 199., ..., 228., 226., 226.],
[198., 200., 200., ..., 228., 226., 226.],
[199., 200., 199., ..., 226., 228., 228.],
...,
[183., 181., 183., ..., 206., 208., 208.],
[182., 182., 184., ..., 206., 207., 208.],
[180., 183., 183., ..., 209., 208., 208.]]]), 'instances': Instances(num_instances=47, image_height=704, image_width=967, fields=[gt_boxes: Boxes(tensor([[173.7682, 364.5951, 249.3772, 384.8136],
[ 25.2030, 452.0979, 321.3385, 476.2938],
[162.1615, 97.1149, 258.9942, 118.9906],
[169.4571, 234.9981, 458.9602, 259.5254],
[ 27.5243, 494.8550, 286.1869, 519.0508],
[ 79.5885, 0.0000, 354.1687, 34.1394],
[ 16.5809, 298.9680, 97.4959, 323.1638],
[ 26.5295, 472.3164, 263.6368, 497.5066],
[153.5394, 298.9680, 214.2256, 319.1864],
[ 17.5758, 320.8437, 109.1025, 345.3710],
[164.8145, 130.9228, 316.6958, 154.1243],
[540.8701, 362.9379, 646.6564, 385.1450],
[ 25.2030, 408.3465, 179.7373, 432.5424],
[809.1495, 357.6346, 948.0977, 381.1676],
[ 10.6118, 234.0038, 134.9688, 261.8456],
[165.1461, 164.7307, 318.3539, 188.5951],
[ 23.8765, 430.8851, 286.1869, 455.0810],
[ 76.9355, 27.8418, 349.8577, 50.0490],
[698.0573, 490.2147, 749.7898, 507.7816],
[440.0580, 408.6780, 521.9678, 426.2448],
[678.1602, 361.9435, 766.3707, 384.1507],
[441.3844, 429.5593, 521.9678, 447.4576],
[ 11.2750, 200.8588, 136.6269, 225.3861],
[ 9.6169, 129.5970, 132.3158, 154.7872],
[413.8601, 363.6007, 521.9678, 385.8079],
[839.9901, 487.2316, 927.8690, 506.4557],
[ 9.2853, 94.7947, 129.6629, 121.9736],
[ 12.2699, 163.0734, 131.9842, 189.5894],
[458.9602, 493.5292, 524.2891, 512.7533],
[441.3844, 450.7721, 522.9626, 469.9962],
[446.3587, 472.3164, 524.2891, 490.2147],
[582.6540, 428.5650, 607.1938, 447.4576],
[151.8813, 320.8437, 220.1948, 342.3879],
[400.2637, 43.7514, 598.5717, 83.8569],
[162.8248, 197.8757, 259.9890, 224.3917],
[582.9856, 408.3465, 605.8673, 424.9190],
[584.3121, 448.4520, 605.8673, 467.6761],
[585.3069, 469.6648, 607.1938, 490.2147],
[586.6334, 493.5292, 609.5151, 510.1017],
[689.4352, 407.0207, 752.4427, 424.9190],
[692.0881, 427.2392, 753.7692, 445.1375],
[694.4095, 447.1262, 749.7898, 466.3503],
[695.7360, 469.6648, 749.7898, 486.2373],
[859.8871, 400.7232, 897.6917, 419.9473],
[862.5401, 424.5876, 902.6660, 442.4859],
[862.5401, 444.8060, 900.3447, 462.3729],
[862.5401, 464.6930, 900.3447, 484.9115]])), gt_classes: tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), beziers: tensor([[173.7682, 364.5951, 198.8618, 364.5951, 223.9520, 364.5951, 249.0456,
364.5951, 249.0456, 384.4821, 223.9520, 384.4821, 198.8618, 384.4821,
173.7682, 384.4821],
[ 25.2030, 454.7495, 122.9111, 453.2381, 220.6392, 453.1055, 318.3539,
452.0979, 321.0069, 472.3164, 222.4034, 473.4135, 123.7999, 474.3515,
25.2030, 475.9623],
[164.1512, 97.4463, 194.6535, 96.7105, 225.1691, 97.3502, 255.6780,
97.1149, 258.6625, 118.6591, 226.4955, 118.6591, 194.3285, 118.6591,
162.1615, 118.6591],
[169.4571, 234.9981, 265.8487, 234.9981, 362.2370, 234.9981, 458.6286,
234.9981, 458.6286, 259.1940, 362.2370, 259.1940, 265.8487, 259.1940,
169.4571, 259.1940],
[ 27.5243, 497.1751, 113.6358, 496.3995, 199.7439, 495.6273, 285.8553,
494.8550, 285.8553, 514.7420, 200.6260, 515.8590, 115.4000, 517.0853,
30.1773, 518.7194],
[ 79.5885, 0.0000, 171.0058, 1.6539, 262.4297, 2.8372, 353.8371,
4.9718, 350.8525, 33.8079, 260.5528, 31.3187, 170.2298, 29.8570,
79.9201, 27.8418],
[ 16.5809, 300.6252, 42.9944, 299.6441, 69.4144, 299.4619, 95.8378,
298.9680, 97.1643, 321.8380, 70.2965, 321.6557, 43.4454, 322.5275,
16.5809, 322.8324],
[ 26.5295, 477.2881, 104.5560, 474.9945, 182.6158, 474.0996, 260.6523,
472.3164, 263.3052, 492.2034, 185.1526, 493.8573, 106.9901, 495.0340,
28.8508, 497.1751],
[153.5394, 298.9680, 173.6587, 298.9680, 193.7747, 298.9680, 213.8940,
298.9680, 213.8940, 318.8550, 193.7747, 318.8550, 173.6587, 318.8550,
153.5394, 318.8550],
[ 17.5758, 320.8437, 47.5375, 320.6614, 77.4860, 321.5431, 107.4444,
321.8380, 108.7709, 342.0565, 78.9087, 342.6100, 49.0796, 343.9922,
19.2339, 345.0396],
[164.8145, 130.9228, 215.3299, 130.9228, 265.8487, 130.9228, 316.3642,
130.9228, 316.3642, 153.7928, 265.8487, 153.7928, 215.3299, 153.7928,
164.8145, 153.7928],
[540.8701, 362.9379, 576.0216, 362.9379, 611.1732, 362.9379, 646.3248,
362.9379, 646.3248, 384.8136, 611.1732, 384.8136, 576.0216, 384.8136,
540.8701, 384.8136],
[ 25.2030, 411.9925, 76.1496, 410.2523, 127.1194, 409.5928, 178.0792,
408.3465, 179.4057, 429.5593, 128.0048, 430.4410, 76.5940, 430.8354,
25.2030, 432.2109],
[809.1495, 357.6346, 855.3539, 357.6346, 901.5617, 357.6346, 947.7661,
357.6346, 947.7661, 380.1732, 902.5532, 379.9711, 857.3470, 380.5577,
812.1341, 380.8362],
[ 12.9331, 234.9981, 53.0590, 234.5440, 93.1815, 233.9474, 133.3107,
234.0038, 134.6372, 258.5311, 93.2843, 259.1012, 51.9547, 260.4535,
10.6118, 261.5141],
[165.1461, 164.7307, 216.1059, 164.7307, 267.0624, 164.7307, 318.0223,
164.7307, 318.0223, 188.2637, 267.0624, 188.2637, 216.1059, 188.2637,
165.1461, 188.2637],
[ 23.8765, 433.5367, 111.1951, 432.0220, 198.5301, 431.8961, 285.8553,
430.8851, 285.8553, 449.7778, 198.5301, 451.4317, 111.1917, 452.5984,
23.8765, 454.7495],
[ 78.5936, 27.8418, 168.9033, 29.3930, 259.2164, 30.9409, 349.5261,
32.4821, 349.5261, 49.7175, 258.6824, 46.6682, 167.7957, 44.9877,
76.9355, 42.4256],
[698.0573, 490.2147, 715.1920, 490.2147, 732.3234, 490.2147, 749.4582,
490.2147, 749.4582, 506.7872, 732.7579, 506.5784, 716.0807, 507.1717,
699.3837, 507.4501],
[440.0580, 409.6723, 466.4747, 409.2216, 492.8914, 408.6150, 519.3148,
408.6780, 521.6362, 425.9134, 494.8845, 425.9134, 468.1361, 425.9134,
441.3844, 425.9134],
[679.4866, 361.9435, 708.3375, 361.9435, 737.1883, 361.9435, 766.0391,
361.9435, 766.0391, 383.8192, 736.7472, 383.3983, 707.4587, 383.0701,
678.1602, 383.1563],
[441.3844, 429.8908, 468.1262, 429.1549, 494.8878, 429.7946, 521.6362,
429.5593, 521.6362, 447.1262, 494.8845, 447.1262, 468.1361, 447.1262,
441.3844, 447.1262],
[ 13.5964, 200.8588, 53.9444, 200.8588, 94.2891, 200.8588, 134.6372,
200.8588, 136.2953, 224.0603, 94.6174, 223.8846, 52.9496, 224.7431,
[ 9.6169, 129.5970, 50.4060, 129.5970, 91.1951, 129.5970, 131.9842,
129.5970, 131.9842, 154.4557, 91.1951, 154.4557, 50.4060, 154.4557,
9.6169, 154.4557],
[413.8601, 363.6007, 449.7843, 363.6007, 485.7119, 363.6007, 521.6362,
363.6007, 521.6362, 385.4765, 485.7119, 385.4765, 449.7843, 385.4765,
413.8601, 385.4765],
[839.9901, 487.2316, 868.7314, 487.2316, 897.4695, 487.2316, 926.2109,
487.2316, 927.5374, 504.7985, 898.6866, 505.1200, 869.8291, 505.2924,
840.9849, 506.1243],
[ 12.2699, 95.7891, 51.2881, 95.3383, 90.3064, 94.7351, 129.3313,
94.7947, 129.3313, 119.3220, 89.3148, 120.0877, 49.3017, 120.8666,
9.2853, 121.6422],
[ 12.2699, 166.3880, 51.5070, 165.1616, 90.7408, 163.7794, 129.9945,
163.0734, 131.6526, 187.9322, 92.2994, 188.2570, 52.9429, 188.4227,
13.5964, 189.2580],
[459.9551, 494.1921, 481.2848, 493.7711, 502.6178, 493.4364, 523.9575,
493.5292, 521.6362, 512.4219, 500.7442, 512.4219, 479.8522, 512.4219,
458.9602, 512.4219],
[443.3741, 451.1036, 469.4526, 450.3678, 495.5510, 451.0108, 521.6362,
450.7721, 522.6310, 468.3390, 495.5477, 468.6638, 468.4611, 468.8329,
441.3844, 469.6648],
[446.3587, 473.3107, 471.4490, 472.8567, 496.5392, 472.2567, 521.6362,
472.3164, 523.9575, 489.8832, 498.8639, 489.8832, 473.7736, 489.8832,
448.6801, 489.8832],
[583.6488, 428.5650, 589.9496, 428.5650, 596.2504, 428.5650, 602.5511,
428.5650, 606.8621, 447.1262, 598.8204, 446.3970, 590.7189, 447.0201,
582.6540, 446.7947],
[153.8711, 320.8437, 174.9852, 320.8437, 196.0961, 320.8437, 217.2102,
320.8437, 219.8632, 342.0565, 197.2037, 342.0565, 174.5408, 342.0565,
151.8813, 342.0565],
[400.2637, 43.7514, 466.2558, 43.7514, 532.2479, 43.7514, 598.2401,
43.7514, 598.2401, 83.5254, 532.2479, 83.5254, 466.2558, 83.5254,
400.2637, 83.5254],
[162.8248, 197.8757, 195.1012, 197.8757, 227.3810, 197.8757, 259.6574,
197.8757, 259.6574, 224.0603, 227.3810, 224.0603, 195.1012, 224.0603,
162.8248, 224.0603],
[582.9856, 408.3465, 590.5034, 408.3465, 598.0179, 408.3465, 605.5356,
408.3465, 605.5356, 424.5876, 598.0179, 424.5876, 590.5034, 424.5876,
582.9856, 424.5876],
[584.3121, 448.4520, 591.3855, 448.4520, 598.4622, 448.4520, 605.5356,
448.4520, 605.5356, 467.3446, 598.4622, 467.3446, 591.3855, 467.3446,
584.3121, 467.3446],
[585.3069, 469.6648, 592.4931, 469.6648, 599.6760, 469.6648, 606.8621,
469.6648, 606.8621, 489.8832, 599.6760, 489.8832, 592.4931, 489.8832,
585.3069, 489.8832],
[586.6334, 493.5292, 594.1512, 493.5292, 601.6656, 493.5292, 609.1835,
493.5292, 609.1835, 509.7702, 601.6656, 509.7702, 594.1512, 509.7702,
586.6334, 509.7702],
[689.4352, 407.0207, 710.3271, 407.0207, 731.2191, 407.0207, 752.1111,
407.0207, 752.1111, 424.5876, 731.2191, 424.5876, 710.3271, 424.5876,
689.4352, 424.5876],
[692.0881, 427.2392, 712.5391, 427.2392, 732.9866, 427.2392, 753.4376,
427.2392, 753.4376, 444.8060, 732.9866, 444.8060, 712.5391, 444.8060,
692.0881, 444.8060],
[694.4095, 447.1262, 712.7579, 447.1262, 731.1097, 447.1262, 749.4582,
447.1262, 749.4582, 466.0188, 731.1097, 466.0188, 712.7579, 466.0188,
694.4095, 466.0188],
[695.7360, 469.6648, 713.6434, 469.6648, 731.5508, 469.6648, 749.4582,
469.6648, 749.4582, 485.9059, 731.5508, 485.9059, 713.6434, 485.9059,
695.7360, 485.9059],
[859.8871, 400.7232, 872.3793, 400.7232, 884.8680, 400.7232, 897.3601,
400.7232, 897.3601, 419.6158, 884.8680, 419.6158, 872.3793, 419.6158,
859.8871, 419.6158],
[862.5401, 424.5876, 875.8049, 424.5876, 889.0696, 424.5876, 902.3344,
424.5876, 902.3344, 442.1544, 889.0696, 442.1544, 875.8049, 442.1544,
862.5401, 442.1544],
[862.5401, 444.8060, 875.0322, 444.8060, 887.5209, 444.8060, 900.0131,
444.8060, 900.0131, 462.0414, 887.5209, 462.0414, 875.0322, 462.0414,
862.5401, 462.0414],
[862.5401, 464.6930, 875.0322, 464.6930, 887.5209, 464.6930, 900.0131,
464.6930, 900.0131, 484.5800, 887.5209, 484.5800, 875.0322, 484.5800,
862.5401, 484.5800]]), text: tensor([[ 359, 6307, 5360, ..., 6830, 6830, 6830],
[ 143, 1970, 6288, ..., 6830, 6830, 6830],
[ 15, 16, 18, ..., 6830, 6830, 6830],
...,
[4305, 5701, 6830, ..., 6830, 6830, 6830],
[4305, 5701, 6830, ..., 6830, 6830, 6830],
[4305, 5701, 6830, ..., 6830, 6830, 6830]], dtype=torch.int32)])}]
and this error occurred:
/opt/conda/conda-bld/pytorch_1587428398394/work/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
nonzero(Tensor input, *, bool as_tuple)
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
cuda runtime error (700) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/THC/THCCachingHostAllocator.cpp:278
Traceback (most recent call last):
File "tools/train_net.py", line 243, in <module>
args=(args,),
File "/data/projects/detectron2/detectron2/engine/launch.py", line 57, in launch
main_func(*args)
File "tools/train_net.py", line 231, in main
return trainer.train()
File "tools/train_net.py", line 113, in train
self.train_loop(self.start_iter, self.max_iter)
File "tools/train_net.py", line 102, in train_loop
self.run_step()
File "/data/projects/detectron2/detectron2/engine/train_loop.py", line 217, in run_step
print(loss_dict)
File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 162, in __repr__
return torch._tensor_str._str(self)
File "/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py", line 315, in _str
tensor_str = _tensor_str(self, indent)
File "/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py", line 213, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/opt/conda/lib/python3.7/site-packages/torch/_tensor_str.py", line 88, in __init__
nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: CUDA error: an illegal memory access was encountered
We haven't tested it on Chinese datasets. Please make sure you have enough GPU memory.
But it was never CUDA out of memory. I use a single 1080Ti with batch size = 1, and the GPU memory usage was only about 800 MB when the error RuntimeError: CUDA error: an illegal memory access was encountered occurred.
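As a side note, illegal-memory-access errors are raised asynchronously, so the line in the stack trace (here the `print(loss_dict)` inside `run_step`) is often not the operation that actually faulted. A common first debugging step, sketched below, is to force synchronous kernel launches via the standard `CUDA_LAUNCH_BLOCKING` environment variable before anything initializes CUDA:

```python
import os

# Must be set before the first CUDA call (i.e. before importing code that
# initializes CUDA), so kernels launch synchronously and the Python stack
# trace points at the op that actually faulted.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])  # -> 1
```

With this set, re-running `tools/train_net.py` usually moves the reported error from an unrelated line to the real faulting operation (at the cost of slower training, so it is for debugging only).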
@haoran1062 NUM_CHARS represents the maximum text length that can exist in one instance, which can hardly be 6900.
If you want to change the number of classes, change MODEL.BATEXT.VOC_SIZE. Also pay attention to the class index of the "EOF" symbol.
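Out-of-range label indices are a classic cause of illegal memory access on the GPU (an out-of-bounds gather in a CUDA kernel). In the dump above, the `text` tensors are padded with 6830, which looks like the padding/"EOF" index here, so every character index should lie within the vocabulary. A quick sanity check along these lines (a hypothetical helper in plain Python, not part of AdelaiDet) can catch bad annotations before training:

```python
def check_text_labels(text_rows, voc_size):
    """Sanity-check character labels: every index must lie in
    [0, voc_size], where index voc_size is the padding/"EOF" class.
    Returns (row, value) pairs that fall out of range."""
    bad = []
    for i, row in enumerate(text_rows):
        for v in row:
            if not (0 <= v <= voc_size):
                bad.append((i, v))
    return bad

# Toy example: vocabulary of 6830 characters, padding index 6830.
rows = [
    [359, 6307, 5360, 6830, 6830],   # in range
    [4305, 5701, 6900, 6830, 6830],  # 6900 is out of range
]
print(check_text_labels(rows, 6830))  # -> [(1, 6900)]
```

Running such a check over the full dataset before training is much cheaper than chasing an asynchronous CUDA fault afterwards.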
Thanks, that's my bad... now this error is fixed, but another error occurred:
Traceback (most recent call last):
File "tools/train_net.py", line 243, in <module>
args=(args,),
File "/data/projects/detectron2/detectron2/engine/launch.py", line 57, in launch
main_func(*args)
File "tools/train_net.py", line 231, in main
return trainer.train()
File "tools/train_net.py", line 113, in train
self.train_loop(self.start_iter, self.max_iter)
File "tools/train_net.py", line 102, in train_loop
self.run_step()
File "/data/projects/detectron2/detectron2/engine/train_loop.py", line 209, in run_step
data = next(self._data_loader_iter)
File "/data/projects/detectron2/detectron2/data/common.py", line 142, in __iter__
for d in self.dataset:
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/data/projects/detectron2/detectron2/data/common.py", line 41, in __getitem__
data = self._map_func(self._dataset[cur_idx])
File "/data/projects/detectron2/detectron2/utils/serialize.py", line 23, in __call__
return self._obj(*args, **kwargs)
File "/data/projects/AdelaiDet-master/adet/data/dataset_mapper.py", line 94, in __call__
raise e
File "/data/projects/AdelaiDet-master/adet/data/dataset_mapper.py", line 91, in __call__
image, transforms = T.apply_transform_gens(self.tfm_gens, image)
File "/data/projects/detectron2/detectron2/data/transforms/transform_gen.py", line 535, in apply_transform_gens
tfm = g.get_transform(img) if isinstance(g, TransformGen) else g
File "/data/projects/detectron2/detectron2/data/transforms/transform_gen.py", line 251, in get_transform
newh = int(newh + 0.5)
ValueError: cannot convert float NaN to integer
But I checked the images; width and height are not zero.
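The failing line comes from detectron2's shortest-edge resize, which scales the image so its shorter side matches a target size and then rounds. A NaN can appear there even when the stored width/height are fine, e.g. if an earlier transform (such as a crop built from a degenerate box) produced a zero-sized array. A minimal sketch of the arithmetic (not detectron2's actual code) shows where the NaN comes from:

```python
import math

def short_edge_resize_dims(h, w, short_edge, max_size):
    """Sketch of shortest-edge resize: scale so the shorter side equals
    short_edge, capped so the longer side does not exceed max_size."""
    scale = short_edge * 1.0 / min(h, w)
    if h < w:
        newh, neww = short_edge, scale * w
    else:
        newh, neww = scale * h, short_edge
    if max(newh, neww) > max_size:
        r = max_size * 1.0 / max(newh, neww)
        newh, neww = newh * r, neww * r
    return int(newh + 0.5), int(neww + 0.5)

# Normal case, using the image size from the dump above:
print(short_edge_resize_dims(4029, 3021, 1024, 2240))  # -> (1366, 1024)

# Degenerate case: a zero-sized side makes the scale infinite, and
# inf * 0 is NaN, which is what int(newh + 0.5) then chokes on.
h, w = 0, 3021
scale = 1024 * 1.0 / min(h, w) if min(h, w) else float("inf")
print(math.isnan(scale * h))  # -> True
```

So the NaN points at the image fed into the resize transform being degenerate at that moment, not at the annotated width/height in the dataset dict.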
@haoran1062 This issue might happen if you have illegal annotations. Also make sure you use the latest version of this project.
I found this error is caused by the crop function. After setting crop_gen = False, I could continue training the model. Also, 6900 classes is too large and causes CUDA OOM. But I found that with a single card, training could continue after the OOM, while multi-GPU training just hangs.
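For reference, the crop augmentation behind `crop_gen` is normally controlled from the config rather than by editing the mapper code; in the standard detectron2 config schema (key names assumed, check your version) it can be disabled with:

```yaml
INPUT:
  CROP:
    ENABLED: False
```

This keeps the change in the same yaml override file as the rest of the setup, so it is easy to re-enable once the offending annotations are found.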
@haoran1062 Glad to hear that you can train now.
Just a reminder: we have tested the crop_gen function before, and it should work. There must be something wrong somewhere.
I don't know how to solve this bug; I'm not sure what happened...