Closed eddyxu closed 2 years ago
It appears to me that this was due to small-file access on S3; the Rikai Dataset + Pytorch DataLoader does not issue enough parallel downloads to prefetch the images.
Did not test T4 and A100 / V100 yet. Estimating that a T4 has about half the performance of a Titan RTX, we could expect about 30 images / sec on T4, so we need a large number of DataLoader workers to achieve that.
A potential improvement might be to use asyncio or a thread pool in each worker to issue a larger number of parallel I/Os to S3.
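A minimal sketch of the thread-pool idea, assuming a hypothetical per-object `fetch` callable standing in for an S3 GET (not Rikai's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def prefetch(keys, fetch, max_workers=16):
    """Fetch many small objects concurrently within one DataLoader worker.

    `fetch` is a stand-in for whatever retrieves one object (e.g. an S3
    GET); a thread pool overlaps the per-request latency that dominates
    small-file access.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with keys.
        return list(pool.map(fetch, keys))

if __name__ == "__main__":
    # Stand-in fetch: uppercases the key instead of hitting S3.
    print(prefetch(["img/1.jpg", "img/2.jpg"], fetch=str.upper))
```

Each DataLoader worker process could run one such pool, multiplying the number of in-flight S3 requests beyond the worker count alone.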
Profiling (python -m cProfile) results on the local workstation:
$ python -c "import pstats;from pstats import SortKey;p=pstats.Stats('torch.prof');p.sort_stats(SortKey.TIME, SortKey.CUMULATIVE).print_stats(20)"
Wed Jan 5 19:18:35 2022 torch.prof
3254077 function calls (3225258 primitive calls) in 51.600 seconds
Ordered by: internal time, cumulative time
List reduced from 7431 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
43 20.238 0.471 34.444 0.801 /home/lei/miniconda3/envs/benchmark/lib/python3.9/site-packages/torchvision/models/detection/ssd.py:364(postprocess_detections)
3782 12.103 0.003 12.104 0.003 {method 'to' of 'torch._C._TensorBase' objects}
57408 5.333 0.000 5.333 0.000 {built-in method torch._ops.torchvision.nms}
2573 3.274 0.001 10.670 0.004 /home/lei/miniconda3/envs/benchmark/lib/python3.9/site-packages/torchvision/ops/boxes.py:91(_batched_nms_vanilla)
59817 1.671 0.000 1.671 0.000 {built-in method where}
246384 1.176 0.000 1.176 0.000 {built-in method full_like}
212 1.112 0.005 1.112 0.005 {method 'acquire' of '_thread.lock' objects}
246384 1.032 0.000 1.032 0.000 {method 'topk' of 'torch._C._TensorBase' objects}
1528 0.917 0.001 0.917 0.001 {built-in method conv2d}
98 0.664 0.007 0.664 0.007 {method 'uniform_' of 'torch._C._TensorBase' objects}
2738 0.460 0.000 0.678 0.000 /home/lei/miniconda3/envs/benchmark/lib/python3.9/site-packages/torchvision/models/detection/_utils.py:187(decode_single)
16 0.442 0.028 0.442 0.028 {method 'normal_' of 'torch._C._TensorBase' objects}
19209 0.245 0.000 0.245 0.000 {built-in method tensor}
107/106 0.179 0.002 0.179 0.002 {built-in method _imp.create_dynamic}
246655 0.176 0.000 0.176 0.000 {method 'size' of 'torch._C._TensorBase' objects}
2573 0.173 0.000 0.173 0.000 {built-in method _unique2}
43 0.164 0.004 10.438 0.243 /home/lei/miniconda3/envs/benchmark/lib/python3.9/site-packages/torchvision/models/detection/anchor_utils.py:235(forward)
11348 0.158 0.000 0.158 0.000 {built-in method cat}
8 0.137 0.017 0.137 0.017 {built-in method posix.fork}
253124 0.126 0.000 0.126 0.000 {built-in method builtins.min}
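For reference, the dump-and-inspect workflow above can be reproduced with stdlib Python alone; a minimal sketch, where work() and example.prof are placeholders for the real benchmark script and torch.prof:

```python
import cProfile
import io
import pstats
from pstats import SortKey

def work():
    # Placeholder for the real benchmark workload.
    return sum(i * i for i in range(100_000))

# Collect a profile and dump it to a file, analogous to producing torch.prof.
profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()
profiler.dump_stats("example.prof")

# Load the dump back and print the top entries, sorted the same way as above.
out = io.StringIO()
stats = pstats.Stats("example.prof", stream=out)
stats.sort_stats(SortKey.TIME, SortKey.CUMULATIVE).print_stats(5)
print(out.getvalue())
```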
So the majority of the Python execution time (besides I/O and C/CUDA code) is spent in the SSD model's postprocessing, as well as in moving data from CPU to GPU.
DONE!
Tested Rikai Dataset with Pytorch 1.8+ DataLoader on AWS and a local workstation.
Experiment configurations:
Dataset: Coco, stored as both embedded images and external images.
Model: torchvision.models.detection.SSD
Hardware: AWS EC2 (c5.2xlarge); AWS g4dn.xlarge (Nvidia T4)