eto-ai / rikai

Parquet-based ML data format optimized for working with unstructured data
https://rikai.readthedocs.io/en/latest/
Apache License 2.0

Benchmark performance of loading Rikai datasets #468

Closed eddyxu closed 2 years ago

eddyxu commented 2 years ago

Tested the Rikai Dataset with the PyTorch 1.8+ DataLoader on AWS and on a local workstation.

Experiment configurations:

- Local workstation: AMD 5900X (12C / 24T), 64 GB RAM, Nvidia Titan RTX, 2 TB Samsung 970 Evo Plus (PCIe NVMe)
- AWS c5.2xlarge
- AWS g4dn.xlarge

Dataset: COCO, stored both with embedded images and with external image files. Model: torchvision.models.detection.SSD

|                    | Embedded Image    | External Image    |
| ------------------ | ----------------- | ----------------- |
| Local + no model   | 1150 images / sec | 1100 images / sec |
| Local + SSD        | 61 images / sec   | 60 images / sec   |
| Local + SSD (jit)  | 68 images / sec   | 70 images / sec   |

AWS EC2 (c5.2xlarge)

| Workers | Batch Size | Embedded Image   | External Image  |
| ------- | ---------- | ---------------- | --------------- |
| 8       | 8          | 500 images / sec | 10 images / sec |
| 8       | 64         |                  | 9 images / sec  |
| 16      | 8          |                  | 21 images / sec |
| 32      | 8          |                  | 34 images / sec |
| 32      | 64         |                  | 33 images / sec |

AWS g4dn.xlarge, Nvidia T4

| Workers | Batch Size | Embedded Image  | External Image  |
| ------- | ---------- | --------------- | --------------- |
| 8       | 8          | 30 images / sec |                 |
| 8       | 32         | 35 images / sec | 9 images / sec  |
| 16      | 8          | 30 images / sec |                 |
| 16      | 32         |                 | 17 images / sec |
| 32      | 8          |                 |                 |
| 32      | 64         |                 |                 |
eddyxu commented 2 years ago

It appears to me that this is due to small-file access on S3: the Rikai Dataset + PyTorch DataLoader do not issue enough parallel downloads to prefetch the images.

Did not test A100 / V100 yet. Assuming the T4 delivers about half the performance of the Titan RTX, we could expect about 30 images / sec on a T4, so we would need a large number of DataLoader workers to keep it fed.
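A rough back-of-the-envelope check of that worker count, assuming throughput on external images scales roughly linearly with worker count (as the c5.2xlarge numbers above suggest):

```python
import math

# Observed on c5.2xlarge with external images: ~34 images/sec at 32 workers,
# i.e. roughly 1 image/sec per DataLoader worker when S3 latency dominates.
throughput_per_worker = 34 / 32

# Target: keep a T4 busy at an estimated ~30 images/sec.
target = 30
workers_needed = math.ceil(target / throughput_per_worker)
print(workers_needed)  # 29
```

So saturating even a mid-range GPU would take on the order of 30 workers, which is more processes than these instance types have cores.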

A potential improvement might be to use asyncio or a thread pool within each worker to issue a larger number of parallel I/Os to S3.
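A minimal sketch of the thread-pool variant of that idea. The names here (`fetch_image`, `prefetch_batch`) are hypothetical, and the downloader is a stub; in a real DataLoader worker it would issue an S3 GET (e.g. via boto3) and decode the bytes:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_image(uri):
    # Hypothetical downloader stub; a real implementation would
    # fetch and decode the object behind `uri` from S3.
    return b"image-bytes-for-" + uri.encode()

def prefetch_batch(uris, max_parallel=16):
    # Overlap the per-image S3 round trips inside a single DataLoader
    # worker instead of downloading one image at a time.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(fetch_image, uris))

batch = prefetch_batch([f"s3://bucket/coco/{i}.jpg" for i in range(8)])
print(len(batch))  # 8
```

Because the downloads are I/O-bound, threads (or asyncio tasks) hide the S3 latency without needing more worker processes.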

eddyxu commented 2 years ago

Profiling results (python -m cProfile) on the local workstation:

$ python -c "import pstats;from pstats import SortKey;p=pstats.Stats('torch.prof');p.sort_stats(SortKey.TIME, SortKey.CUMULATIVE).print_stats(20)"
Wed Jan  5 19:18:35 2022    torch.prof

         3254077 function calls (3225258 primitive calls) in 51.600 seconds

   Ordered by: internal time, cumulative time
   List reduced from 7431 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43   20.238    0.471   34.444    0.801 /home/lei/miniconda3/envs/benchmark/lib/python3.9/site-packages/torchvision/models/detection/ssd.py:364(postprocess_detections)
     3782   12.103    0.003   12.104    0.003 {method 'to' of 'torch._C._TensorBase' objects}
    57408    5.333    0.000    5.333    0.000 {built-in method torch._ops.torchvision.nms}
     2573    3.274    0.001   10.670    0.004 /home/lei/miniconda3/envs/benchmark/lib/python3.9/site-packages/torchvision/ops/boxes.py:91(_batched_nms_vanilla)
    59817    1.671    0.000    1.671    0.000 {built-in method where}
   246384    1.176    0.000    1.176    0.000 {built-in method full_like}
      212    1.112    0.005    1.112    0.005 {method 'acquire' of '_thread.lock' objects}
   246384    1.032    0.000    1.032    0.000 {method 'topk' of 'torch._C._TensorBase' objects}
     1528    0.917    0.001    0.917    0.001 {built-in method conv2d}
       98    0.664    0.007    0.664    0.007 {method 'uniform_' of 'torch._C._TensorBase' objects}
     2738    0.460    0.000    0.678    0.000 /home/lei/miniconda3/envs/benchmark/lib/python3.9/site-packages/torchvision/models/detection/_utils.py:187(decode_single)
       16    0.442    0.028    0.442    0.028 {method 'normal_' of 'torch._C._TensorBase' objects}
    19209    0.245    0.000    0.245    0.000 {built-in method tensor}
  107/106    0.179    0.002    0.179    0.002 {built-in method _imp.create_dynamic}
   246655    0.176    0.000    0.176    0.000 {method 'size' of 'torch._C._TensorBase' objects}
     2573    0.173    0.000    0.173    0.000 {built-in method _unique2}
       43    0.164    0.004   10.438    0.243 /home/lei/miniconda3/envs/benchmark/lib/python3.9/site-packages/torchvision/models/detection/anchor_utils.py:235(forward)
    11348    0.158    0.000    0.158    0.000 {built-in method cat}
        8    0.137    0.017    0.137    0.017 {built-in method posix.fork}
   253124    0.126    0.000    0.126    0.000 {built-in method builtins.min}

So the majority of the Python execution time (besides I/O and the C / CUDA kernels) is spent in the SSD model's postprocess_detections, as well as in moving data from CPU to GPU (the `to` calls).

changhiskhan commented 2 years ago

DONE!