DeepRec-AI / HybridBackend

A high-performance framework for training wide-and-deep recommender systems on heterogeneous clusters
Apache License 2.0

[DATA] Implement zero-copied string dtype and accelerate shuffle. #149

Closed by francktcheng 1 year ago

francktcheng commented 1 year ago
  1. Implement a zero-copy path for reading string data from Arrow into TensorFlow (TF).
  2. Accelerate shuffling of string-typed data in ParquetDataset.
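
For readers unfamiliar with the idea, here is a minimal sketch of why an Arrow-style string column lends itself to zero-copy reads. It is not the code in this PR: it uses only the Python standard library, and the buffer names and the `string_views` helper are hypothetical. It assumes the usual layout of one contiguous values buffer plus an offsets buffer, so each string can be exposed as a view into the shared bytes, and a shuffle can permute the views rather than move the bytes.

```python
import random
from array import array

# Arrow-style string column: all bytes live in one contiguous values buffer,
# and offsets[i]:offsets[i + 1] delimits string i.
# (Arrow uses int32 offsets; array("i") is the platform C int, used here
# only for illustration.)
strings = [b"user_1", b"", b"item_42"]
values = bytearray(b"".join(strings))
offsets = array("i", [0])
for s in strings:
    offsets.append(offsets[-1] + len(s))


def string_views(values_buf, offsets_buf):
    """Return one memoryview per string; no string bytes are copied."""
    data = memoryview(values_buf)
    return [
        data[offsets_buf[i]:offsets_buf[i + 1]]
        for i in range(len(offsets_buf) - 1)
    ]


views = string_views(values, offsets)

# Every view still points at the original buffer ...
assert all(v.obj is values for v in views)

# ... so an in-place change to the buffer is visible through the view.
values[0] = ord("U")
assert views[0].tobytes() == b"User_1"

# Shuffling can reorder the lightweight views instead of the string bytes.
order = list(range(len(views)))
random.Random(0).shuffle(order)
shuffled = [views[i] for i in order]
```

The point of the sketch is only that a consumer able to work with (offset, length) views avoids one allocation and one copy per string; how the PR maps this onto TF string tensors is not shown here.
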

Preliminary benchmarking results:

| Dataset | List type | Shuffling | Throughput (samples/s) | Speedup over TFRecord |
| --- | --- | --- | --- | --- |
| TFRecord | N | N | 1404.23 | 1.0 |
| HbParquet | N | N | 41137.53 | 29.3 |
| HbParquet-ZeroCopy | N | N | 51335.40 | 36.56 |
| TFRecord | N | Y | 1343.10 | 1.0 |
| HbParquet | N | Y | 6629.60 | 4.9 |
| HbParquet-ZeroCopy | N | Y | 10941.25 | 8.1 |
| TFRecord | Y | N | 1352.05 | 1.0 |
| HbParquet | Y | N | 2307.33 | 1.71 |
| HbParquet-ZeroCopy | Y | N | 2869.98 | 2.12 |
| TFRecord | Y | Y | 1367.96 | 1.0 |
| HbParquet | Y | Y | 1080.03 | 0.79 |
| HbParquet-ZeroCopy | Y | Y | 1454.02 | 1.06 |

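
The speedup column appears to be each throughput divided by the TFRecord throughput measured under the same list-type and shuffling settings. A small check of that reading, using the numbers reported above (the dictionary layout and the `speedups` helper are illustrative, not part of HybridBackend):

```python
# Throughput in samples/s, copied from the benchmark results and keyed by
# (list type, shuffling).
throughput = {
    ("N", "N"): {"TFRecord": 1404.23, "HbParquet": 41137.53, "HbParquet-ZeroCopy": 51335.40},
    ("N", "Y"): {"TFRecord": 1343.10, "HbParquet": 6629.60, "HbParquet-ZeroCopy": 10941.25},
    ("Y", "N"): {"TFRecord": 1352.05, "HbParquet": 2307.33, "HbParquet-ZeroCopy": 2869.98},
    ("Y", "Y"): {"TFRecord": 1367.96, "HbParquet": 1080.03, "HbParquet-ZeroCopy": 1454.02},
}


def speedups(results, baseline="TFRecord"):
    """Divide each throughput by the baseline from the same configuration."""
    return {
        config: {name: value / row[baseline] for name, value in row.items()}
        for config, row in results.items()
    }


ratios = speedups(throughput)
for config, row in ratios.items():
    for name, ratio in row.items():
        print(config, name, f"{ratio:.2f}")
```

Dividing the HbParquet-ZeroCopy throughput by the plain HbParquet throughput in the same way gives roughly 1.24x to 1.65x across the four configurations, which isolates the gain attributable to the zero-copy path in these preliminary numbers.
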
github-actions[bot] commented 1 year ago

Test Results

  48 files  ±0    48 suites  ±0   1m 53s :stopwatch: -2s
  52 tests  - 1    52 :heavy_check_mark:  - 1    0 :zzz: ±0  0 :x: ±0
 156 runs   - 3  131 :heavy_check_mark:  - 3  25 :zzz: ±0  0 :x: ±0

Results for commit c477fae7. ± Comparison against base commit 0545159a.

This pull request removes 1 test.

```
ParquetDatasetStringTest ‑ test_unbatch_and_to_sparse
```

:recycle: This comment has been updated with latest results.