[DATA] Implement zero-copied string dtype and accelerate shuffle.

DeepRec-AI / HybridBackend

A high-performance framework for training wide-and-deep recommender systems on heterogeneous clusters

Apache License 2.0

Implement a zero-copy path for reading string data from Arrow into TensorFlow.
Accelerate shuffling of string columns in ParquetDataset.
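To illustrate the idea behind both changes, here is a minimal sketch in plain Python. It is not the code in this pull request and does not use the HybridBackend, Arrow, or TensorFlow APIs. It only assumes the Arrow-style layout for string columns (one contiguous values buffer plus an int32 offsets buffer) and shows that individual strings can be exposed as views into that buffer, and that shuffling can permute those views without moving any bytes. The helper names `string_views` and `shuffled_views` are hypothetical.

```python
import random
import struct

# Arrow-style string column: one contiguous values buffer plus int32 offsets.
# String i occupies values[offsets[i]:offsets[i + 1]].
values = bytearray(b"appleparquettf")
offsets = struct.pack("4i", 0, 5, 12, 14)  # native byte order, four int32 values


def string_views(values_buf, offsets_buf):
    """Return one memoryview per string; each view shares memory with values_buf."""
    data = memoryview(values_buf)
    offs = memoryview(offsets_buf).cast("i")
    return [data[offs[i]:offs[i + 1]] for i in range(len(offs) - 1)]


def shuffled_views(views, seed):
    """Shuffle by permuting the views; the underlying bytes are never copied."""
    order = list(range(len(views)))
    random.Random(seed).shuffle(order)
    return [views[i] for i in order]


views = string_views(values, offsets)
shuffled = shuffled_views(views, seed=0)

# Materialize bytes only at the point where a copy is actually needed.
print([v.tobytes() for v in shuffled])
```

Because every view points into the same buffer, the cost of the shuffle in this sketch depends on the number of rows rather than on the total string length, which is the kind of saving a zero-copy design aims for.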

Preliminary benchmarking results

col=300, batch_size=1000
Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz with 128 logical cores.

| Dataset | List type | Shuffling | Throughput (samples/s) | Speedup over TFRecord |
| --- | --- | --- | --- | --- |
| TFRecord | N | N | 1404.23 | 1.0 |
| HbParquet | N | N | 41137.53 | 29.3 |
| HbParquet-ZeroCopy | N | N | 51335.40 | 36.56 |
| TFRecord | N | Y | 1343.10 | 1.0 |
| HbParquet | N | Y | 6629.60 | 4.9 |
| HbParquet-ZeroCopy | N | Y | 10941.25 | 8.1 |
| TFRecord | Y | N | 1352.05 | 1.0 |
| HbParquet | Y | N | 2307.33 | 1.71 |
| HbParquet-ZeroCopy | Y | N | 2869.98 | 2.12 |
| TFRecord | Y | Y | 1367.96 | 1.0 |
| HbParquet | Y | Y | 1080.03 | 0.79 |
| HbParquet-ZeroCopy | Y | Y | 1454.02 | 1.06 |
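
The speedup column is each throughput divided by the TFRecord throughput measured under the same list-type and shuffling setting. The short script below recomputes it from the figures in the table and also derives the gain of HbParquet-ZeroCopy over plain HbParquet, which the table does not list directly. The dictionary layout and the `speedups` helper are illustrative choices, not part of HybridBackend.

```python
# Throughput in samples/s, copied from the benchmark table.
# Key: (list type, shuffling) -> {dataset: throughput}
throughput = {
    ("N", "N"): {"TFRecord": 1404.23, "HbParquet": 41137.53, "HbParquet-ZeroCopy": 51335.40},
    ("N", "Y"): {"TFRecord": 1343.10, "HbParquet": 6629.60, "HbParquet-ZeroCopy": 10941.25},
    ("Y", "N"): {"TFRecord": 1352.05, "HbParquet": 2307.33, "HbParquet-ZeroCopy": 2869.98},
    ("Y", "Y"): {"TFRecord": 1367.96, "HbParquet": 1080.03, "HbParquet-ZeroCopy": 1454.02},
}


def speedups(rows, baseline="TFRecord"):
    """Return each dataset's throughput relative to the baseline dataset."""
    return {name: value / rows[baseline] for name, value in rows.items()}


for (list_type, shuffling), rows in throughput.items():
    relative = speedups(rows)
    zero_copy_gain = rows["HbParquet-ZeroCopy"] / rows["HbParquet"]
    print(
        f"list={list_type} shuffle={shuffling}: "
        f"HbParquet {relative['HbParquet']:.2f}x, "
        f"ZeroCopy {relative['HbParquet-ZeroCopy']:.2f}x over TFRecord; "
        f"ZeroCopy is {zero_copy_gain:.2f}x HbParquet"
    )
```

By this calculation the zero-copy path improves on plain HbParquet in all four configurations, by roughly 1.2x to 1.7x, with the largest gain in the non-list shuffled case.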

Test Results

48 files ±0   48 suites ±0   1m 53s :stopwatch: -2s
52 tests -1:   52 :heavy_check_mark: -1   0 :zzz: ±0   0 :x: ±0
156 runs -3:   131 :heavy_check_mark: -3   25 :zzz: ±0   0 :x: ±0

Results for commit c477fae7. ± Comparison against base commit 0545159a.

This pull request removes 1 test.

```
ParquetDatasetStringTest ‑ test_unbatch_and_to_sparse
```

[DATA] Implement zero-copied string dtype and accelerate shuffle. #149