[DATA] Refine data deduplication and add example.

DeepRec-AI / HybridBackend

A high-performance framework for training wide-and-deep recommender systems on heterogeneous clusters

Apache License 2.0


[DATA] Refine data deduplication and add example. #135

Closed francktcheng closed 1 year ago

francktcheng commented 1 year ago

Fix corner cases in data deduplication.
Add a script to prepare deduplication data files.
Test data deduplication on the DCNv2 model with the Taobao dataset.

End-to-end training on a single A100 GPU, using one day of the Taobao dataset and a batch size of 64000, shows a training throughput improvement of around 1.3x.
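
For readers unfamiliar with the idea, here is a minimal sketch of what value-level deduplication of a feature column can look like. It is not HybridBackend's API or the code in this PR: the `deduplicate` and `restore` helpers and the sample `user_ids` batch are hypothetical, and the sketch assumes the gain comes from repeated feature values (for example, the same user appearing in many samples of a batch) being stored and processed once.

```python
from typing import Hashable, List, Sequence, Tuple


def deduplicate(values: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Split a column into unique values plus an inverse index.

    The result satisfies unique_values[inverse[i]] == values[i] for every i.
    Unique values keep the order of their first occurrence.
    """
    index_of = {}        # value -> position in unique_values
    unique_values = []
    inverse = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(unique_values)
            unique_values.append(v)
        inverse.append(index_of[v])
    return unique_values, inverse


def restore(unique_values: Sequence[Hashable], inverse: Sequence[int]) -> List[Hashable]:
    """Rebuild the original column from its deduplicated form."""
    return [unique_values[i] for i in inverse]


# Hypothetical batch: the same user appears in several consecutive samples.
user_ids = [101, 101, 101, 202, 202, 303]
unique_ids, inverse = deduplicate(user_ids)

# Per-value work (e.g. a lookup) could now run on 3 values instead of 6,
# and be expanded back to per-sample order through `inverse`.
restored = restore(unique_ids, inverse)
```

Under this assumption, the benefit grows with the amount of repetition inside each batch, which is why large batch sizes are a natural setting for it.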