PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0
12k stars 2.93k forks

[Cherry Pick] Fix distributed batch sampler #8691

Closed JunnYu closed 3 months ago

JunnYu commented 3 months ago

PR types

Bug fixes

PR changes

APIs

Description

Round num_samples down so that prefetching does not read past the dataset's maximum length...

# In this case the calculation is wrong: rounding up gives 2848, which exceeds the dataset's maximum length of 2844
len(self.dataset) = 2844
self.nranks = 8
int(len(self.dataset) * 1.0 / self.nranks) * self.nranks = 2840
int(ceil(len(self.dataset) * 1.0 / self.nranks)) * self.nranks = 2848
# In this case the calculation is fine, because the length divides evenly
len(self.dataset) = 2844
self.nranks = 4
int(len(self.dataset) * 1.0 / self.nranks) * self.nranks = 2844
int(ceil(len(self.dataset) * 1.0 / self.nranks)) * self.nranks = 2844
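
The arithmetic above can be reproduced with a small standalone sketch. The helper name `total_samples` and its parameters are hypothetical and only illustrate the rounding difference; this is not the actual code in paddlenlp/utils/batch_sampler.py.

```python
def total_samples(dataset_len: int, nranks: int, round_up: bool) -> int:
    """Total number of samples consumed across all ranks (illustrative helper)."""
    if round_up:
        # Ceiling division: every rank gets the same count, padding if needed.
        per_rank = -(-dataset_len // nranks)
    else:
        # Floor division: every rank gets the same count, dropping the remainder.
        per_rank = dataset_len // nranks
    return per_rank * nranks


dataset_len = 2844
for nranks in (8, 4):
    floored = total_samples(dataset_len, nranks, round_up=False)
    ceiled = total_samples(dataset_len, nranks, round_up=True)
    # An index range that reaches past dataset_len can run off the end of the dataset.
    print(f"nranks={nranks}: floor={floored}, ceil={ceiled}, "
          f"ceil exceeds dataset: {ceiled > dataset_len}")
```

With 8 ranks, rounding up asks for 4 more samples than the dataset holds, while rounding down stays in bounds at the cost of skipping the last 4 samples. With 4 ranks both roundings agree.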
paddle-bot[bot] commented 3 months ago

Thanks for your contribution!

codecov[bot] commented 2 days ago

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 55.20%. Comparing base (0f428bb) to head (1926e7b). Report is 55 commits behind head on release/2.8.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| paddlenlp/utils/batch_sampler.py | 0.00% | 2 Missing :warning: |
Additional details and impacted files

```diff
@@               Coverage Diff               @@
##           release/2.8    #8691      +/-   ##
===============================================
+ Coverage        55.14%   55.20%   +0.06%
===============================================
  Files              608      611       +3
  Lines            94580    95055     +475
===============================================
+ Hits             52158    52478     +320
- Misses           42422    42577     +155
```
