PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[BUG] Round num_samples down so that prefetching does not exceed the dataset's maximum length... #8690

Closed JunnYu closed 3 months ago

JunnYu commented 3 months ago

PR types

Bug fixes

PR changes

APIs

Description

Round num_samples down so that prefetching does not exceed the dataset's maximum length...

# In this case the calculation is problematic: rounding up gives 2848, which exceeds the dataset's maximum length of 2844
len(self.dataset) = 2844
self.nranks = 8
int(len(self.dataset) * 1.0 / self.nranks) * self.nranks = 2840
int(ceil(len(self.dataset) * 1.0 / self.nranks)) * self.nranks = 2848
# In this case the calculation is fine, because the length divides evenly
len(self.dataset) = 2844
self.nranks = 4
int(len(self.dataset) * 1.0 / self.nranks) * self.nranks = 2844
int(ceil(len(self.dataset) * 1.0 / self.nranks)) * self.nranks = 2844
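
The two cases above can be reproduced with a small standalone sketch. This is only an illustration of the arithmetic, not the code in paddlenlp/utils/batch_sampler.py; the helper name `total_samples` and its `round_up` flag are hypothetical and exist only for this example.

```python
def total_samples(dataset_len: int, nranks: int, round_up: bool) -> int:
    """Return the total sample count across all ranks, as a multiple of nranks.

    Hypothetical helper for illustration only; it mirrors the floor/ceil
    expressions in the description using integer arithmetic.
    """
    if round_up:
        per_rank = -(-dataset_len // nranks)  # ceiling division
    else:
        per_rank = dataset_len // nranks      # floor division
    return per_rank * nranks


dataset_len = 2844
for nranks in (8, 4):
    down = total_samples(dataset_len, nranks, round_up=False)
    up = total_samples(dataset_len, nranks, round_up=True)
    # An index >= dataset_len would be out of range for the dataset.
    print(f"nranks={nranks}: floor={down}, ceil={up}, overflow={up > dataset_len}")
```

With 8 ranks this prints floor=2840, ceil=2848 and overflow=True; with 4 ranks both values are 2844 and overflow=False. The trade-off of rounding down is that the trailing `dataset_len % nranks` samples (4 in the 8-rank case) are not counted.
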
paddle-bot[bot] commented 3 months ago

Thanks for your contribution!

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 55.61%. Comparing base (2723138) to head (a2094dc). Report is 230 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| paddlenlp/utils/batch_sampler.py | 0.00% | 2 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           develop    #8690      +/-   ##
===========================================
- Coverage    55.61%   55.61%   -0.01%
===========================================
  Files          620      620
  Lines        96965    96964       -1
===========================================
- Hits         53930    53929       -1
  Misses       43035    43035
```
