OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

feat(model): Combined dataset to HF pipeline #316

Closed boss-chanon closed 11 months ago

boss-chanon commented 11 months ago

Why this PR

Merge multiple datasets into a single Hugging Face dataset, with a configurable weight for each dataset.
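For readers unfamiliar with weighted merging, here is a minimal sketch of the idea in plain Python. It is not the code from this PR: the function name `weighted_interleave`, the toy `thai` and `english` row lists, and the 0.8/0.2 weights are all hypothetical and chosen only for illustration. The Hugging Face `datasets` library offers `interleave_datasets` with a `probabilities` argument for this kind of mixing; the sketch below does not use it, so that it runs with the standard library alone.

```python
import random
from typing import Iterator, Sequence


def weighted_interleave(
    sources: Sequence[Sequence[dict]],
    weights: Sequence[float],
    seed: int = 0,
) -> Iterator[dict]:
    """Yield rows from `sources`, choosing the source of each row by `weights`.

    Stops as soon as the chosen source has no rows left, so no row repeats.
    """
    rng = random.Random(seed)
    cursors = [0] * len(sources)  # next unread row index per source
    while True:
        # Pick which source supplies the next row.
        idx = rng.choices(range(len(sources)), weights=weights, k=1)[0]
        if cursors[idx] >= len(sources[idx]):
            return  # the chosen source is exhausted
        yield sources[idx][cursors[idx]]
        cursors[idx] += 1


# Hypothetical toy data standing in for two real datasets.
thai = [{"source": "thai", "id": i} for i in range(1000)]
english = [{"source": "english", "id": i} for i in range(1000)]

mixed = list(weighted_interleave([thai, english], weights=[0.8, 0.2], seed=42))
thai_share = sum(row["source"] == "thai" for row in mixed) / len(mixed)
print(f"rows: {len(mixed)}, share from 'thai': {thai_share:.3f}")
```

Because each row's source is drawn at random, the share from each source should land near its weight rather than exactly on it.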

Changes

Related Issues

Close #

Checklist

new5558 commented 11 months ago

Can you test whether this data loader can actually load two datasets with the specified weights?

codecov[bot] commented 11 months ago

Codecov Report

All modified lines are covered by tests :white_check_mark:

Comparison is base (2f8f284) 94.15% compared to head (5a27d13) 19.39%. Report is 30 commits behind head on main.

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##             main     #316       +/-   ##
===========================================
- Coverage   94.15%   19.39%   -74.77%
===========================================
  Files          10       25       +15
  Lines         291     1392     +1101
===========================================
- Hits          274      270        -4
- Misses         17     1122     +1105
```

| [Flag](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/316/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/316/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | `19.39% <ø> (-74.77%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#carryforward-flags-in-the-pull-request-comment) to find out more.

[see 35 files with indirect coverage changes](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/316/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT)

boss-chanon commented 11 months ago

> Can you test whether this data loader can actually load two datasets with the specified weights?

After testing, this solution does not match the specified weights exactly, and some of the output data is duplicated.
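A possible explanation for both observations, sketched under assumptions since the loader code is not shown in this thread: if the source of each row is picked at random by weight, the observed proportions are only approximately equal to the weights, and if an exhausted dataset is restarted so that the larger one can finish, its rows repeat. The function `oversample_mix`, the toy `small` and `large` lists, and the 0.5/0.5 weights below are hypothetical.

```python
import random
from collections import Counter


def oversample_mix(sources, weights, seed=0):
    """Sample a source per row by weight until every source has been read
    through at least once.

    An exhausted source restarts from its first row, so its rows repeat.
    """
    rng = random.Random(seed)
    cursors = [0] * len(sources)
    finished = [False] * len(sources)
    out = []
    while not all(finished):
        idx = rng.choices(range(len(sources)), weights=weights, k=1)[0]
        out.append(sources[idx][cursors[idx]])
        cursors[idx] += 1
        if cursors[idx] == len(sources[idx]):
            finished[idx] = True
            cursors[idx] = 0  # wrap around: later draws repeat earlier rows
    return out


# Hypothetical toy data: one small and one large dataset.
small = [("small", i) for i in range(100)]
large = [("large", i) for i in range(1000)]

rows = oversample_mix([small, large], weights=[0.5, 0.5], seed=0)

counts = Counter(name for name, _ in rows)
observed_share = counts["small"] / len(rows)
duplicates = len(rows) - len(set(rows))
print(f"observed share of 'small': {observed_share:.3f} (requested 0.5)")
print(f"duplicated rows: {duplicates}")
```

In this sketch the duplicates come entirely from the wrap-around step. A loader that instead stops when the first dataset runs out would avoid repeats, at the cost of leaving part of the larger dataset unread.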

new5558 commented 11 months ago

> > Can you test whether this data loader can actually load two datasets with the specified weights?
>
> After testing, this solution does not match the specified weights exactly, and some of the output data is duplicated.

Approved.