OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

Mond/internet preprocessing doc #302

Closed: Chawak closed this pull request 11 months ago

Chawak commented 1 year ago

Why this PR?

README.md documentation for internet data pre-processing

Changes

Related Issues

Close #

Checklist

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage has no change and project coverage change: -0.80% :warning:

Comparison is base (c441682) 94.95% compared to head (6dd7d68) 94.15%. Report is 90 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #302      +/-   ##
==========================================
- Coverage   94.95%   94.15%   -0.80%
==========================================
  Files          12       10       -2
  Lines         337      291      -46
==========================================
- Hits          320      274      -46
  Misses         17       17
```

| [Flag](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/302/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/302/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | `94.15% <ø> (-0.80%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files Changed](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/302?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | Coverage Δ | |
|---|---|---|
| [...enthaigpt\_pretraining\_data/internet/mc4/pattern.py](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/302?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#diff-c3JjL2RhdGEvb3BlbnRoYWlncHRfcHJldHJhaW5pbmdfZGF0YS9pbnRlcm5ldC9tYzQvcGF0dGVybi5weQ==) | `100.00% <ø> (ø)` | |
| [...haigpt\_pretraining\_data/internet/mc4/preprocess.py](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/302?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#diff-c3JjL2RhdGEvb3BlbnRoYWlncHRfcHJldHJhaW5pbmdfZGF0YS9pbnRlcm5ldC9tYzQvcHJlcHJvY2Vzcy5weQ==) | `95.23% <ø> (ø)` | |
| [...haigpt\_pretraining\_data/internet/oscar/keywords.py](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/302?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#diff-c3JjL2RhdGEvb3BlbnRoYWlncHRfcHJldHJhaW5pbmdfZGF0YS9pbnRlcm5ldC9vc2Nhci9rZXl3b3Jkcy5weQ==) | `100.00% <ø> (ø)` | |
| [...igpt\_pretraining\_data/internet/oscar/preprocess.py](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/302?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#diff-c3JjL2RhdGEvb3BlbnRoYWlncHRfcHJldHJhaW5pbmdfZGF0YS9pbnRlcm5ldC9vc2Nhci9wcmVwcm9jZXNzLnB5) | `100.00% <ø> (ø)` | |
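The percentages in the coverage diff can be checked from its hit and line counts. The sketch below is only an illustration: the helper name `coverage_pct` is hypothetical, and truncating to two decimals is an assumption inferred from the reported figures (320/337 is about 94.9555%, which is shown as 94.95%), not confirmed Codecov behavior.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals.

    Truncation is an assumption chosen to match the figures in the report.
    Integer arithmetic avoids floating-point surprises before the final division.
    """
    return (hits * 10000 // lines) / 100


# Counts taken from the coverage diff: (hits, lines) for base and head.
base = coverage_pct(320, 337)   # base commit c441682
head = coverage_pct(274, 291)   # head commit 6dd7d68
delta = round(head - base, 2)   # project coverage change

# Misses are lines minus hits; they are unchanged between base and head.
base_misses = 337 - 320
head_misses = 291 - 274

print(f"base={base:.2f}% head={head:.2f}% delta={delta:+.2f}%")
print(f"misses: base={base_misses} head={head_misses}")
```

Under these assumptions the drop comes entirely from removing 46 covered lines: the 17 missed lines stay the same while the total shrinks, so the same misses weigh more.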
