OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0
21 stars, 10 forks

Mond/refactor internet LM-206 #319

Closed Chawak closed 10 months ago

Chawak commented 10 months ago

Why this PR

This PR addresses Linear issue LM-206, which refactors the Common Crawl dataset pipeline to use the latest `metadata.json` schema: https://linear.app/openthaigpt/issue/LM-206/refactor-common-crawl-dataset-pipeline-to-use-the-latest-metadatajson

Changes

Related Issues

Close #

Checklist

linear[bot] commented 10 months ago
LM-206 Refactor Common Crawl Dataset Pipeline to use the latest metadata.json schema

Please do:

1. Make `NUM_PROC` adjustable in the Hydra configuration file (`_config.yaml`).
2. Refactor all `metadata.json`, `info.json`, and `_config.yaml` create and read logic into [src/data/openthaigpt_pretraining_data/core](https://github.com/OpenThaiGPT/openthaigpt-pretraining/tree/main/src/data/openthaigpt_pretraining_data/core)/metadata.py.
3. Convert `metadata.json` to support the latest schema.
4. Push `core.zip` into DVC: `Before running, please download core.zip from this link and extract it in`. Contact new17353 for DVC push credentials.
5. Edit the README to reflect all changes: [https://github.com/OpenThaiGPT/openthaigpt-pretraining/tree/main/src/data/scripts/internet](https://github.com/OpenThaiGPT/openthaigpt-pretraining/tree/main/src/data/scripts/internet)

[image.png](https://uploads.linear.app/03a3f0b5-8e51-4d0f-918c-59e891b8184f/ffb51466-fa08-486c-9611-e6db88834428/0ab34655-017b-4327-9845-5cc36adc8134)
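As a rough illustration of the first two tasks, here is a minimal sketch of what a centralized metadata module might look like. It is not the code from this PR: the function names, the `num_proc` config key, and the example metadata fields are all hypothetical, and the real `metadata.json` schema is not shown in this thread. The config is treated as any mapping-like object (a plain dict here), and only the standard library is used.

```python
import json
import os
from pathlib import Path
from typing import Any, Mapping

# File name taken from the issue text; everything else below is an assumption.
METADATA_FILENAME = "metadata.json"


def resolve_num_proc(cfg: Mapping[str, Any]) -> int:
    """Return the worker count from the config, falling back to the CPU count.

    The key name "num_proc" is hypothetical; the issue only says NUM_PROC
    should be adjustable from the configuration file.
    """
    value = cfg.get("num_proc")
    if value is None:
        return os.cpu_count() or 1
    num_proc = int(value)
    if num_proc < 1:
        raise ValueError(f"num_proc must be >= 1, got {num_proc}")
    return num_proc


def write_metadata(output_dir: Path, metadata: Mapping[str, Any]) -> Path:
    """Write metadata.json into output_dir and return the file path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / METADATA_FILENAME
    # ensure_ascii=False keeps Thai text human-readable in the JSON file.
    path.write_text(
        json.dumps(dict(metadata), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return path


def read_metadata(output_dir: Path) -> dict:
    """Read metadata.json from output_dir and return it as a dict."""
    path = output_dir / METADATA_FILENAME
    return json.loads(path.read_text(encoding="utf-8"))
```

Keeping the read and write helpers in one module means each pipeline script calls the same functions instead of building the JSON itself, so a later schema change only has to be made in one place.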

codecov[bot] commented 10 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (a9ec1c7) 94.15% compared to head (149eec8) 94.15%. Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #319   +/-   ##
=======================================
  Coverage   94.15%   94.15%
=======================================
  Files          10       10
  Lines         291      291
=======================================
  Hits          274      274
  Misses         17       17
```

| [Flag](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/319/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/319/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | `94.15% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#carryforward-flags-in-the-pull-request-comment) to find out more.
