OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

Pipeline for Scrape Polity Data from Krisdika [LM-237] #334

Closed boss-chanon closed 7 months ago

boss-chanon commented 9 months ago

Why this PR

Scrape polity data from Krisdika.

Changes

Related Issues

Close #

Checklist

linear[bot] commented 9 months ago
LM-237 Scrape the Office of the Council of State (สำนักงานคณะกรรมการกฤษฎีกา) - Constitution (รัฐธรรมนูญ)

**Rationale**: Use the Office of the Council of State (สำนักงานคณะกรรมการกฤษฎีกา) as part of the Law dataset to expand our pre-trained model's knowledge base.

**Step by Step**

1. Download data from this website: [Constitution (รัฐธรรมนูญ)](https://www.krisdika.go.th/web/guest/law?p_p_id=LawPortlet_INSTANCE_aAN7C2U5hENi&p_p_state=normal&p_p_mode=view&\_LawPortlet_INSTANCE_aAN7C2U5hENi_javax.portlet.action=selectLawTypeMenu&\_LawPortlet_INSTANCE_aAN7C2U5hENi_lawTypeId=1&p_auth=Fxeer5Zp&p_p_lifecycle=0)
2. The information can be scraped in HTML format.
3. Download only the full version and the latest updated version (e.g. รัฐธรรมนูญแห่งราชอาณาจักรไทย [Constitution of the Kingdom of Thailand], รัฐธรรมนูญแห่งราชอาณาจักรไทย 2557 ฉบับ Update ล่าสุด [Constitution of the Kingdom of Thailand 2557, latest updated version]).
4. Convert the data into JSONL format: [image.png](https://uploads.linear.app/03a3f0b5-8e51-4d0f-918c-59e891b8184f/22c6dff0-50fa-475e-abbe-ffdaec4b6416/3de4a639-81ac-4a35-82ac-3a67c5bf6097)
5. Open a pull request to our GitHub repository.

Reviewers: kwankoravich
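Steps 2 and 4 of the ticket (HTML in, JSONL out) could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code from this PR. The helper names `html_to_text` and `to_jsonl` and the record fields `title` and `text` are hypothetical, since the target schema is only shown in the linked image, and the sample HTML is made up rather than taken from the Krisdika site.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: reduce an HTML page to newline-joined text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def to_jsonl(records):
    """Hypothetical helper: serialize records as one JSON object per line."""
    # ensure_ascii=False keeps Thai characters readable instead of \u escapes.
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Made-up sample page; the real pages would be downloaded in step 1.
sample_html = (
    "<html><body><h1>รัฐธรรมนูญแห่งราชอาณาจักรไทย</h1>"
    "<script>var x = 1;</script><p>มาตรา ๑</p></body></html>"
)
record = {
    "title": "รัฐธรรมนูญแห่งราชอาณาจักรไทย",  # assumed field name
    "text": html_to_text(sample_html),        # assumed field name
}
jsonl = to_jsonl([record])
```

Writing `jsonl` to a file opened with `encoding="utf-8"` would give one document per line, which is the usual shape for pre-training corpora.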

codecov[bot] commented 9 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (3b893c9) 64.16% compared to head (f9b3b08) 64.16%. Report is 9 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #334   +/-   ##
=======================================
  Coverage   64.16%   64.16%
=======================================
  Files          11       11
  Lines         427      427
=======================================
  Hits          274      274
  Misses        153      153
```

| [Flag](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/334/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/334/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | `64.16% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#carryforward-flags-in-the-pull-request-comment) to find out more.


boss-chanon commented 7 months ago

This PR can no longer be used because the website's UI has changed.