OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

Pipeline for Scraping King Data from Krisdika [LM-238] #335

Closed boss-chanon closed 7 months ago

boss-chanon commented 9 months ago

Why this PR

Scrape king data from Krisdika. This PR continues from #334.

Changes

Related Issues

Close #

Checklist

linear[bot] commented 9 months ago
LM-238 Scrape Office of the Council of State (สำนักงานคณะกรรมการกฤษฎีกา) - พระราชบัญญัติ (Acts)

**Rationale**: Use the Office of the Council of State (สำนักงานคณะกรรมการกฤษฎีกา) as part of the Law dataset to expand our pre-trained model's knowledge base.

**Step by Step**

1. Download data from this website: [พระราชบัญญัติ (Acts)](https://www.krisdika.go.th/web/guest/law?p_p_id=LawPortlet_INSTANCE_aAN7C2U5hENi&p_p_state=normal&p_p_mode=view&\_LawPortlet_INSTANCE_aAN7C2U5hENi_javax.portlet.action=selectLawTypeMenu&\_LawPortlet_INSTANCE_aAN7C2U5hENi_lawTypeId=2&p_auth=Fxeer5Zp&p_p_lifecycle=0)
2. Select "ตามหัวเรื่อง" ("By topic") and download every topic and sub-topic we can
3. Scrape every document available on the website
4. Extract the text from the PDF files
5. Convert the data into JSONL format [image.png](https://uploads.linear.app/03a3f0b5-8e51-4d0f-918c-59e891b8184f/fddb5be0-0169-44b8-ab8b-e88666fd8777/475103ae-d36d-4170-9b68-44d87fe1ec01)
6. Open a pull request to our GitHub repository

Reviewers: kwankoravich
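Step 5 (converting the data to JSONL) could look roughly like the sketch below. It is not the code from this PR. The record fields `title` and `text` are hypothetical, since the target schema is only shown in the linked image, and the text is assumed to have already been extracted from the PDFs in step 4.

```python
import json
from pathlib import Path


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line."""
    # ensure_ascii=False keeps Thai characters readable instead of \uXXXX escapes.
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return json.dumps(record, ensure_ascii=False)


def write_jsonl(records, path) -> int:
    """Write records to `path`, one JSON object per line; return how many were written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(to_jsonl_line(record) + "\n")
            count += 1
    return count


def read_jsonl(path) -> list:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Writing with `encoding="utf-8"` and `ensure_ascii=False` keeps the Thai text readable in the output file, which makes manual spot checks of the scraped documents easier.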

codecov[bot] commented 9 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (3b893c9) 64.16% compared to head (cf01588) 64.16%. Report is 9 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #335   +/-   ##
=======================================
  Coverage   64.16%   64.16%
=======================================
  Files          11       11
  Lines         427      427
=======================================
  Hits          274      274
  Misses        153      153
```

| [Flag](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/335/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/335/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | `64.16% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#carryforward-flags-in-the-pull-request-comment) to find out more.


boss-chanon commented 7 months ago

This PR can't be used because the UI was changed.