OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

Scrape Admincourt Pipeline [LM-244] #332

Open boss-chanon opened 9 months ago

boss-chanon commented 9 months ago

Why this PR

Add a pipeline for scraping data from admincourt.

Changes

Related Issues

Close #

Checklist

linear[bot] commented 9 months ago
LM-244 Scrape The Administrative Court (ศาลปกครอง) - Academic Articles (บทความวิชาการ)

**Rationale**: Use The Administrative Court (ศาลปกครอง) as part of the Law dataset to expand the knowledge base of our pre-trained model.

**Step by Step**

1. Download data from this website: [Academic Articles (บทความวิชาการ)](https://www.admincourt.go.th/admincourt/site/09articleacademic.html)
2. The information can be scraped in PDF format.
3. Scrape all documents available in this sub-section.
4. Exclude the preface (คำนำ) and the table of contents (สารบัญ).
5. Convert the data into JSONL format: [image.png](https://uploads.linear.app/03a3f0b5-8e51-4d0f-918c-59e891b8184f/22c6dff0-50fa-475e-abbe-ffdaec4b6416/3de4a639-81ac-4a35-82ac-3a67c5bf6097)
6. Open a pull request to our GitHub repository.

Reviewers: kwankoravich
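The exclusion and JSONL conversion steps of the issue could look roughly like the sketch below. It is not the code in this PR. It assumes the text of each PDF page has already been extracted by some PDF library, and the names `FRONT_MATTER_MARKERS`, `is_front_matter`, `document_to_record`, `to_jsonl`, and the record fields `title`, `source`, `text` are all hypothetical; the actual target schema is the one shown in the linked image.

```python
import json

# Hypothetical heading markers for pages to drop: preface and table of contents.
FRONT_MATTER_MARKERS = ("คำนำ", "สารบัญ")


def is_front_matter(page_text: str) -> bool:
    """Return True if the first non-empty line of a page is a preface/TOC heading."""
    for line in page_text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped.startswith(FRONT_MATTER_MARKERS)
    return False


def document_to_record(title: str, source_url: str, pages: list[str]) -> dict:
    """Build one record from already-extracted page texts, skipping front matter.

    The field names here are assumptions, not the project's confirmed schema.
    """
    body = [p.strip() for p in pages if p.strip() and not is_front_matter(p)]
    return {"title": title, "source": source_url, "text": "\n".join(body)}


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one JSON object per line, Thai text kept readable."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Matching on the first non-empty line is a deliberately simple heuristic: a table of contents that spans several pages, or a preface with a different heading, would need a more careful rule.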

codecov[bot] commented 9 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (4d5c647) 64.47% compared to head (646c304) 64.47%. Report is 9 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #332   +/-   ##
=======================================
  Coverage   64.47%   64.47%
=======================================
  Files          11       11
  Lines         425      425
=======================================
  Hits          274      274
  Misses        151      151
```

| [Flag](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/332/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/332/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | `64.47% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#carryforward-flags-in-the-pull-request-comment) to find out more.
