Scrape SET Annual Report [LM-232]

boss-chanon commented 12 months ago

Why this PR

Add a pipeline that scrapes data from the SET Annual Report.

The list of companies listed in SET can be found here

Changes

Add a pipeline for scraping data from the SET Annual Report
Add a pipeline for converting PDF data to JSONL
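
For illustration only, the PDF-to-JSONL conversion could look roughly like the sketch below. It is not the code in this PR. It assumes the text has already been extracted page by page (for example with PyPDF), and the record fields `symbol`, `year`, and `text` as well as both function names are hypothetical.

```python
import json
from typing import Iterable, List


def pages_to_record(symbol: str, year: int, pages: List[str]) -> dict:
    """Join per-page text into one record (field names are hypothetical)."""
    # Drop empty pages and trim surrounding whitespace on each page.
    text = "\n".join(p.strip() for p in pages if p and p.strip())
    return {"symbol": symbol, "year": year, "text": text}


def write_jsonl(records: Iterable[dict], path: str) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Thai text readable instead of \u escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Writing one object per line keeps the output streamable, so a later step can read reports one at a time without loading the whole file.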

Related Issues

Close #

Checklist

[ ] PR title should follow the naming convention
[ ] Assign yourself to Assignees
[ ] Tag related issues
[ ] Constant names should be ALL_CAPITAL, function names should be snake_case, and class names should be CamelCase
[ ] Complex functions/algorithms should have a docstring
[ ] One PR should not change more than 200 lines (test files excepted). If it does, please open multiple PRs
[ ] At least one PR reviewer must come from the task's team (model, eval, data)

linear[bot] commented 12 months ago

LM-232 Scrape SET Annual Report

**Rationale**: We would like to use the SET Annual Report to train the model.

**Original source format**: PDF

**Step by Step**

1. Download the list of companies listed in SET, which can be found [here](https://drive.google.com/file/d/1OGrFQ6Cpp3-olLMiqwsNGpeQZfeMyNrc/view?usp=sharing)
2. Check the link to scrape the News and download by listed company
   1. Example link [here](https://www.set.or.th/th/market/product/stock/quote/PTT/company-profile/information)
   2. Download from the section แบบแสดงรายการข้อมูลประจำปี/รายงานประจำปี (Annual Registration Statement / Annual Report)
   3. The data should be collected 5 years back (2023-2018)
3. Write a Python script to scrape the PDF files (using BeautifulSoup or any related library)
4. Extract text from the PDFs to strings (using PyPDF or any related library)
5. Convert the text into a JSONL structure
6. Open a pull request against our GitHub repository

**Reviewer:** kwankoravich
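
As a rough sketch of the link-collection part of step 3, the snippet below gathers PDF links from an already-downloaded HTML page. It uses the standard-library `html.parser` as a stand-in for BeautifulSoup. It assumes the report links appear as plain `<a href>` anchors in the fetched HTML, which may not hold for the real SET page. `PdfLinkCollector` and `extract_pdf_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href> targets that point at PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            url = urljoin(self.base_url, href)  # resolve relative links
            if url not in self.links:  # skip duplicates, keep page order
                self.links.append(url)


def extract_pdf_links(html, base_url):
    """Return de-duplicated PDF URLs found in `html`, in page order."""
    parser = PdfLinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

The returned URLs can then be downloaded one by one and passed to the text-extraction step.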

codecov[bot] commented 12 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (5fbee96) 64.16% compared to head (5239299) 64.16%. Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #329   +/-   ##
=======================================
  Coverage   64.16%   64.16%
=======================================
  Files          11       11
  Lines         427      427
=======================================
  Hits          274      274
  Misses        153      153
```

| [Flag](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/329/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/329/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | `64.16% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#carryforward-flags-in-the-pull-request-comment) to find out more.


OpenThaiGPT / openthaigpt-pretraining