OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

Scrape SET Annual Report [LM-232] #329

Open boss-chanon opened 9 months ago

boss-chanon commented 9 months ago

Why this PR

Adds a pipeline that scrapes data from SET annual reports.

Changes

Related Issues

Close #

Checklist

linear[bot] commented 9 months ago
LM-232 Scrape SET Annual Report

**Rationale**: We would like to use the SET Annual Reports to train the model.

**Original source format**: PDF

**Step by Step**

1. Download the list of companies listed on the SET, which can be found [here](https://drive.google.com/file/d/1OGrFQ6Cpp3-olLMiqwsNGpeQZfeMyNrc/view?usp=sharing).
2. Check the link to scrape the News and download by listed company.
   1. Example link [here](https://www.set.or.th/th/market/product/stock/quote/PTT/company-profile/information).
   2. Download from the section แบบแสดงรายการข้อมูลประจำปี/รายงานประจำปี (Annual Registration Statement / Annual Report).
   3. The data should be collected 5 years back (2023-2018).
3. Write a Python script to scrape the PDF files (using BeautifulSoup or any related library).
4. Extract the text from each PDF into a string (using PyPDF or any related library).
5. Convert the text into a JSONL structure.
6. Open a pull request against our GitHub repository.

**Reviewer:** kwankoravich
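
Steps 3 to 5 of the ticket can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: it swaps BeautifulSoup for the standard-library `html.parser` so it runs without extra dependencies, it assumes the report links appear as plain `<a href="...pdf">` tags in the fetched HTML (the real SET page may render them differently), and the function names and JSONL field names (`symbol`, `year`, `source`, `text`) are hypothetical. Downloading the files is left out.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags whose href ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Step 3 (assumed page layout): list PDF links found in an HTML page."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def pdf_to_text(path):
    """Step 4: extract text from every page of a PDF using pypdf."""
    # Imported lazily so the rest of the sketch runs without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() can return an empty result for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def to_jsonl_line(symbol, year, source_url, text):
    """Step 5: serialize one report as a single JSONL line.

    The field names are hypothetical; ensure_ascii=False keeps Thai
    characters readable instead of escaping them.
    """
    record = {"symbol": symbol, "year": year, "source": source_url, "text": text}
    return json.dumps(record, ensure_ascii=False)
```

A caller would append each `to_jsonl_line(...)` result plus a trailing newline to an output file, one line per report. Because `json.dumps` escapes newlines inside `text`, every record stays on a single line, which is what JSONL requires.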

codecov[bot] commented 9 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (5fbee96) 64.16% compared to head (5239299) 64.16%. Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #329   +/-   ##
=======================================
  Coverage   64.16%   64.16%
=======================================
  Files          11       11
  Lines         427      427
=======================================
  Hits          274      274
  Misses        153      153
```

| [Flag](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/329/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/329/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | `64.16% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#carryforward-flags-in-the-pull-request-comment) to find out more.
