Open boss-chanon opened 12 months ago
**Rationale**: We would like to use SET annual reports to train the model.

**Original source format**: PDF

**Step by Step**

1. Download the list of companies listed on the SET, available [here](https://drive.google.com/file/d/1OGrFQ6Cpp3-olLMiqwsNGpeQZfeMyNrc/view?usp=sharing).
2. Use the link for each listed company to scrape the news and download the files.
   1. Example link [here](https://www.set.or.th/th/market/product/stock/quote/PTT/company-profile/information)
   2. Download from the section "Annual Registration Statement / Annual Report" (แบบแสดงรายการข้อมูลประจำปี/รายงานประจำปี).
   3. The data should be collected 5 years back (2023-2018).
3. Write a Python script to scrape the PDF files (using BeautifulSoup or any related library).
4. Extract the text from each PDF into a string (using PyPDF or any related library).
5. Convert the text into a JSONL structure.
6. Open a pull request against our GitHub repository.

**Reviewer:** kwankoravich
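Steps 4 and 5 above could be sketched roughly as follows. This is only a minimal illustration, not the project's actual pipeline: it assumes the `pypdf` package is installed, and the helper names (`extract_pdf_text`, `make_record`, `write_jsonl`) and the record fields (`symbol`, `year`, `source`, `text`) are hypothetical, since the issue does not specify a schema.

```python
import json


def extract_pdf_text(pdf_path):
    """Return the text of every page in a PDF, joined by newlines."""
    # Imported lazily so the JSONL helpers below work without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # Scanned (image-only) pages may produce no text, so fall back to "".
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def make_record(symbol, year, text, source):
    """Build one record; the field names are illustrative assumptions."""
    return {"symbol": symbol, "year": year, "source": source, "text": text.strip()}


def write_jsonl(records, out_path):
    """Write one JSON object per line, keeping Thai characters unescaped."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A caller would typically loop over the downloaded PDFs, pass each through `extract_pdf_text`, wrap the result with `make_record`, and hand the list to `write_jsonl`. Passing `ensure_ascii=False` keeps Thai text readable in the output file instead of turning it into `\uXXXX` escapes.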
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (5fbee96) 64.16% compared to head (5239299) 64.16%. Report is 1 commit behind head on main.
Why this PR
Scrape data from the SET Annual Report pipeline.
Changes
Related Issues
Close #
Checklist