Open uchihatashi opened 2 months ago
@TenzinGayche @kaldan007 Two of the websites are currently not loading. Kindly verify.
All files have been pushed to s3://tibetan-news-data/.
The latest news articles are in the folder new_news_Articles.
@TenzinGayche @kaldan007
@kaldan007
All the files have been updated to the tibetan-news-data
repository, with a total size of 2GB.
@kaldan007 @TenzinGayche
All the websites have been extracted and pushed to S3. The remaining websites, tbwriters and teducn, have also been pushed to S3.
@kaldan007 @TenzinGayche
Description:
We have several websites containing Tibetan literature data that need to be scraped to gather as much useful data as possible for training our LLM. The task involves extracting not only the core text but also comprehensive metadata, including date, genre/categories/tags, and other relevant details.
This work is a continuation of MT0026.
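As a rough illustration of the metadata described above, one scraped article could be stored as a single JSON line. This is only a sketch: the class name `ArticleRecord`, the helper `to_json_line`, and all field names are assumptions made for this example, not the schema actually used in the tibetan-news-data bucket.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ArticleRecord:
    """One scraped article plus its metadata (hypothetical field names)."""
    url: str
    title: str
    body: str
    date: Optional[str] = None  # publication date as found on the page, if any
    categories: List[str] = field(default_factory=list)  # genre / categories
    tags: List[str] = field(default_factory=list)
    extra: dict = field(default_factory=dict)  # other site-specific details


def to_json_line(record: ArticleRecord) -> str:
    """Serialize a record as one JSON line, keeping Tibetan text unescaped."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

Passing `ensure_ascii=False` keeps Tibetan script readable in the output file instead of writing it as `\uXXXX` escapes.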
Websites:
[x] མཆོད་མེ་བོད་ཀྱི་རྩོམ་རིག་དྲ་བ། https://www.tibetcm.com/ (Done)
all categories in one shot
last page is labelled [མཇུག་ངོས།] ("last page") with the correct index
[x] རྩོམ་པ་པོ་དྲ་བ། https://www.tbwriters.com/ (Extraction not working)
[x] ན་གཞོན་གསར་པ། www.tibyouth.com (Not loading)
[x] ཁ་བརྡ་དྲ་བ། https://www.khabdha.org/ (Done)
simple process
old technique
[x] ཁམས་པའི་དྲ་བ། http://ti.kbcmw.com/html/WenXue/ (Done)
sub categories
last page with reversed index
[ ] ཡུལ་ལྷུང་བླ་མཚོ།
[x] མཚོ་སྔོན་བོད་ཡིག་ཟིན་བྲིས། http://blog.amdotibet.cn/ (Not loading)
[ ] དམུ་རྒོད་རྩོམ་རིགས་དྲ་བ།
[x] ཀྲུང་གོའི་བོད་ཀྱི་དྲ་བ། http://tb.tibet.cn/tb/literature/ (Done)
simple process
old technique
[x] http://www.teducn.com/
[x] https://ti.zangdiyg.com/ (95% done)
Has multiple sub categories within sub categories
Page with correct index [78] (end page)
Some pages are very large (8,000 records) and take 24+ hours
[x] https://www.tb1025.cn/ (Done)
Has multiple sub-categories within sub-categories
Many 403 errors due to IP blocking
[x] http://www.shangri-latibet.cn/ (Done)
Menu categories only; paging is done in JS, but the page has a hidden last record.
Multiple server errors, such as: "An error occurred while fetching the article: 500 Server Error: Internal Server Error for url:"
[x] http://www.tbmgar.com/ (Done)
sub categories
simple page number
[x] https://tb.kangbatv.com/ (Done)
menu and sub menu
[x] http://xizang.news.cn/ (Not loading)
JSON-based response.
The website loads, but none of the articles load.
[x] http://www.tibetcnr.com/ (Done)
simple process
new technique
[x] https://tb.xzxw.com/ (Done)
Menu with sub-menus.
All page numbers are listed at the end.
[x] http://www.jmjzjy.com/index.asp?Zcyr=Sherap_TB
[x] https://sertha.net/ (Done)
only menu
new technique
[x] http://www.tongdrol.com/
Issues with page load
[x] https://bo.wikipedia.org/wiki/%E0%BD%82%E0%BD%99%E0%BD%BC%E0%BC%8B%E0%BD%84%E0%BD%BC%E0%BD%A6%E0%BC%8D (Done)
new process
new data structure
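Several sites above reported 403 responses from IP blocking and intermittent 500 server errors. A minimal sketch of one way to handle these is to retry with exponential backoff. The function name `fetch_with_retry`, the set of retryable status codes, and the delay values are all assumptions for illustration; the `fetch` callable stands in for whatever HTTP client the scraper actually uses. Backoff alone may not clear a long-lived IP block.

```python
import time
from typing import Callable, Optional, Tuple

# Assumed set of statuses worth retrying; adjust per site.
RETRY_STATUSES = {403, 500, 502, 503}


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the page body, or None if the page could not be fetched.

    `fetch` is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRY_STATUSES:
            return None  # e.g. 404: retrying will not help
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    return None
```

Injecting `fetch` and `sleep` keeps the helper independent of any particular HTTP library and lets it be exercised without network access or real waiting.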