PMA0009: Scraping Tibetan creative writing websites (MM24)

uchihatashi commented 2 months ago

Description:

We have several websites containing Tibetan literature that need to be scraped to gather as much valuable data as possible for training our LLM. The task involves not only extracting the core text but also collecting comprehensive metadata, including date, genre/categories/tags, and other relevant details.
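As a rough sketch of what one scraped record could look like, the snippet below bundles the article text with the metadata listed above. The class name `ArticleRecord`, its field names, and the JSON Lines output are illustrative assumptions, not a format this issue prescribes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class ArticleRecord:
    """One scraped article plus its metadata (hypothetical field names)."""
    url: str
    title: str
    text: str
    date: Optional[str] = None                            # as published, if the site shows one
    categories: List[str] = field(default_factory=list)   # genre / category / sub-category path
    tags: List[str] = field(default_factory=list)
    extra: Dict[str, str] = field(default_factory=dict)   # any other site-specific details


def to_jsonl_line(record: ArticleRecord) -> str:
    """Serialize one record as a single JSON line."""
    # ensure_ascii=False keeps Tibetan script readable instead of \uXXXX escapes.
    return json.dumps(asdict(record), ensure_ascii=False)
```

Keeping site-specific fields in `extra` would let every scraper emit the same top-level shape even when sites expose different metadata.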

This work is a continuation of MT0026.

Websites:

[x] མཆོད་མེ་བོད་ཀྱི་རྩོམ་རིག་དྲ་བ། https://www.tibetcm.com/ (Done)
all categories scraped in one pass
last-page link is labelled [མཇུག་ངོས།] ("last page") and has the correct index
[x] རྩོམ་པ་པོ་དྲ་བ། https://www.tbwriters.com/ (Extraction not working)
[x] ན་གཞོན་གསར་པ། www.tibyouth.com (Not loading)
[x] ཁ་བརྡ་དྲ་བ། https://www.khabdha.org/ (Done)
simple process
old technique
[x] ཁམས་པའི་དྲ་བ། http://ti.kbcmw.com/html/WenXue/ (Done)
sub categories
last page with reversed index
[ ] ཡུལ་ལྷུང་བླ་མཚོ།
[x] མཚོ་སྔོན་བོད་ཡིག་ཟིན་བྲིས། http://blog.amdotibet.cn/ (Not loading)
[ ] དམུ་རྒོད་རྩོམ་རིགས་དྲ་བ།
[x] ཀྲུང་གོའི་བོད་ཀྱི་དྲ་བ། http://tb.tibet.cn/tb/literature/ (Done)
simple process
old technique
[x] http://www.teducn.com/
[x] https://ti.zangdiyg.com/ (95% done)
has multiple sub-categories nested within sub-categories
end page has the correct index [78]
some pages are very large (8,000 records) and take 24+ hours
[x] https://www.tb1025.cn/ (Done)
has multiple sub-categories nested within sub-categories
many 403 errors due to IP blocking
[x] http://www.shangri-latibet.cn/ (Done)
menu categories only; pages are JS-driven, but the last record is present in a hidden element
multiple server errors, e.g. "An error occurred while fetching the article: 500 Server Error: Internal Server Error for url:"
[x] http://www.tbmgar.com/ (Done)
sub categories
simple page number
[x] https://tb.kangbatv.com/ (Done)
menu and sub menu
[x] http://xizang.news.cn/ (Not loading)
returns JSON-based responses
the website loads, but none of the articles load
[x] http://www.tibetcnr.com/ (Done)
simple process
new technique
[x] https://tb.xzxw.com/ (Done)
menu with sub-menu
all page numbers at the end
[x] http://www.jmjzjy.com/index.asp?Zcyr=Sherap_TB
[x] https://sertha.net/ (Done)
only menu
new technique
[x] http://www.tongdrol.com/
Issues with page load
[x] https://bo.wikipedia.org/wiki/%E0%BD%82%E0%BD%99%E0%BD%BC%E0%BC%8B%E0%BD%84%E0%BD%BC%E0%BD%A6%E0%BC%8D (Done)
new process
new data structure
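
Several notes above mention 403 responses from IP blocking and intermittent 500 server errors. One common way to cope is to retry with exponential backoff. The snippet below is a minimal sketch of that idea, not the code used for this task: `fetch_with_retry`, the `RETRYABLE` status set, and the delay values are assumptions, and `fetch` stands in for whatever HTTP client the scraper uses (any callable returning a status code and a body).

```python
import time
from typing import Callable, Tuple

# Status codes treated as temporary (assumption: 403 here means a rate/IP block).
RETRYABLE = {403, 429, 500, 502, 503}


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url) until it returns 200, backing off between attempts."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"{url}: unrecoverable status {status}")
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"{url}: gave up after {max_attempts} attempts")
```

If a 403 comes from a hard IP block, waiting alone may not help; in that case the same wrapper could be combined with a lower request rate or a different outbound address.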