OpenPecha / tibetan-news-article-scraping

PMA0009: Scraping Tibetan creative writing websites (MM24) #4

Open uchihatashi opened 2 months ago

uchihatashi commented 2 months ago

Description:

We have several websites containing Tibetan literature that need to be scraped to gather as much useful data as possible for training our LLM. The task involves extracting not only the core text but also comprehensive metadata, including date, genre/categories/tags, and other relevant details.
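A minimal sketch of how one scraped article and its metadata could be stored, assuming a JSON Lines output. The `ArticleRecord` class, its field names, and `to_json_line` are hypothetical and only illustrate the metadata listed above; they are not taken from the project's code.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ArticleRecord:
    """One scraped article plus the metadata named in the description (hypothetical schema)."""
    url: str
    title: str
    body: str
    date: Optional[str] = None  # publication date as found on the page, if any
    categories: List[str] = field(default_factory=list)  # genre/categories
    tags: List[str] = field(default_factory=list)
    extra: dict = field(default_factory=dict)  # any other relevant details


def to_json_line(record: ArticleRecord) -> str:
    """Serialize a record as one JSON line, keeping Tibetan text unescaped."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

Writing one record per line keeps large scrapes easy to append to and to stream, and `ensure_ascii=False` keeps the Tibetan script readable in the output file.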

This work is a continuation of MT0026.

Websites:

uchihatashi commented 2 months ago

@TenzinGayche @kaldan007 Two of the websites are currently not loading. Kindly verify.

  1. http://blog.amdotibet.cn/ (Not loading)
  2. www.tibyouth.com (Not loading)
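A quick reachability check for sites like the two above could look like the sketch below. The helper names `normalize_url` and `is_reachable` are hypothetical, and the 10-second timeout is an assumption; the code uses only the Python standard library.

```python
import urllib.error
import urllib.request


def normalize_url(raw: str) -> str:
    """Add a default http:// scheme when the address has none (e.g. 'www.tibyouth.com')."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "http://" + raw
    return raw


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the site answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(normalize_url(url), timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failures, refused connections, timeouts, HTTP errors, and malformed URLs all land here.
        return False
```

For example, `is_reachable("www.tibyouth.com")` would return `False` while the site is down. A failed check from one network does not prove the site is offline everywhere, so it is still worth verifying from another location.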

uchihatashi commented 2 months ago

All files have been pushed to s3://tibetan-news-data/.

The latest news articles are in the new_news_Articles folder.

@TenzinGayche @kaldan007

uchihatashi commented 1 month ago

@kaldan007

uchihatashi commented 1 month ago

All the files have been updated in the tibetan-news-data repository, with a total size of 2 GB.

@kaldan007 @TenzinGayche

uchihatashi commented 1 month ago

All the websites have been extracted and pushed to S3. The remaining websites, tbwriters and teducn, have also been pushed to S3.

@kaldan007 @TenzinGayche