OpenPecha / tibetan-news-article-scraping

DA009: Tibetan news article scraping #2

Open uchihatashi opened 2 months ago

Objective

Develop scripts to efficiently scrape Tibetan news articles from multiple sources, starting with the Voice of Tibet (VOT) website, and store them in a structured format for training a machine translation model.

Description

We need Tibetan news articles for training our machine translation model. This task involves creating scripts to collect articles from various Tibetan news websites, beginning with VOT, and organizing them in a clear, structured format.

Completion Criteria

Tibetan News Websites

Subtasks

  1. Implement a function to collect article links from a website
  2. Implement a function to extract detailed information from individual articles
  3. Extend the existing code to handle other Tibetan news websites
  4. Store the collected articles in a clear, structured JSON format

Data Structure

The article links scraped from each page are stored in a dictionary with the following structure:

{
    "Links": List[str],
    "Message": str,
    "Response": int
}
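
As a rough illustration of how one listing page could be turned into this dictionary, here is a minimal sketch using only the Python standard library. The names `_LinkCollector` and `build_link_result` are hypothetical, the message strings are placeholders, and it assumes "Response" holds the HTTP status code of the page request. It collects every anchor on the page; a real scraper for VOT would likely need a site-specific filter so that only article links are kept.

```python
from html.parser import HTMLParser
from typing import Dict, List, Union
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # keep page order, drop duplicates
                self.links.append(absolute)


def build_link_result(
    html: str, base_url: str, status_code: int
) -> Dict[str, Union[List[str], str, int]]:
    """Hypothetical helper: wrap the links of one page in the structure above.

    Assumes `status_code` is the HTTP status of the request that fetched `html`.
    """
    if status_code != 200:
        return {
            "Links": [],
            "Message": f"Request failed with status {status_code}",
            "Response": status_code,
        }
    parser = _LinkCollector(base_url)
    parser.feed(html)
    message = "Success" if parser.links else "No links found"
    return {"Links": parser.links, "Message": message, "Response": status_code}


# Usage with an inline HTML snippet instead of a live request
sample_html = '<a href="/news/1">one</a><a href="/news/2">two</a><a href="/news/1">dup</a>'
link_result = build_link_result(sample_html, "https://example.org", 200)
```

Keeping "Message" and "Response" next to "Links" lets the caller tell an empty page apart from a failed request without raising exceptions inside the crawl loop.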

The scraped data for each article is stored in a dictionary with the following structure:

{
    "data": {
        "title": str,
        "body": {
            "Audio": str,
            "Text": List[str]
        },
        "meta_data": {
            "Author": str,
            "Date": str,
            "Tags": List[str],
            "URL": str
        }
    },
    "Message": str,
    "Response": int
}
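
The structure above could be expressed with `TypedDict` and written out as JSON roughly as follows. This is only a sketch: the class names, `make_article_result`, and `to_json` are hypothetical and not taken from the repository, and the default message and status values are assumptions. `ensure_ascii=False` is used so that Tibetan script is stored as readable text rather than `\uXXXX` escapes.

```python
import json
from typing import List, TypedDict


class Body(TypedDict):
    Audio: str          # URL of the audio file, empty string if none (assumption)
    Text: List[str]     # one entry per paragraph


class MetaData(TypedDict):
    Author: str
    Date: str
    Tags: List[str]
    URL: str


class ArticleData(TypedDict):
    title: str
    body: Body
    meta_data: MetaData


class ArticleResult(TypedDict):
    data: ArticleData
    Message: str
    Response: int


def make_article_result(
    title: str,
    paragraphs: List[str],
    url: str,
    author: str = "",
    date: str = "",
    tags: List[str] = (),
    audio: str = "",
    message: str = "Success",
    status_code: int = 200,
) -> ArticleResult:
    """Hypothetical helper: assemble one scraped article in the structure above."""
    return {
        "data": {
            "title": title,
            "body": {"Audio": audio, "Text": list(paragraphs)},
            "meta_data": {
                "Author": author,
                "Date": date,
                "Tags": list(tags),
                "URL": url,
            },
        },
        "Message": message,
        "Response": status_code,
    }


def to_json(result: ArticleResult) -> str:
    """Serialize without escaping non-ASCII characters."""
    return json.dumps(result, ensure_ascii=False, indent=4)


# Usage with placeholder values
article = make_article_result(
    title="བོད་",
    paragraphs=["བོད་"],
    url="https://example.org/news/1",
    tags=["news"],
)
article_json = to_json(article)
```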

Implementation Details

extract_all_article Function

scrape_vot_article Function

Key Features

Implementation Notes

Resources