Objective

Develop scripts to efficiently scrape Tibetan news articles from multiple sources, starting with the Voice of Tibet (VOT) website, and store them in a structured format for training a machine translation model.

Description

We need Tibetan news articles for training our machine translation model. This task involves creating scripts to collect articles from various Tibetan news websites, beginning with VOT, and organizing them in a clear, structured format.

Completion Criteria

Scripts can scrape all Tibetan news articles from VOT and the other listed sources.
Collected articles are stored in a structured format (JSON) suitable for machine translation training.
Audio and metadata are collected where available.

Tibetan News Websites

https://vot.org/ (implemented)
https://tibettimes.net/ (implemented)
https://www.voatibetan.com/ (in progress)
https://www.rfa.org/tibetan (in progress)
http://bangchen.net/ (dead site)
bangchen.tibetexpress.net (in progress)
https://www.gyalwarinpoche.com/

Subtasks

Implement a function to collect article links from a website
Implement a function to extract detailed information from individual articles
Extend the existing code to handle other Tibetan news websites
Store the collected articles in a clear, structured JSON format

Data Structure

The article links scraped from each page are stored in a dictionary with the following structure:

{
    "Links": List[str],
    "Message": str,
    "Response": int
}

The scraped data for each article is stored in a dictionary with the following structure:

{
    "data": {
        "title": str,
        "body": {
            "Audio": str,
            "Text": List[str]
        },
        "meta_data": {
            "Author": str,
            "Date": str,
            "Tags": List[str],
            "URL": str
        }
    },
    "Message": str,
    "Response": int
}
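
The two structures above could be written down as Python TypedDict classes so that a type checker flags misspelled keys. This is only a sketch: the class names and the articles_to_json helper are hypothetical, while the field names are copied from the structures above. Passing ensure_ascii=False keeps Tibetan script readable in the output instead of escaping it.

```python
import json
from typing import List, TypedDict


class LinkResult(TypedDict):
    Links: List[str]
    Message: str
    Response: int


class Body(TypedDict):
    Audio: str
    Text: List[str]


class MetaData(TypedDict):
    Author: str
    Date: str
    Tags: List[str]
    URL: str


class ArticleData(TypedDict):
    title: str
    body: Body
    meta_data: MetaData


class ArticleResult(TypedDict):
    data: ArticleData
    Message: str
    Response: int


def articles_to_json(results: List[ArticleResult]) -> str:
    """Serialize scraped results; ensure_ascii=False leaves Tibetan text unescaped."""
    return json.dumps(results, ensure_ascii=False, indent=2)
```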

Implementation Details

extract_all_article Function

Purpose: Extracts all article links from a given VOT webpage
Input: URL of the VOT webpage
Output: Dictionary containing a list of article links, status message, and response code
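
A minimal sketch of the link-collection step, under stated assumptions. It takes already-downloaded HTML rather than a URL so that it runs offline, and it uses the standard library's html.parser as a dependency-free stand-in for Beautiful Soup (whose find_all("a", href=True) would serve the same purpose). The helper name links_from_html, the message strings, and the rule that every same-host link counts as an article link are all assumptions; the real VOT selector is not given here.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_from_html(html: str, page_url: str, status: int = 200) -> Dict:
    """Build the {"Links", "Message", "Response"} dictionary from page HTML.

    Assumption: any link on the same host as page_url is an article link.
    """
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links: List[str] = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)  # keep first occurrence, drop duplicates
    message = "Success" if links else "No article links found"  # placeholder messages
    return {"Links": links, "Message": message, "Response": status}
```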

scrape_vot_article Function

Purpose: Scrapes detailed information from a single VOT article
Input: URL of the specific VOT article
Output: Dictionary containing article data (title, body, metadata), status message, and response code
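
The article-level extraction might look roughly like the sketch below. It is not the project's code: it parses an HTML string with the standard library instead of Beautiful Soup, and it assumes a page layout that has not been checked against VOT (title in the first h1, body text in p tags, audio in an audio or source tag, author, date, and tags in meta tags named author, article:published_time, and article:tag). The names article_from_html and _ArticleParser are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _ArticleParser(HTMLParser):
    """Collects title, paragraphs, audio source, and <meta> values."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.paragraphs: List[str] = []
        self.audio = ""
        self.meta: Dict[str, List[str]] = {}
        self._open = ""            # "h1" or "p" while its text is being collected
        self._buf: List[str] = []
        self._in_audio = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            key = a.get("name") or a.get("property")
            if key and a.get("content"):
                self.meta.setdefault(key, []).append(a["content"])
        elif tag == "audio":
            self._in_audio = True
            self.audio = self.audio or a.get("src") or ""
        elif tag == "source" and self._in_audio:
            self.audio = self.audio or a.get("src") or ""
        elif tag in ("h1", "p") and not self._open:
            self._open, self._buf = tag, []

    def handle_endtag(self, tag):
        if tag == "audio":
            self._in_audio = False
        elif tag == self._open:
            text = "".join(self._buf).strip()
            if tag == "h1" and not self.title:
                self.title = text
            elif tag == "p" and text:
                self.paragraphs.append(text)
            self._open = ""

    def handle_data(self, data):
        if self._open:
            self._buf.append(data)


def article_from_html(html: str, url: str, status: int = 200) -> Dict:
    """Build the article dictionary described in the Data Structure section."""
    parser = _ArticleParser()
    parser.feed(html)

    def first(key: str) -> str:
        values = parser.meta.get(key)
        return values[0] if values else ""

    data = {
        "title": parser.title,
        "body": {"Audio": parser.audio, "Text": parser.paragraphs},
        "meta_data": {
            "Author": first("author"),
            "Date": first("article:published_time"),
            "Tags": parser.meta.get("article:tag", []),
            "URL": url,
        },
    }
    found = bool(parser.title or parser.paragraphs)
    message = "Success" if found else "No article content found"  # placeholder messages
    return {"data": data, "Message": message, "Response": status}
```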

Key Features

User-Agent header to mimic browser requests
Error handling for various scenarios (timeout, request exceptions, parsing errors)
Extraction of article title, author, date, tags, text content, and audio source (if available)
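
The first two features can be illustrated with a small fetch wrapper. This sketch swaps in the standard library's urllib for the Requests library so that it has no third-party dependencies; with Requests, the corresponding exceptions would be requests.exceptions.Timeout and requests.exceptions.RequestException. The function names, the User-Agent value, the "html" key, and the codes 408 and 0 used when no HTTP response arrives are all placeholders, not values taken from the project.

```python
import socket
import urllib.error
import urllib.request
from typing import Callable, Dict

# Placeholder header value; the project's actual User-Agent string is not given here.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}


def _download(url: str, timeout: float) -> str:
    """Fetch a page with a browser-like User-Agent header."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_page(
    url: str,
    timeout: float = 10.0,
    download: Callable[[str, float], str] = _download,
) -> Dict:
    """Return page HTML plus a status message and code instead of raising."""
    try:
        html = download(url, timeout)
    except urllib.error.HTTPError as exc:  # must precede URLError (its parent class)
        return {"html": "", "Message": f"HTTP error: {exc.reason}", "Response": exc.code}
    except (socket.timeout, TimeoutError):
        return {"html": "", "Message": "Request timed out", "Response": 408}
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return {"html": "", "Message": "Request timed out", "Response": 408}
        return {"html": "", "Message": f"Request failed: {exc.reason}", "Response": 0}
    return {"html": html, "Message": "Success", "Response": 200}
```

Passing the downloader as a parameter keeps the error-handling logic testable without network access.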

Implementation Notes

The current implementation focuses on the VOT website. Extend the code to handle other Tibetan news websites.
Ensure that the scraping script respects each website's robots.txt file and implements appropriate delays between requests.
Implement error logging to track any issues during the scraping process.
Consider implementing incremental scraping to avoid duplicating content and to efficiently update the dataset over time.
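
One way the robots.txt, delay, and incremental-scraping notes could fit together is sketched below, using the standard library's urllib.robotparser. The function names, the USER_AGENT value, and the one-second default delay are assumptions, and the set of already-scraped URLs is kept in memory here; a real run would persist it between sessions.

```python
import time
from typing import Callable, Dict, Iterable, List, Set
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical agent name


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return robots


def links_to_scrape(links: Iterable[str], seen: Set[str], robots: RobotFileParser) -> List[str]:
    """Keep links that are new and that robots.txt allows for USER_AGENT."""
    return [u for u in links if u not in seen and robots.can_fetch(USER_AGENT, u)]


def request_delay(robots: RobotFileParser, default: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = robots.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def crawl(
    links: Iterable[str],
    seen: Set[str],
    robots: RobotFileParser,
    scrape: Callable[[str], Dict],
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Scrape only new, permitted links, pausing after each request."""
    results = []
    for url in links_to_scrape(links, seen, robots):
        results.append(scrape(url))
        seen.add(url)  # remember the URL so later runs skip it
        sleep(request_delay(robots))
    return results
```

In a live run, RobotFileParser.set_url followed by read would download the robots.txt file instead of parsing a string.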

Resources

Beautiful Soup library for HTML parsing
Requests library for making HTTP requests
Python's built-in time module for implementing delays and tracking request duration

OpenPecha / tibetan-news-article-scraping

DA009: Tibetan news article scraping #2