OpenPecha / tibetan-news-article-scraping


Objective

Develop scripts to efficiently scrape Tibetan news articles from multiple sources, starting with the Voice of Tibet (VOT) website, and store them in a structured format for training a machine translation model.

Description

We need Tibetan news articles for training our machine translation model. This task involves creating scripts to collect articles from various Tibetan news websites, beginning with VOT, and organizing them in a clear, structured format.

Data link:

Completion Criteria

Tibetan News Websites to be Extracted

Tibetan to English Translation Websites to be Extracted:

Subtasks

  1. Implement a function to collect all article links from a website
  2. Implement a function to extract detailed information from individual article links
  3. Extend the existing code to handle other Tibetan news websites
  4. Organize the collected news articles in a clear, structured JSON format
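
For the JSON storage step, one possible approach is sketched below. It is only an illustration: the helper names `save_articles` and `load_articles` and the output path are hypothetical and not part of the existing code. The main point is that `ensure_ascii=False` together with UTF-8 file encoding keeps Tibetan text readable in the output file.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def save_articles(articles: List[Dict[str, Any]], path: Path) -> None:
    """Write article records to a UTF-8 JSON file (hypothetical helper)."""
    # Create the output directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False stores Tibetan characters as-is
        # instead of \uXXXX escape sequences.
        json.dump(articles, f, ensure_ascii=False, indent=2)


def load_articles(path: Path) -> List[Dict[str, Any]]:
    """Read article records back from a JSON file (hypothetical helper)."""
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)
```

Writing one file per source website, or one file per article, would work equally well; the choice depends on how the training pipeline consumes the data.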

Data Structure

The scraped article links for each page are stored in a dictionary with the following structure:

{
    "Links": List[str],
    "Message": string,
    "Response": int
}
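
A minimal sketch of how a result in this shape could be produced from the HTML of one listing page. It uses only the standard library and takes the HTML as a string, so fetching is left out. The names `build_links_result` and `_AnchorCollector` are hypothetical, as is the "No links found" message. The sketch collects every anchor on the page; a real scraper would need a site-specific filter to keep only article links.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_links_result(html: str, base_url: str, status_code: int = 200) -> Dict[str, Any]:
    """Return the links dictionary for one page (hypothetical helper)."""
    parser = _AnchorCollector()
    parser.feed(html)

    links: List[str] = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop duplicates
        # while preserving the order in which links appear.
        absolute = urljoin(base_url, href)
        if absolute not in links:
            links.append(absolute)

    # The message text is an assumption; the source does not specify it.
    message = "Success" if links else "No links found"
    return {"Links": links, "Message": message, "Response": status_code}
```

Here `status_code` is meant to carry the HTTP status of the request that fetched the page, which the caller passes in.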

The scraped data for each article is stored in a dictionary with the following structure:

{
    "data": {
        "title": str,
        "body": {
            "Audio": str,
            "Text": List[str]
        },
        "meta_data": {
            "Author": str,
            "Date": str,
            "Tags": List[str],
            "URL": str
        }
    },
    "Message": str,
    "Response": int
}
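
To keep every scraped article in exactly this shape, the record could be assembled in one place. The sketch below is an assumption about how that might look: `make_article_record`, its parameter names, and its default values are hypothetical.

```python
from typing import Any, Dict, List, Optional


def make_article_record(
    title: str,
    paragraphs: List[str],
    url: str,
    audio: str = "",
    author: str = "",
    date: str = "",
    tags: Optional[List[str]] = None,
    message: str = "Success",
    response: int = 200,
) -> Dict[str, Any]:
    """Assemble one article record in the structure described above."""
    return {
        "data": {
            "title": title,
            "body": {
                "Audio": audio,            # link to the audio file, if any
                "Text": list(paragraphs),  # one entry per paragraph
            },
            "meta_data": {
                "Author": author,
                "Date": date,
                "Tags": list(tags or []),
                "URL": url,
            },
        },
        "Message": message,
        "Response": response,
    }
```

Fields that a given site does not provide (for example audio) simply stay as empty strings or empty lists, so downstream code can rely on all keys being present.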

Language Translation format:

translation_format = {
    "data": {
        "English": {
            "Word": "",
            "POS": "",
            "Sentence": ""
        },
        "Tibetan": {
            "Word": "",
            "phonetic": "",
            "Sentence": ""
        },
        "czech": {
            "Word": "",
            "Sentence": ""
        },
        "meta_data": {
            "Comment": "",
            "Source": ""
        },
        "Message": "Success"
    },
    "Message": "Success",
    "Response": 200
}
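
Because `translation_format` is a nested dictionary, reusing it directly for several entries would make them share the same inner dictionaries. A minimal sketch of one way to avoid that, assuming the template above is passed in as `template`; the function name `fill_translation_entry` and its parameters are hypothetical:

```python
import copy
from typing import Any, Dict


def fill_translation_entry(
    template: Dict[str, Any],
    english_sentence: str,
    tibetan_sentence: str,
    source: str = "",
) -> Dict[str, Any]:
    """Return a filled-in copy of the template without modifying it."""
    # deepcopy duplicates the nested dictionaries, so each entry is independent.
    entry = copy.deepcopy(template)
    entry["data"]["English"]["Sentence"] = english_sentence
    entry["data"]["Tibetan"]["Sentence"] = tibetan_sentence
    entry["data"]["meta_data"]["Source"] = source
    return entry
```

The same pattern extends to the other fields (`Word`, `POS`, `phonetic`, `Comment`) if they are available from the source website.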

Implementation Details

Note: The details below use the VOT website as the example.

extract_all_vot_article Function

scrape_vot_article Function

Key Features

Implementation Notes

Resources