OpenPecha / tibetan-news-article-scraping

MT0026: Tibetan news article scraping #1

Open tenzinchoedon opened 3 months ago

tenzinchoedon commented 3 months ago

Objective

Develop scripts to efficiently scrape Tibetan news articles from multiple sources, starting with the Voice of Tibet (VOT) website, and store them in a structured format for training a machine translation model.

Description

We need Tibetan news articles for training our machine translation model. This task involves creating scripts to collect articles from various Tibetan news websites, beginning with VOT, and organizing them in a clear, structured format.

Data link:

Completion Criteria

Tibetan News Websites to be Extracted

Tibetan to English Translation Websites to be Extracted:

Subtasks

  1. Implement a function to collect all article links from a website.
  2. Implement a function to extract detailed information from each article link.
  3. Extend the existing code to handle other Tibetan news websites.
  4. Store the collected news articles in a clearly structured JSON format.
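
For subtask 4, a minimal sketch of writing the collected articles to JSON. The function names `save_articles` and `load_articles` are hypothetical, not part of the existing code. The one detail worth showing is `ensure_ascii=False`, which keeps Tibetan script readable in the file instead of turning it into `\uXXXX` escapes.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def save_articles(articles: List[Dict[str, Any]], path: Path) -> None:
    """Write scraped articles to a UTF-8 JSON file (hypothetical helper)."""
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False stores Tibetan characters as-is, not as \u escapes.
        json.dump(articles, f, ensure_ascii=False, indent=2)


def load_articles(path: Path) -> List[Dict[str, Any]]:
    """Read the articles back, e.g. to check a finished scraping run."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

Calling `save_articles(records, Path("vot_articles.json"))` would produce one file per run; the file name here is only an example.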

Data Structure

The article links scraped from each page are stored in a dictionary with the following structure:

{
    "Links": List[str],
    "Message": str,
    "Response": int
}
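
A minimal sketch of how a listing page could be turned into this dictionary, using only the standard library. `collect_links` is a hypothetical name, and the sketch takes already-downloaded HTML plus the HTTP status so that parsing stays separate from fetching. It collects every `<a href>` on the page; a real site such as VOT would need an extra filter for article URLs, which is not known from this issue. The failure message wording is also an assumption.

```python
from html.parser import HTMLParser
from typing import Dict, List, Union
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_links(html: str, base_url: str,
                  status: int = 200) -> Dict[str, Union[List[str], str, int]]:
    """Build the Links/Message/Response dictionary for one listing page."""
    if status != 200:
        # Assumption: failures keep the same keys with an empty link list.
        return {"Links": [], "Message": f"Request failed with status {status}",
                "Response": status}
    parser = _AnchorCollector()
    parser.feed(html)
    # Resolve relative hrefs and drop duplicates while keeping page order.
    links = list(dict.fromkeys(urljoin(base_url, h) for h in parser.hrefs))
    return {"Links": links, "Message": "Success", "Response": 200}
```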

The scraped data for each article is stored in a dictionary with the following structure:

{
    "data": {
        "title": str,
        "body": {
            "Audio": str,
            "Text": List[str]
        },
        "meta_data": {
            "Author": str,
            "Date": str,
            "Tags": List[str],
            "URL": str
        }
    },
    "Message": str,
    "Response": int
}
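
A minimal sketch of building one record in this layout and checking that a record matches it. Both function names are hypothetical. The check only covers the keys and types listed above; it does not validate dates or URLs.

```python
from typing import Any, Dict, List, Optional


def make_article_record(title: str, text: List[str], url: str, audio: str = "",
                        author: str = "", date: str = "",
                        tags: Optional[List[str]] = None) -> Dict[str, Any]:
    """Assemble one article in the dictionary layout described above."""
    return {
        "data": {
            "title": title,
            "body": {"Audio": audio, "Text": list(text)},
            "meta_data": {"Author": author, "Date": date,
                          "Tags": list(tags or []), "URL": url},
        },
        "Message": "Success",
        "Response": 200,
    }


def find_schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return readable problems; an empty list means the layout matches."""
    data = record.get("data")
    if not isinstance(data, dict):
        return ["data must be a dict"]
    body = data.get("body") if isinstance(data.get("body"), dict) else {}
    meta = data.get("meta_data") if isinstance(data.get("meta_data"), dict) else {}
    problems: List[str] = []
    # (container, key, expected type) for every scalar field in the layout.
    checks = [
        (data, "title", str), (body, "Audio", str),
        (meta, "Author", str), (meta, "Date", str), (meta, "URL", str),
        (record, "Message", str), (record, "Response", int),
    ]
    for container, key, expected in checks:
        if not isinstance(container.get(key), expected):
            problems.append(f"{key} must be {expected.__name__}")
    for container, key in [(body, "Text"), (meta, "Tags")]:
        value = container.get(key)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            problems.append(f"{key} must be a list of str")
    return problems
```

Running `find_schema_problems` on every record before saving is one cheap way to catch a scraper that silently started returning the wrong shape.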

Language Translation format:

translation_format = {
    "data": {
        "English": {
            "Word": "",
            "POS": "",
            "Sentence": ""
        },
        "Tibetan": {
            "Word": "",
            "phonetic": "",
            "Sentence": ""
        },
        "czech": {
            "Word": "",
            "Sentence": ""
        },
        "meta_data": {
            "Comment": "",
            "Source": ""
        },
        "Message": "Success"
    },
    "Message": "Success",
    "Response": 200
}
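
A minimal sketch of filling this format for one sentence pair. `TRANSLATION_TEMPLATE` and `new_translation_entry` are hypothetical names; the keys mirror the format above. The point of the sketch is `copy.deepcopy`: because the template contains nested dictionaries, a shallow copy would let every entry share and overwrite the same inner `"English"` and `"Tibetan"` dictionaries.

```python
import copy
from typing import Any, Dict

# Same keys as the translation format above, written compactly.
TRANSLATION_TEMPLATE: Dict[str, Any] = {
    "data": {
        "English": {"Word": "", "POS": "", "Sentence": ""},
        "Tibetan": {"Word": "", "phonetic": "", "Sentence": ""},
        "czech": {"Word": "", "Sentence": ""},
        "meta_data": {"Comment": "", "Source": ""},
        "Message": "Success",
    },
    "Message": "Success",
    "Response": 200,
}


def new_translation_entry(english_sentence: str, tibetan_sentence: str,
                          source: str = "") -> Dict[str, Any]:
    """Return an independent entry; the shared template is never mutated."""
    entry = copy.deepcopy(TRANSLATION_TEMPLATE)
    entry["data"]["English"]["Sentence"] = english_sentence
    entry["data"]["Tibetan"]["Sentence"] = tibetan_sentence
    entry["data"]["meta_data"]["Source"] = source
    return entry
```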

Implementation Details

Note: The VOT website is used as the example.

extract_all_vot_article Function

scrape_vot_article Function

Key Features

Implementation Notes

Resources

tenzinchoedon commented 3 months ago

@kaldan007 there are around 200 pages in each category of the VOT site, with 8-9 articles on each page. Roughly how many of them do I need to extract?

tenzinchoedon commented 3 months ago

The output below includes the link to the audio file if the article provides one. If the article has no audio file, the output prints "No audio tag found".

Image
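
A minimal sketch of the audio fallback described in this comment, using only the standard library. `extract_audio_link` is a hypothetical name, and the sketch assumes the audio is embedded with an HTML `<audio>` element, either through its own `src` attribute or a nested `<source src=...>`. How VOT actually embeds audio is not confirmed in this thread.

```python
from html.parser import HTMLParser
from typing import Optional

NO_AUDIO = "No audio tag found"


class _AudioFinder(HTMLParser):
    """Remembers the first audio URL found inside an <audio> element."""

    def __init__(self) -> None:
        super().__init__()
        self.in_audio = False
        self.src: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "audio":
            self.in_audio = True
            if self.src is None and attr_map.get("src"):
                self.src = attr_map["src"]
        elif tag == "source" and self.in_audio and self.src is None:
            # Only <source> tags nested in <audio> count, not those in <video>.
            self.src = attr_map.get("src") or None

    def handle_endtag(self, tag):
        if tag == "audio":
            self.in_audio = False


def extract_audio_link(html: str) -> str:
    """Return the audio URL, or the fallback message when none is present."""
    finder = _AudioFinder()
    finder.feed(html)
    return finder.src or NO_AUDIO
```

Returning the fallback string matches the behaviour described above; storing an empty string in the `"Audio"` field instead would be an alternative if downstream code needs to tell links from messages.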

kaldan007 commented 3 months ago

expected output:

{
    'title': xyx,
    'body': {
        'text': [para1, para2],
        'audio': 'audiolink'
    },
    'meta': {
        'date':
        'src_url':
        'tags': ['tag1', 'tag2']
        'other':
    }
}

Save it to Google Drive and notify @TenzinGayche.
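
A minimal sketch of converting a record from the issue's Data Structure section into the expected output above. `to_expected_output` is a hypothetical name. The thread does not say what belongs in `'other'`, so placing the author there is purely an assumption.

```python
from typing import Any, Dict


def to_expected_output(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map the data/Message/Response record onto the title/body/meta layout."""
    data = record["data"]
    meta = data["meta_data"]
    return {
        "title": data["title"],
        "body": {
            "text": list(data["body"]["Text"]),
            "audio": data["body"]["Audio"],
        },
        "meta": {
            "date": meta["Date"],
            "src_url": meta["URL"],
            "tags": list(meta["Tags"]),
            # Assumption: 'other' is unspecified in the thread; author goes here.
            "other": {"author": meta["Author"]},
        },
    }
```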

kaldan007 commented 3 months ago

https://www.gyalwarinpoche.com/