Develop scripts to efficiently scrape Tibetan news articles from multiple sources, starting with the Voice of Tibet (VOT) website, and store them in a structured format for training a machine translation model.
Description
We need Tibetan news articles for training our machine translation model. This task involves creating scripts to collect articles from various Tibetan news websites, beginning with VOT, and organizing them in a clear, structured format.
Completion Criteria
Scripts developed that can efficiently scrape all the Tibetan news articles from VOT and other sources.
Collected articles stored in a structured format (JSON) suitable for use in machine translation training.
Tibetan News Websites
Subtasks
Data Structure
The article links scraped from each page are stored in a dictionary with the following structure:
The scraped data for each article is stored in a dictionary with the following structure:
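The two structures could look roughly like the sketch below. Every key name (`page`, `links`, `url`, `title`, `body`), the `example.org` URLs, the placeholder Tibetan title, and the `save_articles` helper are assumptions for illustration, not the project's confirmed schema. The one firm point is that `ensure_ascii=False` keeps Tibetan text readable in the JSON file instead of turning it into `\uXXXX` escapes.

```python
import json

# Hypothetical per-page record: the article links found on one listing page.
page_links = {
    "page": 1,
    "links": [
        "https://example.org/article-1",
        "https://example.org/article-2",
    ],
}

# Hypothetical per-article record: one scraped article.
article = {
    "url": "https://example.org/article-1",
    "title": "བོད་ཡིག་གསར་འགྱུར།",  # placeholder Tibetan title
    "body": ["First paragraph of the article.", "Second paragraph."],
}


def save_articles(articles, path):
    """Write a list of article dictionaries to a UTF-8 encoded JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False writes Tibetan characters as-is rather than as escapes.
        json.dump(articles, f, ensure_ascii=False, indent=2)
```

Calling `save_articles([article], "articles.json")` would write a single-element JSON array that can be loaded back with `json.load`.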
Implementation Details
extract_all_article Function
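As a rough illustration of the link-collection step, the sketch below gathers every `<a href>` from a page's HTML using only the standard library. It is not the project's `extract_all_article` implementation: the names `LinkCollector` and `extract_links` are hypothetical, the page is assumed to be already downloaded as a string, and a real scraper would likely filter links with site-specific rules that are not given here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute, de-duplicated URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the links found in `html`, in order of first appearance."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```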
scrape_vot_article Function
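The parsing half of an article scraper might be sketched as follows. This is an assumption-laden example, not the actual `scrape_vot_article` code: it takes the first `<h1>` as the title and every non-empty `<p>` as a body paragraph, whereas the real VOT markup and selectors are not specified here. The names `ArticleParser` and `parse_article_html` and the output keys are hypothetical, and fetching the page is left out.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Captures the first <h1> as the title and each <p> as a paragraph."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None  # tag currently being captured, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p") and self._tag is None:
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "h1" and not self.title:
            self.title = text
        elif tag == "p" and text:
            self.paragraphs.append(text)
        self._tag = None


def parse_article_html(html, url):
    """Return a dictionary describing one article page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title, "body": parser.paragraphs}
```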
Key Features
Implementation Notes
Resources