Open tenzinchoedon opened 3 months ago
@kaldan007 Each category on the VOT site has around 200 pages, with 8-9 articles per page (roughly 1,600-1,800 articles per category). How much of this do I need to extract?
The output below includes the link to the audio file when the article provides one. If the article has no audio file, the output prints "No audio tag found" instead.
expected output:
{
'title':xyx,
'body': {
'text':[para1, para2],
'audio':'audiolink'
},
'meta': {
'date':
'src_url':
'tags': ['tag1', 'tag2']
'other':
}
}
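A minimal sketch of how one article's HTML could be turned into the structure above, including the "No audio tag found" fallback. It uses only the standard library. The names `ArticleParser` and `build_record` are hypothetical, and it assumes the title sits in an `<h1>`, the body in `<p>` tags, and the audio link in an `<audio>` tag or a nested `<source>`; the real VOT markup may differ and would need its own selectors.

```python
from html.parser import HTMLParser

NO_AUDIO = "No audio tag found"


class ArticleParser(HTMLParser):
    """Collects the first <h1>, all non-empty <p> texts, and the first audio src.

    Assumption: the tag layout here is illustrative, not the actual VOT markup.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self.audio = None
        self._current = None    # "h1" or "p" while capturing text
        self._buffer = []
        self._in_audio = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("h1", "p"):
            self._current, self._buffer = tag, []
        elif tag == "audio":
            self._in_audio = True
            self.audio = self.audio or attrs.get("src")
        elif tag == "source" and self._in_audio:
            self.audio = self.audio or attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "audio":
            self._in_audio = False
        elif tag == self._current:
            text = "".join(self._buffer).strip()
            if tag == "h1":
                self.title = self.title or text
            elif text:
                self.paragraphs.append(text)
            self._current = None

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)


def build_record(html, src_url, date="", tags=None, other=""):
    """Return one article in the title/body/meta layout from the expected output."""
    parser = ArticleParser()
    parser.feed(html)
    return {
        "title": parser.title,
        "body": {
            "text": parser.paragraphs,
            "audio": parser.audio or NO_AUDIO,
        },
        "meta": {
            "date": date,
            "src_url": src_url,
            "tags": tags or [],
            "other": other,
        },
    }
```

Called as `build_record(page_html, src_url=article_url)`, this returns a dictionary whose `body['audio']` is either the audio link or the fallback string, so downstream code can check for the fallback without a separate flag.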
Save it in Google Drive and notify @TenzinGayche.
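A minimal sketch of writing the scraped records to a JSON file before it goes to Google Drive. The function name `save_records` and the output path are hypothetical, and the upload itself is not shown: this assumes the file is either written into a locally synced Drive folder or uploaded manually. `ensure_ascii=False` keeps the Tibetan text readable in the file rather than escaping it as `\uXXXX` sequences.

```python
import json
from pathlib import Path


def save_records(records, out_path):
    """Write a list of article dictionaries to a UTF-8 JSON file."""
    out_path = Path(out_path)
    # Create the target folder if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return out_path
```

For example, `save_records(articles, "vot/articles.json")` writes every record into one file; one file per category would work the same way with a different path.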
Objective
Develop scripts to efficiently scrape Tibetan news articles from multiple sources, starting with the Voice of Tibet (VOT) website, and store them in a structured format for training a machine translation model.
Description
We need Tibetan news articles for training our machine translation model. This task involves creating scripts to collect articles from various Tibetan news websites, beginning with VOT, and organizing them in a clear, structured format.
Data link:
Completion Criteria
Tibetan News Websites to be Extracted
Tibetan to English Translation Websites to be Extracted:
Subtasks
Data Structure
The scraped article link for each page is stored in a dictionary with the following structure:
The scraped data for each article is stored in a dictionary with the following structure:
Language Translation format:
Implementation Details
Note: VOT is used as the example website.
extract_all_vot_article Function
scrape_vot_article Function
Key Features
Implementation Notes
Resources