Develop scripts that scrape Tibetan news articles from multiple sources, starting with the Voice of Tibet (VOT) website, and store them in a structured format for training a machine translation model.
The machine translation model needs Tibetan news text as training data. The scripts should collect articles from several Tibetan news websites, beginning with VOT, and save every result in the dictionary structures described below.
The article links scraped from each listing page are stored in a dictionary with the following structure:
{
    "Links": List[str],
    "Message": str,
    "Response": int
}
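As a minimal sketch of how a listing page could be turned into this dictionary, the following uses only the Python standard library. The names `build_links_result` and `_AnchorCollector` are hypothetical, the failure message text is an assumption, and the code collects every `<a href>` on the page because the actual VOT page markup is not described here. A real scraper would likely narrow this down to the article links only.

```python
from html.parser import HTMLParser
from typing import Dict, List, Union


class _AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_links_result(
    html: str, status_code: int
) -> Dict[str, Union[List[str], str, int]]:
    """Wrap the links found on one listing page in the Links/Message/Response dictionary."""
    if status_code != 200:
        # Assumption: failed requests return an empty list and a descriptive message.
        return {
            "Links": [],
            "Message": f"Request failed with status {status_code}",
            "Response": status_code,
        }
    parser = _AnchorCollector()
    parser.feed(html)
    # Drop duplicate links while keeping first-seen order.
    links = list(dict.fromkeys(parser.hrefs))
    return {"Links": links, "Message": "Success", "Response": status_code}
```

The function takes the page HTML and the HTTP status code as arguments, so the fetching step (for example with `requests` or `urllib`) stays separate and the parsing logic can be tested offline.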
The scraped data for each article is stored in a dictionary with the following structure:
{
    "data": {
        "title": str,
        "body": {
            "Audio": str,
            "Text": List[str]
        },
        "meta_data": {
            "Author": str,
            "Date": str,
            "Tags": List[str],
            "URL": str
        }
    },
    "Message": str,
    "Response": int
}
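The article structure above can be expressed with `TypedDict` so that a type checker catches missing or misspelled keys. This is a sketch under assumptions: the class names, `build_article_result`, and `to_json` are hypothetical helpers, and stripping whitespace and dropping empty paragraphs is a design choice that the original specification does not state.

```python
import json
from typing import List, TypedDict


class Body(TypedDict):
    Audio: str
    Text: List[str]


class MetaData(TypedDict):
    Author: str
    Date: str
    Tags: List[str]
    URL: str


class ArticleData(TypedDict):
    title: str
    body: Body
    meta_data: MetaData


class ArticleResult(TypedDict):
    data: ArticleData
    Message: str
    Response: int


def build_article_result(
    title: str,
    paragraphs: List[str],
    audio_url: str,
    author: str,
    date: str,
    tags: List[str],
    url: str,
    status_code: int = 200,
) -> ArticleResult:
    """Assemble one scraped article into the data/Message/Response dictionary."""
    # Assumption: surrounding whitespace is stripped and empty paragraphs are dropped.
    text = [p.strip() for p in paragraphs if p.strip()]
    message = "Success" if status_code == 200 else f"Request failed with status {status_code}"
    return {
        "data": {
            "title": title.strip(),
            "body": {"Audio": audio_url, "Text": text},
            "meta_data": {"Author": author, "Date": date, "Tags": list(tags), "URL": url},
        },
        "Message": message,
        "Response": status_code,
    }


def to_json(result: ArticleResult) -> str:
    """Serialize a result; ensure_ascii=False keeps Tibetan script readable in the file."""
    return json.dumps(result, ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` matters for this corpus: without it, `json.dumps` writes every Tibetan character as a `\uXXXX` escape, which is still valid JSON but hard to inspect by eye.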
Language translation format:
translation_format = {
    "data": {
        "English": {
            "Word": "",
            "POS": "",
            "Sentence": ""
        },
        "Tibetan": {
            "Word": "",
            "phonetic": "",
            "Sentence": ""
        },
        "czech": {
            "Word": "",
            "Sentence": ""
        },
        "meta_data": {
            "Comment": "",
            "Source": ""
        },
        "Message": "Success"
    },
    "Message": "Success",
    "Response": 200
}
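Because `translation_format` is a nested dictionary, filling it in directly would change the shared template for every later entry. One possible approach, sketched below, is to deep-copy the template for each sentence pair. The helper `new_translation_entry` is hypothetical, and the choice to fill only the `Sentence`, `Source`, and `Comment` fields is an assumption made for illustration.

```python
import copy
from typing import Any, Dict

# Same template as in the specification above, repeated so the sketch runs on its own.
translation_format: Dict[str, Any] = {
    "data": {
        "English": {"Word": "", "POS": "", "Sentence": ""},
        "Tibetan": {"Word": "", "phonetic": "", "Sentence": ""},
        "czech": {"Word": "", "Sentence": ""},
        "meta_data": {"Comment": "", "Source": ""},
        "Message": "Success",
    },
    "Message": "Success",
    "Response": 200,
}


def new_translation_entry(
    english_sentence: str,
    tibetan_sentence: str,
    source: str,
    comment: str = "",
) -> Dict[str, Any]:
    """Return a fresh entry for one sentence pair without mutating the template."""
    # deepcopy is required: a shallow copy would still share the nested dictionaries.
    entry = copy.deepcopy(translation_format)
    entry["data"]["English"]["Sentence"] = english_sentence
    entry["data"]["Tibetan"]["Sentence"] = tibetan_sentence
    entry["data"]["meta_data"]["Source"] = source
    entry["data"]["meta_data"]["Comment"] = comment
    return entry
```

A call such as `new_translation_entry("Hello.", "བཀྲ་ཤིས་བདེ་ལེགས།", source="https://example.org/article")` returns an independent dictionary. The sentence pair and URL in that call are placeholders, not scraped data.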
Note: The structures above use the VOT website as the example source.