Here is a detailed breakdown of the web scraping approach to collect Azerbaijani speech and text data for training speech recognition and text-to-speech models:
Identify websites that contain Azerbaijani speech content with text transcriptions or subtitles, such as YouTube channels with Azerbaijani subtitles, news websites, audiobooks, and podcasts.
Use web scraping tools and libraries such as BeautifulSoup, Scrapy, Selenium, and youtube-dl (the actively maintained fork yt-dlp accepts the same flags).
Install the necessary Python packages (pydub is included here because it is used later for audio segmentation):
pip install beautifulsoup4 scrapy selenium youtube-dl pydub
Step-by-Step Process (YouTube Videos):
Use youtube-dl to download subtitles:
youtube-dl --write-sub --sub-lang az --skip-download <video_url>
This command downloads Azerbaijani subtitles without downloading the video.
Use youtube-dl to download audio:
youtube-dl --extract-audio --audio-format wav <video_url>
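A small batch wrapper, sketched with Python's subprocess module, for running both commands over a list of video URLs. The URL list is a placeholder; the --convert-subs srt option (which requires ffmpeg) asks youtube-dl to convert its usual VTT output to SRT so the parser below can read it.
import subprocess

video_urls = ["https://www.youtube.com/watch?v=..."]  # placeholder list

for url in video_urls:
    # Download Azerbaijani subtitles only, converted to SRT.
    subprocess.run(["youtube-dl", "--write-sub", "--sub-lang", "az",
                    "--skip-download", "--convert-subs", "srt", url], check=True)
    # Download the audio track as WAV.
    subprocess.run(["youtube-dl", "--extract-audio", "--audio-format", "wav", url],
                   check=True)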
Example Code to Parse Subtitles:
import re

def parse_srt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    # Each SRT cue is: sequence number, "start --> end" timecodes, then one or
    # more text lines ending at a blank line. DOTALL lets (.+?) span lines, so
    # multi-line cues are captured (the original single-line regex missed them).
    pattern = r'\d+\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(.+?)(?:\n\n|\Z)'
    parsed_subtitles = []
    for start_time, end_time, text in re.findall(pattern, content, re.DOTALL):
        parsed_subtitles.append((start_time, end_time, text.replace('\n', ' ')))
    return parsed_subtitles
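A quick usage sketch; the filename video.az.srt is hypothetical, since youtube-dl names subtitle files after the video title and ID.
subtitles = parse_srt('video.az.srt')  # hypothetical filename
for start, end, text in subtitles[:3]:
    print(start, end, text)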
Step-by-Step Process (Static Websites):
Identify Static Content: Find news websites, audiobooks, and podcasts in Azerbaijani.
Scrape HTML: Use requests and BeautifulSoup to scrape static HTML content.
import requests
from bs4 import BeautifulSoup

url = "http://example.com/azerbaijani-news"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Example: Extracting headlines and their URLs
headlines = soup.find_all('h2', class_='headline')
for headline in headlines:
    text = headline.get_text()
    link = headline.find('a')['href']
    print(f"Headline: {text}, URL: {link}")
Download Audio: If the website contains embedded audio, extract and download it using direct links.
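A minimal sketch of that extraction, assuming the page embeds audio through standard <audio>/<source> tags; the URL is a placeholder.
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://example.com/azerbaijani-audiobooks"  # placeholder URL
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

os.makedirs("audio", exist_ok=True)
# Matches <audio src="..."> as well as <audio><source src="..."></audio>
for tag in soup.find_all(['audio', 'source'], src=True):
    audio_url = urljoin(url, tag['src'])  # resolve relative paths
    with open(os.path.join("audio", audio_url.split('/')[-1]), 'wb') as f:
        f.write(requests.get(audio_url).content)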
Step-by-Step Process (Dynamic Websites):
Use Selenium for Dynamic Content: Automate browser actions to interact with dynamic content.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com/azerbaijani-videos")

# Example: Click a button to load more content
load_more_button = driver.find_element(By.XPATH, "//button[text()='Load More']")
load_more_button.click()

# Extract video links after loading more content
video_links = driver.find_elements(By.XPATH, "//a[contains(@href, '/watch')]")
for link in video_links:
    print(link.get_attribute('href'))

driver.quit()
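One caveat: clicking immediately after page load often fails because the button has not rendered yet. A sketch using Selenium's explicit waits, continuing the session above with the same placeholder XPath:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the button to become clickable before clicking.
load_more_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//button[text()='Load More']"))
)
load_more_button.click()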
After scraping, clean and prepare the data: normalize the text, segment the audio against the subtitle timestamps, and discard noisy or misaligned pairs.
Store the collected data in a structured format, such as paired audio/ and transcripts/ directories (see the directory structure below).
Validate the data by manually checking a subset of the audio and text pairs for accuracy and consistency.
This approach provides a robust method for collecting and preparing a dataset for training speech recognition and text-to-speech models in Azerbaijani.
Subtitles are text representations of the spoken dialogue in videos or audio recordings. They are typically synchronized with the audio, displaying the text on-screen at the appropriate times to match what is being said. Subtitles serve several purposes, including accessibility for viewers who are deaf or hard of hearing, translation for foreign-language audiences, and, for our purposes, providing time-aligned transcripts of speech.
Subtitles usually come in specific file formats, the most common of which are SRT (SubRip) and WebVTT.
The SRT format is one of the most commonly used subtitle formats. It contains plain text with a sequence number, start and end timecodes, and the subtitle text.
Example SRT File:
1
00:00:01,000 --> 00:00:04,000
Hello, how are you?
2
00:00:05,000 --> 00:00:07,000
I'm good, thank you!
The VTT format is commonly used for web-based video players like HTML5 video. It is similar to the SRT format but includes additional features for styling and metadata.
Example VTT File:
WEBVTT
1
00:00:01.000 --> 00:00:04.000
Hello, how are you?
2
00:00:05.000 --> 00:00:07.000
I'm good, thank you!
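Because --write-auto-sub (used below) typically saves WebVTT files, a VTT variant of the parse_srt function shown earlier is useful. This is a minimal sketch that assumes HH:MM:SS.mmm timecodes (some VTT files omit the hours field) and skips any cue settings after the timecode line.
import re

def parse_vtt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    # Timecodes use '.' for milliseconds; cue numbers are optional, so anchor
    # on the timecode line itself and ignore trailing cue settings.
    pattern = r'(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})[^\n]*\n(.+?)(?:\n\n|\Z)'
    cues = re.findall(pattern, content, re.DOTALL)
    return [(start, end, text.replace('\n', ' ')) for start, end, text in cues]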
Downloading Subtitles:
Use youtube-dl to download subtitles from YouTube videos:
youtube-dl --write-auto-sub --sub-lang az --skip-download <video_url>
Parsing Subtitles:
Use the parse_srt function shown earlier (or parse_vtt for WebVTT files) to turn each subtitle file into (start_time, end_time, text) tuples.
Example Code for Segmenting Audio:
from pydub import AudioSegment  # requires ffmpeg or libav to be installed

def segment_audio(audio_file, subtitles):
    audio = AudioSegment.from_file(audio_file)
    segments = []
    for start_time, end_time, text in subtitles:
        # Slice the audio between the subtitle's start and end timecodes.
        start_ms = time_to_ms(start_time)
        end_ms = time_to_ms(end_time)
        segments.append((audio[start_ms:end_ms], text))
    return segments

def time_to_ms(time_str):
    # Converts an SRT timecode "HH:MM:SS,mmm" to milliseconds.
    h, m, s = map(float, time_str.replace(',', '.').split(':'))
    return int((h * 3600 + m * 60 + s) * 1000)
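A usage sketch tying the two functions together; the video.wav/video.az.srt filenames are hypothetical, and the dataset/ layout matches the directory structure described later.
import os

subtitles = parse_srt('video.az.srt')            # hypothetical filenames
segments = segment_audio('video.wav', subtitles)

os.makedirs('dataset/audio', exist_ok=True)
os.makedirs('dataset/transcripts', exist_ok=True)
for i, (segment, text) in enumerate(segments, start=1):
    segment.export(f'dataset/audio/file{i}.wav', format='wav')
    with open(f'dataset/transcripts/file{i}.txt', 'w', encoding='utf-8') as f:
        f.write(text)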
By using subtitles, you can align text with audio accurately, creating a high-quality dataset for training speech recognition and text-to-speech models. Subtitles provide the necessary text transcripts that are crucial for these types of machine learning models.
Here's a more focused breakdown on how to gather and process audio data along with text:
Extract Video Links:
Download Subtitles and Audio: Use youtube-dl to download both audio and subtitles. Commands:
# Download subtitles
youtube-dl --write-auto-sub --sub-lang az --skip-download <video_url>
# Download audio
youtube-dl --extract-audio --audio-format wav <video_url>
Parse Subtitles:
Use the parse_srt function shown earlier on the downloaded subtitle files.
Identify Sources:
Scrape HTML:
Example Code:
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://example.com/azerbaijani-podcasts"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

os.makedirs("audio", exist_ok=True)
audio_links = soup.find_all('a', href=True)
for link in audio_links:
    if link['href'].endswith('.mp3'):
        audio_url = urljoin(url, link['href'])  # resolve relative links
        # Download the audio file
        audio_response = requests.get(audio_url)
        with open(f"audio/{audio_url.split('/')[-1]}", 'wb') as file:
            file.write(audio_response.content)
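For long podcast episodes, streaming the download avoids holding the whole file in memory; a variant of the download step above:
# Stream the response in chunks instead of loading it all at once.
with requests.get(audio_url, stream=True) as r:
    r.raise_for_status()
    with open(f"audio/{audio_url.split('/')[-1]}", 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)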
Use Selenium:
Example Code:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com/azerbaijani-news-videos")

# Example: Click to load more content
load_more_button = driver.find_element(By.XPATH, "//button[text()='Load More']")
load_more_button.click()

# Extract video links
video_links = driver.find_elements(By.XPATH, "//a[contains(@href, '/watch')]")
video_urls = [link.get_attribute('href') for link in video_links]
# Use youtube-dl to download audio and subtitles for each URL, as shown earlier

driver.quit()
Segmentation:
Use the segment_audio and time_to_ms functions shown earlier to pair each subtitle line with its audio span.
Directory Structure:
dataset/
├── audio/
│ ├── file1.wav
│ ├── file2.wav
├── transcripts/
│ ├── file1.txt
│ ├── file2.txt
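On top of this layout, a single manifest file mapping each clip to its transcript makes the dataset easier to feed into training pipelines. A sketch of one possible convention; the filename metadata.csv is a common but arbitrary choice.
import csv
import os

# Build a manifest pairing each audio clip with its transcript text.
with open('dataset/metadata.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['audio_path', 'text'])
    for name in sorted(os.listdir('dataset/transcripts')):
        stem = os.path.splitext(name)[0]
        with open(f'dataset/transcripts/{name}', encoding='utf-8') as f:
            writer.writerow([f'audio/{stem}.wav', f.read().strip()])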
Manual Verification: Listen to a random sample of segments and check them against their transcripts for accuracy and alignment.
By following these steps, you can collect, process, and organize both audio and text data for training speech recognition and text-to-speech models in Azerbaijani. This ensures you have a robust dataset that covers various aspects of the language.
Training speech recognition and text-to-speech models for Azerbaijani from scratch requires a comprehensive dataset of high-quality audio with corresponding text transcriptions. Here are the steps to obtain or create such a dataset:
Collect Existing Datasets:
Create Your Own Dataset:
Public Domain and Open-Source Content:
Web Scraping:
Partner with Institutions:
Data Preparation:
Annotation Tools:
Quality Assurance: