Here is a detailed breakdown of the web scraping approach to collect Azerbaijani speech and text data for training speech recognition and text-to-speech models:
Identify websites that contain Azerbaijani speech content with text transcriptions or subtitles, such as YouTube channels with Azerbaijani subtitles, news websites, audiobooks, and podcasts.
Use web scraping tools and libraries such as BeautifulSoup, Scrapy, Selenium, and youtube-dl (the actively maintained fork yt-dlp accepts the same flags).
Install the necessary Python packages (pydub is included here because it is used later for audio segmentation):
pip install beautifulsoup4 scrapy selenium youtube-dl pydub
Step-by-Step Process (YouTube Videos):
Use youtube-dl to download subtitles:
youtube-dl --write-sub --sub-lang az --skip-download <video_url>
This command downloads Azerbaijani subtitles without downloading the video.
Use youtube-dl to download audio:
youtube-dl --extract-audio --audio-format wav <video_url>
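A small batch wrapper, sketched with Python's subprocess module, for running both commands over a list of video URLs. The URL list is a placeholder; the --convert-subs srt option (which requires ffmpeg) asks youtube-dl to convert its usual VTT output to SRT so the parser below can read it.
import subprocess

video_urls = ["https://www.youtube.com/watch?v=..."]  # placeholder list

for url in video_urls:
    # Download Azerbaijani subtitles only, converted to SRT.
    subprocess.run(["youtube-dl", "--write-sub", "--sub-lang", "az",
                    "--skip-download", "--convert-subs", "srt", url], check=True)
    # Download the audio track as WAV.
    subprocess.run(["youtube-dl", "--extract-audio", "--audio-format", "wav", url],
                   check=True)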
Example Code to Parse Subtitles:
import re

def parse_srt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    # Each SRT cue is: sequence number, "start --> end" timecodes, then one or
    # more text lines ending at a blank line. DOTALL lets (.+?) span lines, so
    # multi-line cues are captured (the original single-line regex missed them).
    pattern = r'\d+\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(.+?)(?:\n\n|\Z)'
    parsed_subtitles = []
    for start_time, end_time, text in re.findall(pattern, content, re.DOTALL):
        parsed_subtitles.append((start_time, end_time, text.replace('\n', ' ')))
    return parsed_subtitles
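A quick usage sketch; the filename video.az.srt is hypothetical, since youtube-dl names subtitle files after the video title and ID.
subtitles = parse_srt('video.az.srt')  # hypothetical filename
for start, end, text in subtitles[:3]:
    print(start, end, text)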
Step-by-Step Process (Static Websites):
Identify Static Content: Find news websites, audiobooks, and podcasts in Azerbaijani.
Scrape HTML: Use requests and BeautifulSoup to scrape static HTML content.
import requests
from bs4 import BeautifulSoup

url = "http://example.com/azerbaijani-news"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Example: Extracting headlines and their URLs
headlines = soup.find_all('h2', class_='headline')
for headline in headlines:
    text = headline.get_text()
    link = headline.find('a')['href']
    print(f"Headline: {text}, URL: {link}")
Download Audio: If the website contains embedded audio, extract and download it using direct links.
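A minimal sketch of that extraction, assuming the page embeds audio through standard <audio>/<source> tags; the URL is a placeholder.
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://example.com/azerbaijani-audiobooks"  # placeholder URL
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

os.makedirs("audio", exist_ok=True)
# Matches <audio src="..."> as well as <audio><source src="..."></audio>
for tag in soup.find_all(['audio', 'source'], src=True):
    audio_url = urljoin(url, tag['src'])  # resolve relative paths
    with open(os.path.join("audio", audio_url.split('/')[-1]), 'wb') as f:
        f.write(requests.get(audio_url).content)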
Step-by-Step Process (Dynamic Websites):
Use Selenium for Dynamic Content: Automate browser actions to interact with dynamic content.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com/azerbaijani-videos")

# Example: Click a button to load more content
load_more_button = driver.find_element(By.XPATH, "//button[text()='Load More']")
load_more_button.click()

# Extract video links after loading more content
video_links = driver.find_elements(By.XPATH, "//a[contains(@href, '/watch')]")
for link in video_links:
    print(link.get_attribute('href'))

driver.quit()
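One caveat: clicking immediately after page load often fails because the button has not rendered yet. A sketch using Selenium's explicit waits, continuing the session above with the same placeholder XPath:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the button to become clickable before clicking.
load_more_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//button[text()='Load More']"))
)
load_more_button.click()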
After scraping, clean and prepare the data: normalize the text, segment the audio against the subtitle timestamps, and discard noisy or misaligned pairs.
Store the collected data in a structured format, such as paired audio/ and transcripts/ directories (see the directory structure below).
Validate the data by manually checking a subset of the audio and text pairs for accuracy and consistency.
This approach provides a robust method for collecting and preparing a dataset for training speech recognition and text-to-speech models in Azerbaijani.
Subtitles are text representations of the spoken dialogue in videos or audio recordings. They are typically synchronized with the audio, displaying the text on-screen at the appropriate times to match what is being said. Subtitles serve several purposes, including accessibility for viewers who are deaf or hard of hearing, translation for foreign-language audiences, and, for our purposes, providing time-aligned transcripts of speech.
Subtitles usually come in specific file formats, the most common of which are SRT (SubRip) and WebVTT.
The SRT format is one of the most commonly used subtitle formats. It contains plain text with a sequence number, start and end timecodes, and the subtitle text.
Example SRT File:
1
00:00:01,000 --> 00:00:04,000
Hello, how are you?
2
00:00:05,000 --> 00:00:07,000
I'm good, thank you!
The VTT format is commonly used for web-based video players like HTML5 video. It is similar to the SRT format but includes additional features for styling and metadata.
Example VTT File:
WEBVTT
1
00:00:01.000 --> 00:00:04.000
Hello, how are you?
2
00:00:05.000 --> 00:00:07.000
I'm good, thank you!
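Because --write-auto-sub (used below) typically saves WebVTT files, a VTT variant of the parse_srt function shown earlier is useful. This is a minimal sketch that assumes HH:MM:SS.mmm timecodes (some VTT files omit the hours field) and skips any cue settings after the timecode line.
import re

def parse_vtt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    # Timecodes use '.' for milliseconds; cue numbers are optional, so anchor
    # on the timecode line itself and ignore trailing cue settings.
    pattern = r'(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})[^\n]*\n(.+?)(?:\n\n|\Z)'
    cues = re.findall(pattern, content, re.DOTALL)
    return [(start, end, text.replace('\n', ' ')) for start, end, text in cues]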
Downloading Subtitles:
Use youtube-dl to download subtitles from YouTube videos:
youtube-dl --write-auto-sub --sub-lang az --skip-download <video_url>
Parsing Subtitles:
Use the parse_srt function shown earlier (or parse_vtt for WebVTT files) to turn each subtitle file into (start_time, end_time, text) tuples.
Example Code for Segmenting Audio:
from pydub import AudioSegment  # requires ffmpeg or libav to be installed

def segment_audio(audio_file, subtitles):
    audio = AudioSegment.from_file(audio_file)
    segments = []
    for start_time, end_time, text in subtitles:
        # Slice the audio between the subtitle's start and end timecodes.
        start_ms = time_to_ms(start_time)
        end_ms = time_to_ms(end_time)
        segments.append((audio[start_ms:end_ms], text))
    return segments

def time_to_ms(time_str):
    # Converts an SRT timecode "HH:MM:SS,mmm" to milliseconds.
    h, m, s = map(float, time_str.replace(',', '.').split(':'))
    return int((h * 3600 + m * 60 + s) * 1000)
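A usage sketch tying the two functions together; the video.wav/video.az.srt filenames are hypothetical, and the dataset/ layout matches the directory structure described later.
import os

subtitles = parse_srt('video.az.srt')            # hypothetical filenames
segments = segment_audio('video.wav', subtitles)

os.makedirs('dataset/audio', exist_ok=True)
os.makedirs('dataset/transcripts', exist_ok=True)
for i, (segment, text) in enumerate(segments, start=1):
    segment.export(f'dataset/audio/file{i}.wav', format='wav')
    with open(f'dataset/transcripts/file{i}.txt', 'w', encoding='utf-8') as f:
        f.write(text)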
By using subtitles, you can align text with audio accurately, creating a high-quality dataset for training speech recognition and text-to-speech models. Subtitles provide the necessary text transcripts that are crucial for these types of machine learning models.
Here's a more focused breakdown on how to gather and process audio data along with text:
Extract Video Links:
Download Subtitles and Audio: Use youtube-dl to download both audio and subtitles. Commands:
# Download subtitles
youtube-dl --write-auto-sub --sub-lang az --skip-download <video_url>
# Download audio
youtube-dl --extract-audio --audio-format wav <video_url>
Parse Subtitles:
Use the parse_srt function shown earlier on the downloaded subtitle files.
Identify Sources:
Scrape HTML:
Example Code:
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://example.com/azerbaijani-podcasts"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

os.makedirs("audio", exist_ok=True)
audio_links = soup.find_all('a', href=True)
for link in audio_links:
    if link['href'].endswith('.mp3'):
        audio_url = urljoin(url, link['href'])  # resolve relative links
        # Download the audio file
        audio_response = requests.get(audio_url)
        with open(f"audio/{audio_url.split('/')[-1]}", 'wb') as file:
            file.write(audio_response.content)
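For long podcast episodes, streaming the download avoids holding the whole file in memory; a variant of the download step above:
# Stream the response in chunks instead of loading it all at once.
with requests.get(audio_url, stream=True) as r:
    r.raise_for_status()
    with open(f"audio/{audio_url.split('/')[-1]}", 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)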
Use Selenium:
Example Code:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com/azerbaijani-news-videos")

# Example: Click to load more content
load_more_button = driver.find_element(By.XPATH, "//button[text()='Load More']")
load_more_button.click()

# Extract video links
video_links = driver.find_elements(By.XPATH, "//a[contains(@href, '/watch')]")
video_urls = [link.get_attribute('href') for link in video_links]
# Use youtube-dl to download audio and subtitles for each URL, as shown earlier

driver.quit()
Segmentation:
Use the segment_audio and time_to_ms functions shown earlier to pair each subtitle line with its audio span.
Directory Structure:
dataset/
├── audio/
│ ├── file1.wav
│ ├── file2.wav
├── transcripts/
│ ├── file1.txt
│ ├── file2.txt
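On top of this layout, a single manifest file mapping each clip to its transcript makes the dataset easier to feed into training pipelines. A sketch of one possible convention; the filename metadata.csv is a common but arbitrary choice.
import csv
import os

# Build a manifest pairing each audio clip with its transcript text.
with open('dataset/metadata.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['audio_path', 'text'])
    for name in sorted(os.listdir('dataset/transcripts')):
        stem = os.path.splitext(name)[0]
        with open(f'dataset/transcripts/{name}', encoding='utf-8') as f:
            writer.writerow([f'audio/{stem}.wav', f.read().strip()])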
Manual Verification: Listen to a random sample of segments and check them against their transcripts for accuracy and alignment.
By following these steps, you can collect, process, and organize both audio and text data for training speech recognition and text-to-speech models in Azerbaijani. This ensures you have a robust dataset that covers various aspects of the language.
Training speech recognition and text-to-speech models for Azerbaijani from scratch requires a comprehensive dataset of high-quality audio with corresponding text transcriptions. Here are the steps to obtain or create such a dataset:
Collect Existing Datasets:
Create Your Own Dataset:
Public Domain and Open-Source Content:
Web Scraping:
Partner with Institutions:
Data Preparation:
Annotation Tools:
Quality Assurance: