GasimV / Commercial_Projects

This repository showcases my projects from IT companies, government organizations, and other business-related work.

Data Collection #4

Open · GasimV opened this issue 3 months ago

GasimV commented 3 months ago

Training speech recognition and text-to-speech models for Azerbaijani from scratch will require a comprehensive dataset of high-quality audio with corresponding text transcriptions. Here are the steps to obtain or create such a dataset:

  1. Collect Existing Datasets:

    • Common Voice by Mozilla: an open-source, crowdsourced voice dataset; check whether it includes Azerbaijani contributions.
    • OpenSLR: free resources for speech and language research; it may host Azerbaijani datasets or point you to them.
    • Linguistic Data Consortium (LDC): They provide language resources including speech datasets, though access may require a subscription.
  2. Create Your Own Dataset:

    • Crowdsourcing: Use platforms like Amazon Mechanical Turk or Appen to gather native Azerbaijani speakers to record phrases and sentences.
    • Recording Sessions: Organize professional recording sessions with native speakers to create high-quality audio samples.
  3. Public Domain and Open-Source Content:

    • Audiobooks and Podcasts: Source public domain audiobooks and podcasts in Azerbaijani and manually transcribe them or use semi-automated tools to aid in transcription.
    • Broadcasts and Lectures: Collect speeches, news broadcasts, and academic lectures that might be available for free use.
  4. Web Scraping:

    • YouTube: Scrape YouTube for Azerbaijani language content that includes subtitles or closed captions, which can provide aligned text and speech data.
    • Publicly Available Media: Scrape websites, public domain books, and other media in Azerbaijani.
  5. Partner with Institutions:

    • Universities: Collaborate with universities in Azerbaijan that might have linguistic departments or ongoing projects related to language data collection.
    • Government Initiatives: Look for any government initiatives aimed at preserving or promoting the Azerbaijani language.
  6. Data Preparation:

    • Transcription: Ensure high-quality transcription of audio data into text.
    • Segmentation: Segment audio data into smaller, manageable clips paired with their text transcripts.
    • Cleaning: Clean the dataset to remove noise, errors, and inconsistencies.
  7. Annotation Tools:

    • Use annotation tools such as ELAN, Praat, or custom-built tools for aligning audio with text.
  8. Quality Assurance:

    • Validation: Validate the dataset by checking the alignment of audio and text through manual or automated means; a rough automated sanity check is sketched after this list.
    
    • Diversity: Ensure the dataset covers diverse accents, dialects, and speaking styles within the Azerbaijani language.
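
A minimal sketch of such an automated check, assuming clips are stored as WAV files paired with their transcripts (the 4-25 characters-per-second bounds are just assumed starting points to tune):

import wave

def flag_suspect_pairs(pairs, min_cps=4.0, max_cps=25.0):
    # pairs: iterable of (path_to_wav, transcript_text)
    suspects = []
    for wav_path, transcript in pairs:
        with wave.open(wav_path, 'rb') as wav:
            duration_s = wav.getnframes() / wav.getframerate()
        # Characters per second of speech; values far outside the expected
        # range usually indicate a misaligned or mistranscribed clip.
        cps = len(transcript) / duration_s if duration_s > 0 else float('inf')
        if not (min_cps <= cps <= max_cps):
            suspects.append((wav_path, round(cps, 1)))
    return suspects
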
GasimV commented 3 months ago

Here is a detailed breakdown of the web scraping approach to collect Azerbaijani speech and text data for training speech recognition and text-to-speech models:

1. Identify Target Websites

Identify websites that contain Azerbaijani speech content with text transcriptions or subtitles, such as:

  • YouTube channels and playlists with Azerbaijani subtitles or closed captions (news, lectures, interviews)
  • Azerbaijani news portals and broadcasters that publish video or audio alongside article text
  • Public domain audiobook and podcast sites with Azerbaijani content

2. Web Scraping Tools

Use web scraping tools and libraries such as:

  • requests and BeautifulSoup for static HTML pages
  • Scrapy for larger crawls
  • Selenium for dynamic, JavaScript-heavy pages
  • youtube-dl (or its maintained fork yt-dlp, which accepts the same flags) for YouTube audio and subtitles

3. Setting Up the Environment

Install the necessary Python packages:

pip install requests beautifulsoup4 scrapy selenium youtube-dl

4. Scraping YouTube

Step-by-Step Process:

  1. Extract Video Links: Use the YouTube Data API or a manual search to collect URLs of Azerbaijani videos.
  2. Download Subtitles: Use youtube-dl to download subtitles.
    youtube-dl --write-sub --sub-lang az --skip-download <video_url>

    This command downloads Azerbaijani subtitles without downloading the video.

  3. Download Audio: Use youtube-dl to download audio.
    youtube-dl --extract-audio --audio-format wav <video_url>
  4. Parsing Subtitles: Parse the downloaded subtitle files (.vtt or .srt) to extract text and timestamps.

Example Code to Parse Subtitles:

import re

def parse_srt(file_path):
    """Parse an SRT file into (start_time, end_time, text) tuples."""
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Capture the timestamps and the (possibly multi-line) cue text;
    # a cue ends at a blank line or at the end of the file.
    pattern = r'\d+\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?=\n\s*\n|\Z)'
    subtitles = re.findall(pattern, content, re.DOTALL)
    parsed_subtitles = []

    for start_time, end_time, text in subtitles:
        # Collapse multi-line cue text into a single line.
        parsed_subtitles.append((start_time, end_time, text.replace('\n', ' ').strip()))

    return parsed_subtitles
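
For instance, assuming the subtitles were saved as video.az.srt (the file name is just an example), the parser could be used like this:

subs = parse_srt('video.az.srt')
for start, end, text in subs[:3]:
    print(start, end, text)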

5. Scraping Static Websites

Step-by-Step Process:

  1. Identify Static Content: Find news websites, audiobooks, and podcasts in Azerbaijani.

  2. Scrape HTML: Use requests and BeautifulSoup to scrape static HTML content.

    import requests
    from bs4 import BeautifulSoup
    
    url = "http://example.com/azerbaijani-news"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Example: Extracting headlines and their URLs
    headlines = soup.find_all('h2', class_='headline')
    for headline in headlines:
        text = headline.get_text()
        link = headline.find('a')['href']
        print(f"Headline: {text}, URL: {link}")
  3. Download Audio: If the website contains embedded audio, extract and download it using direct links.

6. Scraping Dynamic Websites

Step-by-Step Process:

  1. Use Selenium for Dynamic Content: Automate browser actions to interact with dynamic content.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("http://example.com/azerbaijani-videos")

    # Example: Click a button to load more content
    load_more_button = driver.find_element(By.XPATH, "//button[text()='Load More']")
    load_more_button.click()

    # Extract video links after loading more content
    # (in practice, wait for the new elements to appear, e.g. with WebDriverWait)
    video_links = driver.find_elements(By.XPATH, "//a[contains(@href, '/watch')]")
    for link in video_links:
        print(link.get_attribute('href'))

    driver.quit()

7. Data Cleaning and Preparation

After scraping, clean and prepare the data:

  1. Text Cleaning: Remove special characters and leftover HTML tags, and normalize the text (a minimal cleaning sketch follows this list).
  2. Align Audio and Text: Ensure timestamps in subtitles align correctly with the audio segments.
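
A minimal text-cleaning sketch, assuming the transcripts are plain strings that may still contain HTML tags and stray punctuation (the exact rules should be adapted to the Azerbaijani alphabet and to your models' needs):

import html
import re

def clean_transcript(text):
    # Decode HTML entities and strip any remaining tags.
    text = html.unescape(text)
    text = re.sub(r'<[^>]+>', ' ', text)
    # Keep letters (including Azerbaijani characters such as ə, ğ, ı, ö, ş, ü, ç),
    # digits, apostrophes, and basic punctuation; drop everything else.
    text = re.sub(r"[^\w\s'.,!?-]", ' ', text)
    # Collapse repeated whitespace and trim.
    return re.sub(r'\s+', ' ', text).strip()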

8. Storing and Organizing Data

Store the collected data in a structured format: for example, one directory of audio clips, one directory of matching transcripts, and a manifest file that maps each clip to its text and source URL.
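
A minimal sketch of such a manifest, assuming each clip has already been saved as a WAV file (the metadata.csv name and its columns are just an assumed convention):

import csv

def write_manifest(pairs, manifest_path='metadata.csv'):
    # pairs: iterable of (wav_filename, transcript) tuples
    with open(manifest_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['file', 'text'])
        for wav_filename, transcript in pairs:
            writer.writerow([wav_filename, transcript])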

9. Quality Assurance

Validate the data by manually checking a subset of the audio and text pairs for accuracy and consistency.

This approach provides a robust method for collecting and preparing a dataset for training speech recognition and text-to-speech models in Azerbaijani.

GasimV commented 3 months ago

Subtitles are text representations of the spoken dialogue in videos or audio recordings. They are typically synchronized with the audio, displaying the text on-screen at the appropriate times to match what is being said. Subtitles serve several purposes, including:

  1. Accessibility: They help people who are deaf or hard of hearing to understand the audio content.
  2. Language Learning: They assist people learning a new language by providing a text version of the spoken words.
  3. Translation: Subtitles can translate spoken dialogue into another language, making the content accessible to a wider audience.
  4. Context Clarity: They provide clarity in noisy environments or when the audio quality is poor.

Subtitles usually come in specific file formats, the most common of which are:

1. SRT (SubRip Subtitle) Format

The SRT format is one of the most commonly used subtitle formats. It contains plain text with a sequence number, start and end timecodes, and the subtitle text.

Example SRT File:

1
00:00:01,000 --> 00:00:04,000
Hello, how are you?

2
00:00:05,000 --> 00:00:07,000
I'm good, thank you!

2. VTT (WebVTT) Format

The VTT format is commonly used for web-based video players like HTML5 video. It is similar to the SRT format but includes additional features for styling and metadata.

Example VTT File:

WEBVTT

1
00:00:01.000 --> 00:00:04.000
Hello, how are you?

2
00:00:05.000 --> 00:00:07.000
I'm good, thank you!

Steps to Use Subtitles for Collecting Data

  1. Downloading Subtitles:

    • Use tools like youtube-dl to download subtitles from YouTube videos.
    • Example command to download Azerbaijani subtitles:
      youtube-dl --write-auto-sub --sub-lang az --skip-download <video_url>
  2. Parsing Subtitles:

    • After downloading, parse the subtitle files to extract text and timestamps.
    • Use regular expressions or a subtitle parsing library to read the files; both approaches are shown below.

Example Python Code for Parsing SRT Subtitles:

import re

def parse_srt(file_path):
    """Parse an SRT file into (start_time, end_time, text) tuples."""
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Capture the timestamps and the (possibly multi-line) cue text;
    # a cue ends at a blank line or at the end of the file.
    pattern = r'\d+\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?=\n\s*\n|\Z)'
    subtitles = re.findall(pattern, content, re.DOTALL)
    parsed_subtitles = []

    for start_time, end_time, text in subtitles:
        # Collapse multi-line cue text into a single line.
        parsed_subtitles.append((start_time, end_time, text.replace('\n', ' ').strip()))

    return parsed_subtitles
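
As an alternative to hand-written regexes, a subtitle parsing library can do the same job; a sketch assuming the pysrt package (pip install pysrt):

import pysrt

def parse_srt_with_pysrt(file_path):
    subs = pysrt.open(file_path, encoding='utf-8')
    # str(item.start) / str(item.end) give 'HH:MM:SS,mmm', matching the regex version above.
    return [(str(item.start), str(item.end), item.text.replace('\n', ' ')) for item in subs]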

  3. Aligning Audio and Subtitles:

    • Use the timestamps from the subtitles to segment the corresponding audio file.
    • Ensure that each text segment aligns accurately with the spoken words in the audio.

Example Code for Segmenting Audio:

from pydub import AudioSegment

def segment_audio(audio_file, subtitles):
    """Cut the audio into one clip per subtitle cue, paired with the cue text."""
    audio = AudioSegment.from_file(audio_file)
    segments = []

    for start_time, end_time, text in subtitles:
        start_ms = time_to_ms(start_time)
        end_ms = time_to_ms(end_time)
        segment = audio[start_ms:end_ms]  # pydub slices audio by milliseconds
        segments.append((segment, text))

    return segments

def time_to_ms(time_str):
    # Accepts both SRT ('HH:MM:SS,mmm') and VTT ('HH:MM:SS.mmm') timestamps.
    h, m, s = map(float, time_str.replace(',', '.').split(':'))
    return int((h * 3600 + m * 60 + s) * 1000)
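
Assuming the audio was downloaded as audio.wav and its subtitles parsed with parse_srt above (both file names are just examples), the segments can then be written out clip by clip:

import os

os.makedirs('clips', exist_ok=True)

subtitles = parse_srt('video.az.srt')
segments = segment_audio('audio.wav', subtitles)

for i, (segment, text) in enumerate(segments):
    segment.export(f'clips/clip_{i:06d}.wav', format='wav')
    with open(f'clips/clip_{i:06d}.txt', 'w', encoding='utf-8') as f:
        f.write(text)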

By using subtitles, you can align text with audio accurately, creating a high-quality dataset for training speech recognition and text-to-speech models. Subtitles provide the necessary text transcripts that are crucial for these types of machine learning models.

GasimV commented 3 months ago

Here's a more focused breakdown of how to gather and process audio data along with the corresponding text:

1. Scraping YouTube for Audio and Subtitles

Extract Video Links:

  • Use the YouTube Data API or manual searches to collect URLs of Azerbaijani-language videos, preferably ones with subtitles or closed captions (a hedged API sketch follows).
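
A minimal sketch using the YouTube Data API v3 (assumes the google-api-python-client package and an API key; the search query is just an example):

from googleapiclient.discovery import build

API_KEY = 'YOUR_API_KEY'  # assumed placeholder
youtube = build('youtube', 'v3', developerKey=API_KEY)

# Search for Azerbaijani-language videos that have closed captions.
response = youtube.search().list(
    part='id,snippet',
    q='xəbərlər',              # example query ("news" in Azerbaijani)
    type='video',
    relevanceLanguage='az',
    videoCaption='closedCaption',
    maxResults=50,
).execute()

video_urls = [f"https://www.youtube.com/watch?v={item['id']['videoId']}"
              for item in response['items']]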

Download Subtitles and Audio:

  • For each video URL, use youtube-dl to fetch the Azerbaijani subtitle track and the audio as WAV, as shown in the commands below.

Commands:

# Download subtitles
youtube-dl --write-auto-sub --sub-lang az --skip-download <video_url>

# Download audio
youtube-dl --extract-audio --audio-format wav <video_url>

Parse Subtitles:

  • Extract text and timestamps from the downloaded .srt or .vtt files, as in the example below.

Example Code:

import re

def parse_srt(file_path):
    """Parse an SRT file into (start_time, end_time, text) tuples."""
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Capture the timestamps and the (possibly multi-line) cue text;
    # a cue ends at a blank line or at the end of the file.
    pattern = r'\d+\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?=\n\s*\n|\Z)'
    subtitles = re.findall(pattern, content, re.DOTALL)
    parsed_subtitles = []

    for start_time, end_time, text in subtitles:
        # Collapse multi-line cue text into a single line.
        parsed_subtitles.append((start_time, end_time, text.replace('\n', ' ').strip()))

    return parsed_subtitles

2. Scraping Static Websites with Audio Content

Identify Sources:

  • Find Azerbaijani news sites, podcast pages, and audiobook collections that embed or link to downloadable audio files.

Scrape HTML:

  • Use requests and BeautifulSoup to find and download the audio links, as in the example below.

Example Code:

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://example.com/azerbaijani-podcasts"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

os.makedirs('audio', exist_ok=True)

audio_links = soup.find_all('a', href=True)
for link in audio_links:
    if link['href'].endswith('.mp3'):
        # Resolve relative links against the page URL before downloading.
        audio_url = urljoin(url, link['href'])
        audio_response = requests.get(audio_url)
        with open(f"audio/{audio_url.split('/')[-1]}", 'wb') as file:
            file.write(audio_response.content)

3. Scraping Dynamic Websites with Selenium

Use Selenium:

  • Automate a browser to load JavaScript-rendered pages and collect video or audio links, as in the example below.

Example Code:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com/azerbaijani-news-videos")

# Example: Click to load more content
load_more_button = driver.find_element(By.XPATH, "//button[text()='Load More']")
load_more_button.click()

# Extract video links
video_links = driver.find_elements(By.XPATH, "//a[contains(@href, '/watch')]")
for link in video_links:
    video_url = link.get_attribute('href')
    # Use youtube-dl to download audio and subtitles as shown earlier

driver.quit()

4. Aligning Audio and Text Data

Segmentation:

  • Use the subtitle timestamps to cut the audio into short clips, each paired with its transcript, as in the example below.

Example Code:

from pydub import AudioSegment

def segment_audio(audio_file, subtitles):
    audio = AudioSegment.from_file(audio_file)
    segments = []

    for start_time, end_time, text in subtitles:
        start_ms = time_to_ms(start_time)
        end_ms = time_to_ms(end_time)
        segment = audio[start_ms:end_ms]
        segments.append((segment, text))

    return segments

def time_to_ms(time_str):
    h, m, s = map(float, time_str.replace(',', '.').split(':'))
    return int((h * 3600 + m * 60 + s) * 1000)

5. Storing and Organizing Data

Directory Structure:

  • For example, keep the WAV clips in an audio/ directory, the transcripts in a transcripts/ directory (or a single metadata.csv manifest), and record each clip's source URL for traceability.

6. Quality Assurance

Manual Verification:

  • Listen to a random sample of clips and compare them against their transcripts to confirm alignment and transcription quality.

By following these steps, you can collect, process, and organize both audio and text data for training speech recognition and text-to-speech models in Azerbaijani. This ensures you have a robust dataset that covers various aspects of the language.