Phonetic Similarity - Githubissues

StephanAkkerman commented 1 year ago

We want to be able to fill in a word + language and get their IPA translation from it. This specific task is called "Grapheme-to-phoneme conversion" (G2P).

Paper's implementation uses Scala: https://github.com/msavva/transphoner Transphoner Paper: https://graphics.stanford.edu/projects/transphoner/TransPhoner.pdf

Maybe we can use this Python code for generating the phonemes: https://github.com/xinjli/transphone

The final goal is to be able to input an English word and output a foreign word that sounds similar to it.

[x] Convert word to IPA equivalent
[x] Similarity measure between 2 foreign IPA words
[ ] Find most similar word(s) in a foreign language based on English word (how?)

Idea:

Create a vector space of English words and place the foreign word in there and find the nearest (this could work for both phonetic and semantic similarity)

Use data/eng_latn_us_broad.tsv for the word + ipa transcription
Convert IPA to features and make it a vector (embedding)
Use something like RNN to learn the sequence
Convert foreign word to IPA and find the top 5 most similar sounding

Related papers on phonetic embeddings:

Phonetic features:

Related projects:

Phonetic word embeddings:

Datasets:

Word + IPA datasets:

StephanAkkerman commented 1 year ago

Outline of the process:

StephanAkkerman commented 1 year ago

Need to have a dataset of IPA pronunciations for the languages that we want to learn.
Use https://github.com/xinjli/transphone to do word/graphene to phoneme (IPA), for the language that we want to learn, assume we always use English as base language

StephanAkkerman commented 1 year ago

Phonetic Similarity Use ALINE for phonetic similarity

We use a Levenshtein distance, now with a weight matrix defining the cost of substituting one phone for another, to compute the phonetic distance between two words. We search for an alignment (mapping of phones in one word to those of another) such that the Levenshtein distance is minimized.

We simplify the ALINE algorithm by omitting phone expansions (i.e. no special costs for matching two phones to one, or one phone to two). We handle matching of syllable breaks by adding a constant match score Csep = 20, and a mismatch cost Cskip = −10, as defined by ALINE. The phone feature set used by ALINE is not complete—a few phones are considered equivalent (e.g. /i/, /ɪ/), so we add an extra weight for preferring exact phone matches.

C_skip is defined in ALINE, but not C_sep. Maybe see the Scala implementation for this

ALINE output values are divided by 100 to make them comparable with similarities from other components.

Other methods for calculating

Metaphone
Doublemetaphone and example here

StephanAkkerman commented 1 year ago

Could also consider scraping wiktionary for all IPA pronunciations: https://en.wiktionary.org/wiki/Category:Indonesian_terms_with_IPA_pronunciation

StephanAkkerman commented 1 year ago

https://linguistics.stackexchange.com/questions/43492/finding-phonetic-similarity-of-names-in-different-languages

First, let's create a phoneme similarity function as you described. For this purpose, we will use a dictionary to map phonemes to their categories, and then define a function to calculate the similarity based on these categories.

# Create phoneme categories
phoneme_categories = {
    "aehijouwæɑɒɔəɛɜɪʊʌʔ": 0,  # vowels, semivowels, and weak consonants
    "bfpv": 1,  # labial consonants
    "ksxzɡʃʒ": 2,  # velars and sibilant consonants
    "dtðθ": 3,  # dental consonants
    "l": 4,  # lateral consonant
    "mnŋ": 5,  # nasals
    "ɹ": 6,  # rhotic
}

# Flatten the dictionary to map each phoneme to its category
phoneme_to_category = {
    phoneme: category
    for category, phonemes in phoneme_categories.items()
    for phoneme in phonemes
}

def phoneme_similarity(phoneme1, phoneme2):
    """Calculates the similarity between two phonemes."""
    if phoneme1 == phoneme2:
        return 1
    elif phoneme_to_category.get(phoneme1) == phoneme_to_category.get(phoneme2):
        return 0.5
    else:
        return 0

Next, we'll use this phoneme similarity function to calculate the similarity between two words. We'll use a dynamic programming approach similar to the Longest Common Subsequence problem.

def word_similarity(word1, word2):
    """Calculates the similarity between two words based on their phonemes."""
    len1 = len(word1)
    len2 = len(word2)
    match_table = [[0] * (len2 + 1) for _ in range(len1 + 1)]
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            match_table[i][j] = max(
                match_table[i-1][j-1] + phoneme_similarity(word1[i-1], word2[j-1]),
                match_table[i-1][j],
                match_table[i][j-1]
            )
    return match_table[len1][len2] / (len1 + len2) * 2.0

You can now use word_similarity to calculate the similarity between two words, given their phoneme sequences. For example, to compare the words "hello" and "world", you could call word_similarity('hɛ.ˈloʊ', 'wɜɹld').

Remember, this is a basic implementation, and there are many ways to improve and refine it, as you've mentioned. For example, handling the words' consonant and vowel structures separately, handling swapped phonemes, or using a more sophisticated phoneme similarity function.

StephanAkkerman commented 1 year ago

To measure the similarity between two IPA strings, you could use a string distance metric such as the Levenshtein distance, which is simple to understand and implement. It counts the minimum number of edits (insertions, deletions, substitutions) needed to transform one string into the other.

However, if you want to account for the phonetic similarity between different phonemes, you could consider using a phonetic alignment algorithm like Needleman-Wunsch or Smith-Waterman, with a substitution matrix that reflects the phonetic similarity between different phonemes.

Keep in mind that this is a relatively simple approach and might not account for all the nuances of phonetic similarity. Also, because IPA is a broad phonetic transcription system, the same sound can be transcribed differently depending on the level of detail you want to include, and different languages might use different symbols to represent similar sounds. Therefore, you might need to preprocess your IPA strings to ensure they use a consistent set of symbols.

Here is a simplified example of how you might use the Levenshtein distance to compare two IPA strings in Python, using the python-Levenshtein package:

import Levenshtein as lev

ipa1 = "hɛ.ˈloʊ"  # English "hello"
ipa2 = "hə.ˈlo"  # Hypothetical Indonesian word

distance = lev.distance(ipa1, ipa2)

print(distance)

Remember to install the package with pip install python-Levenshtein if you haven't already.

The output is the minimum number of edits needed to transform ipa1 into ipa2. The lower the number, the more similar the two strings are.

StephanAkkerman commented 1 year ago

Sure, the Needleman-Wunsch algorithm is a standard algorithm for sequence alignment in bioinformatics, typically used for DNA or protein sequence alignment. The idea is to find the best alignment (including gaps) between two sequences, given a substitution matrix that scores the similarity between each pair of elements.

In the case of IPA strings, you could define a substitution matrix that scores the phonetic similarity between different phonemes. This would be a complex task in itself, as you would need to define a similarity score for each pair of IPA symbols, but let's assume you have such a matrix.

Here's a simple implementation of the Needleman-Wunsch algorithm in Python, adapted to work with strings of IPA symbols:

import numpy as np

def needleman_wunsch(seq1, seq2, sub_matrix, gap_penalty=-1):
    """Performs the Needleman-Wunsch sequence alignment."""
    n, m = len(seq1), len(seq2)

    # Initialize score matrix with zeros
    score_matrix = np.zeros((n+1, m+1))

    # Initialize first row and column with gap penalty
    score_matrix[0, :] = np.arange(m+1) * gap_penalty
    score_matrix[:, 0] = np.arange(n+1) * gap_penalty

    # Fill in the rest of the matrix
    for i in range(1, n+1):
        for j in range(1, m+1):
            match = score_matrix[i-1, j-1] + sub_matrix[seq1[i-1], seq2[j-1]]
            delete = score_matrix[i-1, j] + gap_penalty
            insert = score_matrix[i, j-1] + gap_penalty
            score_matrix[i, j] = max(match, delete, insert)

    return score_matrix[n, m]

# Example usage
ipa1 = "hɛ.ˈloʊ"  # English "hello"
ipa2 = "hə.ˈlo"  # Hypothetical Indonesian word

# Define a simple substitution matrix where identical symbols score 1
# and different symbols score -1
sub_matrix = {
    a: {b: 1 if a == b else -1 for b in set(ipa1 + ipa2)}
    for a in set(ipa1 + ipa2)
}

score = needleman_wunsch(ipa1, ipa2, sub_matrix)

print(score)

This code performs a simple global sequence alignment of the two IPA strings, scoring 1 for identical symbols and -1 for different symbols. It also includes a gap penalty of -1. The output is the maximum score of any alignment of the two strings.

Please note that this is a basic example and there are many ways to improve and refine this approach, such as using a more complex substitution matrix, adjusting the gap penalty, or implementing local alignment (Smith-Waterman) instead of global alignment (Needleman-Wunsch). Also, this simple implementation doesn't actually construct the alignment itself, just the score of the optimal alignment. Constructing the alignment would require additional code to backtrack through the score matrix.

StephanAkkerman commented 1 year ago

nltk's ALINE seems to run into a lot of issues, because some characters do not exist in its feature matrix

StephanAkkerman commented 1 year ago

Could try https://abydos.readthedocs.io/en/latest/abydos.distance.html#abydos.distance.ALINE and add the newer / missing IPA characters as a dict to it. Seems like an outdated library...

StephanAkkerman commented 1 year ago

Could copy source-code of https://abydos.readthedocs.io/en/latest/_modules/abydos/distance/_aline.html#ALINE and add missing characters to it.

StephanAkkerman commented 1 year ago

See https://github.com/msavva/transphoner/tree/553c0d5923af1d5055bbccb34d8cbbf0fdd1d8ef/libtransphone/src/main/scala/org/babysherlock/transphoner/phonetics of what to do with c_skip etc.

StephanAkkerman commented 1 year ago

Rewrite NLTK's or abydos' ALINE code and use scala implementation as guidance. Also make sure all (new) IPA characters are supported, including breaks. See Wikipedia page for the possible characters

StephanAkkerman commented 1 year ago

Also include special characters found in https://github.com/msavva/transphoner/blob/master/libtransphone/src/main/scala/org/babysherlock/transphoner/phonetics/Phone.scala#L218

StephanAkkerman commented 1 year ago

Also use files found in folder "data for transphoner"

StephanAkkerman commented 4 months ago

https://www.reddit.com/r/linguistics/comments/ovewb3/is_there_a_tool_that_can_convert_a_language_to/

StephanAkkerman commented 4 months ago

https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/tts/g2p.html# https://github.com/NVIDIA/NeMo/tree/main/examples/tts/g2p

StephanAkkerman commented 4 months ago

https://github.com/lingjzhu/CharsiuG2P https://huggingface.co/charsiu/g2p_multilingual_byT5_small_100

StephanAkkerman commented 4 months ago

import panphon.distance
dst = panphon.distance.Distance().feature_edit_distance
# add all English words here
WORDS = ["closing", "cucci", "gnocchi", "kissing"]
closest_word = min(WORDS, key=lambda x: dst("kucing", x))
# cucci
print(closest_word)

Better approach: embed all words into vectors and then perform maximum-inner-product-search.

StephanAkkerman / FluentAI

Phonetic Similarity #2