cldf-clts / soundvectors

MIT License

Testing and Evaluating #7

Closed LinguList closed 3 months ago

LinguList commented 6 months ago

We need clear unit tests in a tests directory. There are plenty of examples in the small pylogeny packages; I think those should suffice for starters.

LinguList commented 6 months ago

To evaluate, what about MDS plots that show the distribution of sounds across a couple of different languages? The crucial argument for such a package is distinguishability: it should capture the important contrasts, so that distinct sounds in a language do not end up with identical vectors. So one should check to what degree this promise is fulfilled. This can be done with NorthEuraLex data in CLDF, or with Lexibank. I can provide examples; sound features are easily accessed from a dataset like NorthEuraLex with cltoolkit!
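The MDS idea can be sketched as follows. This is a minimal illustration with toy binary feature vectors (the values are hypothetical; real vectors would come from the soundvectors package), using only numpy and classical MDS via double centering of the squared distance matrix:

```python
import numpy as np

# Toy feature vectors for a handful of sounds (hypothetical values,
# only meant to illustrate the evaluation idea).
sounds = {
    "p": [1, 0, 0, 0],
    "b": [1, 1, 0, 0],
    "t": [0, 0, 1, 0],
    "d": [0, 1, 1, 0],
    "a": [0, 0, 0, 1],
}
labels = list(sounds)
X = np.array([sounds[s] for s in labels], dtype=float)

# Pairwise Euclidean distances between sound vectors.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Classical MDS: double-center the squared distance matrix and use the
# top-2 eigenvectors as 2D plotting coordinates.
n = len(labels)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1][:2]
coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

for label, (x, y) in zip(labels, coords):
    print(f"{label}: ({x:.2f}, {y:.2f})")
```

The resulting coordinates could then be scattered per language to see whether distinct sounds spread out or collapse onto the same point.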

arubehn commented 6 months ago

As previously stated in private communication, I have unit tests in mind and will implement them soon.

MDS plots sound like a good idea; I was also thinking in the direction of heat maps that show the cosine similarity between common sounds.
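The heat-map idea could look roughly like this (a sketch with hypothetical feature vectors; in practice the vectors would be produced by soundvectors, and the matrix would go into a proper plotting library instead of a text dump):

```python
import numpy as np

# Hypothetical feature vectors for a few common sounds (illustrative only).
vectors = {
    "p": np.array([1.0, -1.0, -1.0, 1.0]),
    "b": np.array([1.0, 1.0, -1.0, 1.0]),
    "m": np.array([1.0, 1.0, 1.0, 1.0]),
    "a": np.array([-1.0, 1.0, 1.0, -1.0]),
}
labels = list(vectors)
X = np.stack([vectors[s] for s in labels])

# Cosine similarity matrix: normalize rows, then take dot products.
norms = np.linalg.norm(X, axis=1, keepdims=True)
sim = (X / norms) @ (X / norms).T

# Print a simple text "heat map".
print("   " + "  ".join(f"{s:>5}" for s in labels))
for s, row in zip(labels, sim):
    print(f"{s:>2} " + "  ".join(f"{v:5.2f}" for v in row))
```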

It would be great if you could give me an example of how to access sound features with cltoolkit!

LinguList commented 6 months ago

There's a series of blog posts introducing some aspects: https://calc.hypotheses.org/4266

LinguList commented 6 months ago

Your concrete case:

```shell
git clone git@github.com:lexibank/northeuralex
pip install cltoolkit
cd northeuralex
```

```python
from cltoolkit import Wordlist
from pyclts import CLTS
from pycldf import Dataset

# Load the NorthEuraLex CLDF dataset, with BIPA as transcription system.
wl = Wordlist([Dataset.from_metadata("cldf/cldf-metadata.json")], ts=CLTS().bipa)
for language in wl.languages:
    for sound in language.sound_inventory.sounds:
        print(sound.name)
```

arubehn commented 6 months ago
arubehn commented 6 months ago

Thanks a lot!

I have now run an analysis on NorthEuraLex, and the current system is fully distinctive for the sound (not phoneme!) inventories of 72 of 107 languages. I am emphasizing sound inventories since the NorthEuraLex transcriptions are phonetic, not phonemic, and thus include non-phonemic distinctions such as stops being unreleased. 99 of the 107 languages have two or fewer non-distinguishable sounds, which seems pretty reasonable to me, especially considering the equivalence classes.
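The metric can be sketched roughly like this. It is a simplified stand-in for what examples/coverage.py computes, with a toy sound-to-vector mapping (all names and values hypothetical): group each inventory's sounds by their vector and count inventories without collisions:

```python
from collections import defaultdict

# Toy sound-to-vector mapping (hypothetical; real vectors would come from
# soundvectors). "t" and "t̚" collide on purpose, mimicking a phonetic
# detail such as unreleased stops not being distinguished.
to_vec = {
    "p": (1, 0, 0), "b": (1, 1, 0), "t": (0, 0, 1),
    "t̚": (0, 0, 1), "a": (0, 1, 1),
}

inventories = {
    "lang1": ["p", "b", "a"],        # fully distinctive
    "lang2": ["p", "t", "t̚", "a"],  # one collision: t vs. t̚
}

fully_distinctive = 0
for lang, inventory in inventories.items():
    by_vec = defaultdict(list)
    for sound in inventory:
        by_vec[to_vec[sound]].append(sound)
    confused = [group for group in by_vec.values() if len(group) > 1]
    if not confused:
        fully_distinctive += 1
    for group in confused:
        print(f"{lang}: confused sounds {group}")

print(f"{fully_distinctive} of {len(inventories)} inventories fully distinctive")
```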

If you run examples/coverage.py, you will get the exact metrics and all the confused sounds per language. I will discuss those in detail in a separate issue.

LinguList commented 6 months ago

I should emphasize that I do not believe the distinction between phonology and phonetics really holds, apart from some hard-core cases that end up in schoolbooks. There are too many cases where oppositions can never be tested, as in most SEA languages, where you cannot find a good reason to give the same "phoneme" status to final unreleased stops as to voiceless initial stops (as is often done). So when I say phoneme inventory, I mean the sounds in a language, and it is genuinely useful to include both versions of German ch in such inventories.

LinguList commented 6 months ago

The results are already interesting. We would want to see which sounds are non-distinguishable, and then discuss whether we need to maintain those contrasts.

LinguList commented 6 months ago

But you are perfectly right! The example is very nice. Most confused sounds are clear cases where an overly strict phonetic annotation gives the impression of missed phonological distinctions that never occur in the data.

However, the languages under Slavic influence all have -- potentially phonemic -- distinctions between a palatal and a non-palatal series, and these are not rendered. So we might want an additional test of distinguishability, or distinctive force.

I was thinking about this before: if we manage to break out CV(C) instances of the data, e.g., using LingPy's simple syllabification algorithm (or the one we are now developing with @justalingwist), one could check how many syllable pairs would be merged. My bet would be that the languages with palatal series show some mergers, while Spanish etc. never merge the bilabial m with the labio-dental ɱ.
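The merger test could be sketched like this, with a toy segmentation and toy vectors (all hypothetical; a real run would segment the data with LingPy's syllabification and use soundvectors for the vectors). The point is simply that two syllables merge when their sound-by-sound vector sequences are identical:

```python
from itertools import combinations

# Toy vector mapping in which a palatalized and a plain consonant collide
# (hypothetical values, chosen to illustrate the merger test).
to_vec = {"m": (1, 0), "mʲ": (1, 0), "ɱ": (1, 1), "a": (0, 1)}

# Toy CV syllables and their segmentations.
segmented = {"ma": ["m", "a"], "mʲa": ["mʲ", "a"], "ɱa": ["ɱ", "a"]}

def syll_vec(sounds):
    """Map a segmented syllable to its sequence of sound vectors."""
    return tuple(to_vec[s] for s in sounds)

# Count syllable pairs that map to identical vector sequences.
merged = [
    (s1, s2)
    for s1, s2 in combinations(segmented, 2)
    if syll_vec(segmented[s1]) == syll_vec(segmented[s2])
]
print(merged)  # ma and mʲa merge; ɱa stays distinct
```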

LinguList commented 6 months ago

Failure to distinguish f and v̥ in Dutch also seems like something one wants to improve.

LinguList commented 6 months ago

But @arubehn I must say that we are now entering terrain where it gets linguistically interesting. We should -- if I manage to be in the office tomorrow -- try to quickly discuss with @justalingwist what the potential of this study is (distinctivity is a topic I have wanted to work on for a long time).

arubehn commented 6 months ago

Yes, let's discuss this soon. I agree that the results already look quite interesting. I will run the same kind of evaluation on different datasets tomorrow, since NorthEuraLex obviously has a typological bias.