Number of unique sounds has changed

arubehn commented 4 months ago

While rerunning the evaluation scripts to make sure that everything still works smoothly after the refactoring, I found that the number of unique sounds has changed slightly (from 5,317 to 5,255). It seems that the feature mapping for a small group of sounds was changed accidentally... I will look into it.

arubehn commented 4 months ago

The results for the distinctiveness analysis look completely off now... something must have gone wrong in the recent changes we made.

arubehn commented 4 months ago

I fixed the issue with the distinctiveness analysis; the numbers are back to the correct order of magnitude. However, there are still some minor deviations that I will investigate.

LinguList commented 4 months ago

It shows why extensive tests are so useful.

LinguList commented 4 months ago

We should add a list of all features with their names to the test repository and run it in the future as well, so that we can check every change.

arubehn commented 4 months ago

Agreed :)

What I would do - once I have found and fixed the error - is to generate a file containing the ~8,000 CLTS sounds (or a representative subset) with the expected feature vectors. Then we can consistently test against that. Or does reading data from external files mess with the automatic testing workflow?
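A minimal sketch of what such a test could look like, assuming a TSV file with a `sound` column and a space-separated `vector` column. The column names, the toy data, and the `vectorize` callable are all hypothetical placeholders, not the actual soundvectors API:

```python
import csv
import io


def load_expected(handle):
    """Read sound -> expected feature vector from a TSV with 'sound' and 'vector' columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["sound"]: tuple(int(value) for value in row["vector"].split())
        for row in reader
    }


def find_mismatches(expected, vectorize):
    """Return {sound: (expected_vector, actual_vector)} for every sound that differs."""
    mismatches = {}
    for sound, expected_vector in expected.items():
        actual_vector = tuple(vectorize(sound))
        if actual_vector != expected_vector:
            mismatches[sound] = (expected_vector, actual_vector)
    return mismatches


# Toy stand-ins (assumption): a real test would read the generated file from disk
# and call the real vectorization code instead of this lookup table.
SAMPLE_TSV = "sound\tvector\na\t1 -1 0\ni\t1 1 -1\n"
TOY_VECTORS = {"a": (1, -1, 0), "i": (1, 1, 1)}


def toy_vectorize(sound):
    return TOY_VECTORS[sound]


expected = load_expected(io.StringIO(SAMPLE_TSV))
mismatches = find_mismatches(expected, toy_vectorize)
```

In a real test, `assert not mismatches` would fail and print exactly the sounds whose vectors have drifted.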

LinguList commented 4 months ago

Yes, that would be cool. This does not take long to test, and we have something for the future!

arubehn commented 4 months ago

I have investigated all the mismatching sounds. It appears that they can be classified into three classes - and for two of them the error was actually in the old code, not in the new one:

- The old code had assigned 0 values to some diphthong-related features in diphthongs, which of course should not be the case. The new code correctly specifies all relevant features as 1 or -1.
- The new code did not assign the [+secondrounded] feature correctly for diphthongs that end in a rounded vowel. This bug has been fixed.
- The last mismatch concerns only two sounds, and might actually be an issue with CLTS: [xʀ̥] and [βɾ] are both represented as fricatives, not as consonant clusters, by CLTS (even though, in both cases, only one segment is actually a fricative). @LinguList, is this behavior intended? If not, I will open an issue on the CLTS GitHub. Either way, the difference is that these two sounds used to be [-strid] in the old code and are now [+strid] in the new code. Since the current CLTS representation encodes them only as fricatives, and both sounds contain places of articulation that should be assigned [+strid] (uvular and alveolar, respectively), the desired behavior for our system is to assign [+strid].
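For reference, sorting the mismatches into classes like these mostly comes down to looking at which named features differ between the old and the new vector of a sound. A rough sketch, in which the feature order and the toy values are made up for illustration (only the names `strid` and `secondrounded` come from this thread):

```python
def changed_features(old_vector, new_vector, feature_names):
    """Return {feature_name: (old_value, new_value)} for every feature that differs."""
    if not (len(old_vector) == len(new_vector) == len(feature_names)):
        raise ValueError("vectors and feature names must have the same length")
    return {
        name: (old, new)
        for name, old, new in zip(feature_names, old_vector, new_vector)
        if old != new
    }


# Hypothetical three-feature layout; the real vectors are much longer.
FEATURE_NAMES = ("strid", "secondrounded", "round")

diff = changed_features((-1, 0, 1), (1, 1, 1), FEATURE_NAMES)
```

Grouping the sounds by the set of changed feature names then shows at a glance which mismatches belong together.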

arubehn commented 4 months ago

Since none of the sounds that we analyzed in the plots or in the concordance line analyses was affected, these parts can remain as they are. I will rerun the quantitative analysis on distinctiveness; there might be minor changes, but the figures will definitely remain in the same order of magnitude. So, essentially, I think only the numbers have to be updated accordingly :)

arubehn commented 4 months ago

It turns out that the numbers (almost) didn't change at all (probably because the bugs only affected relatively marked sounds, and were systematic): the per-language distinctiveness analysis is now back to exactly the same numbers, and out of the ~8k CLTS sounds, we can now actually provide one (1!) more unique feature vector (5,318 instead of 5,317) :D
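For anyone who wants to reproduce the count, a small sketch of how it can be computed, assuming that "unique" here means the number of distinct feature vectors across all sounds. The toy mapping below is invented and stands in for the real sound-to-vector data:

```python
from collections import defaultdict


def group_by_vector(vectors):
    """Group sounds by their feature vector: {vector: [sounds sharing it]}."""
    groups = defaultdict(list)
    for sound, vector in vectors.items():
        groups[tuple(vector)].append(sound)
    return dict(groups)


# Invented toy data (assumption); the real input would cover the ~8k CLTS sounds.
toy_vectors = {"a": (1, -1), "b": (1, -1), "c": (-1, 1)}

groups = group_by_vector(toy_vectors)
n_unique = len(groups)  # number of distinct feature vectors
collisions = {vec: sounds for vec, sounds in groups.items() if len(sounds) > 1}
```

The `collisions` mapping lists the sounds that cannot be told apart, which is useful when the count changes unexpectedly.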

LinguList commented 4 months ago

Well done :-) I am also glad to see that the new code let us improve content-wise.

LinguList commented 4 months ago

The two sounds are most likely defined like this by Cormac, and I would trust his judgment here. They are hard-coded into CLTS, not generated, with the double-features (velar-and-uvular) that are generally very rare.

arubehn commented 4 months ago

Okay, that's perfect - then that case is also solved :)

cldf-clts / soundvectors

Number of unique sounds has changed #16