cldf-clts / soundvectors

Number of unique sounds has changed #16

Closed by arubehn 4 months ago

arubehn commented 4 months ago

While rerunning the evaluation scripts to make sure that everything still works smoothly after the refactoring, I found that the number of unique sounds has changed slightly (from 5,317 to 5,255). It seems the feature mapping for a small group of sounds was changed accidentally... I will look into it.

arubehn commented 4 months ago

The results of the distinctiveness analysis are looking completely off now... Something must have gone wrong in the recent changes we made.

arubehn commented 4 months ago

I fixed the issue with the distinctiveness analysis; the numbers are back to the correct order of magnitude. However, there are still some minor deviations that I will investigate.

LinguList commented 4 months ago

It shows why extensive tests are so useful.

LinguList commented 4 months ago

We should add a list of all features with their names to the test repository and run it on every future change, so that each change is checked.

arubehn commented 4 months ago

Agreed :)

What I would do - once I have found and fixed the error - is to generate a file containing the ~8,000 CLTS sounds (or a representative subset) with the expected feature vectors. Then we can consistently test against that. Or does reading data from external files mess with the automatic testing workflow?
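A minimal sketch of what such a test could look like, assuming a tab-separated file with `sound` and `vector` columns. The file layout, the toy vectors, and the `vectorize` callable (a stand-in for whatever the package exposes) are all hypothetical and not taken from the repository:

```python
import csv
import io


def load_expected(handle):
    """Read a sound -> expected feature vector mapping from a TSV handle.

    Assumes two columns, 'sound' and 'vector', where 'vector' holds
    comma-separated integers (a hypothetical layout).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["sound"]: tuple(int(v) for v in row["vector"].split(","))
        for row in reader
    }


def find_mismatches(expected, vectorize):
    """Return {sound: (expected, actual)} for every sound whose vector differs."""
    mismatches = {}
    for sound, vec in expected.items():
        actual = tuple(vectorize(sound))
        if actual != vec:
            mismatches[sound] = (vec, actual)
    return mismatches


# Toy data standing in for the real reference file and the real vectorizer.
SAMPLE = "sound\tvector\np\t1,-1,0\nb\t1,1,0\n"
TOY = {"p": (1, -1, 0), "b": (1, 1, 0)}

expected = load_expected(io.StringIO(SAMPLE))
assert find_mismatches(expected, TOY.__getitem__) == {}
```

In a real test the handle would come from a file checked in next to the tests and opened via a path relative to the test module, which test runners such as pytest handle without special setup.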

LinguList commented 4 months ago

Yes, that would be cool. This does not take long to test, and we have something for the future!

arubehn commented 4 months ago

I have investigated all the mismatching sounds. It appears that they can be classified into three classes - and for two of them the error was actually in the old code, not in the new one:

arubehn commented 4 months ago

Since no sound that we have analyzed in the plots or in the concordance lines analyses was affected, these parts can remain as they are. I will rerun the quantitative analysis on distinctiveness; there might be minor changes, but the figures will definitely remain in the same order of magnitude. So, essentially, I think only the numbers have to be changed accordingly :)

arubehn commented 4 months ago

Turns out, the numbers (almost) didn't change at all (probably because the bugs only affected relatively marked sounds and were systematic): the per-language distinctiveness analysis is now back to exactly the same numbers, and out of the ~8k CLTS sounds, we are now actually capable of providing one (1!) more unique feature vector (5,318 instead of 5,317) :D
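For reference, a rough sketch of how such a count can be derived, assuming "unique" means the number of distinct vectors across all sounds. The function name and the toy vectors are illustrative only:

```python
from collections import defaultdict


def group_by_vector(vectors):
    """Group sounds by their feature vector: {vector: [sounds sharing it]}."""
    groups = defaultdict(list)
    for sound, vec in vectors.items():
        groups[tuple(vec)].append(sound)
    return dict(groups)


# Toy mapping; the real input would be the vectors for all ~8k CLTS sounds.
toy = {"p": (1, -1, 0), "b": (1, 1, 0), "pʰ": (1, -1, 0)}

groups = group_by_vector(toy)
n_unique = len(groups)  # number of distinct feature vectors
collisions = {vec: sounds for vec, sounds in groups.items() if len(sounds) > 1}
```

Listing the collisions alongside the count makes it easier to see which sounds are responsible when the number shifts between versions.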

LinguList commented 4 months ago

Well done :-) It also makes me glad to see that we could improve content-wise with the new code.

LinguList commented 4 months ago

The two sounds are most likely defined like this by Cormac, and I would trust his judgment here. They are hard-coded into CLTS, not generated, with the double-features (velar-and-uvular) that are generally very rare.

arubehn commented 4 months ago

Okay, that's perfect - then that case is also solved :)