cldf-clts / pyclts

Apache License 2.0

add first tests for inventories #15

Closed LinguList closed 4 years ago

LinguList commented 4 years ago

This proposes a first, very simple metric for comparing inventories. The idea is to use the sound.similar(othersound) method in pyclts, which is based on the Jaccard distance, to find the best matches between two inventories. Once the matches have been found, their scores yield a revised Jaccard distance.
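To make the idea concrete, here is a minimal sketch of greedy approximate matching. It is not the code in this pull request: the function names, the representation of a sound as a plain set of feature strings, and the choice to count every unmatched sound as a slot scoring 0 are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections (1.0 for two empty ones)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_inventory_similarity(inv_a, inv_b):
    """Greedily pair the most similar sounds; unmatched sounds score 0.

    Hypothetical helper: inventories map a symbol to a set of features.
    """
    if not inv_a and not inv_b:
        return 1.0
    # Score every cross-inventory pair, best first (ties broken by symbol).
    pairs = sorted(
        ((jaccard(fa, fb), sa, sb)
         for sa, fa in inv_a.items()
         for sb, fb in inv_b.items()),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_a, used_b, total = set(), set(), 0.0
    for score, sa, sb in pairs:
        if sa not in used_a and sb not in used_b:
            used_a.add(sa)
            used_b.add(sb)
            total += score
    # One slot per matched pair plus one slot (scoring 0) per leftover sound.
    slots = len(inv_a) + len(inv_b) - len(used_a)
    return total / slots


inv_a = {
    "p": {"voiceless", "bilabial", "stop"},
    "t": {"voiceless", "alveolar", "stop"},
    "m": {"voiced", "bilabial", "nasal"},
}
inv_b = {
    "p": {"voiceless", "bilabial", "stop"},
    "d": {"voiced", "alveolar", "stop"},
}

strict = jaccard(inv_a, inv_b)                      # symbols only: 0.25
approx = greedy_inventory_similarity(inv_a, inv_b)  # p-p 1.0, t-d 0.5, m unmatched: 0.5
```

In this toy case the strict comparison only credits the shared "p", while the approximate version also gives partial credit for pairing "t" with "d".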

Furthermore, the "aspects" keyword allows one to specify what one wants to compare. Since sounds are ordered by major feature values, one can also compare only the stops, only the nasals, or only the consonants.
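A rough sketch of what comparing by aspect could look like, assuming each sound carries a single type label and that per-aspect scores are averaged with equal weight. The names `similarity_by_aspects` and `strict_jaccard` are hypothetical and do not reproduce the pyclts API.

```python
def strict_jaccard(a, b):
    """Plain Jaccard similarity over two collections of symbols."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_by_aspects(inv_a, inv_b, aspects, compare=strict_jaccard):
    """Average `compare` over the sounds belonging to each requested aspect."""
    scores = []
    for aspect in aspects:
        group_a = [s for s, kind in inv_a.items() if kind == aspect]
        group_b = [s for s, kind in inv_b.items() if kind == aspect]
        scores.append(compare(group_a, group_b))
    return sum(scores) / len(scores)


# Toy inventories mapping a symbol to its (assumed) major type.
lang_a = {"p": "consonant", "t": "consonant", "a": "vowel", "i": "vowel"}
lang_b = {"p": "consonant", "k": "consonant", "a": "vowel", "i": "vowel"}

consonants_only = similarity_by_aspects(lang_a, lang_b, ["consonant"])
both = similarity_by_aspects(lang_a, lang_b, ["consonant", "vowel"])
```

Here the consonants share one of three distinct symbols while the vowels are identical, so restricting the comparison to consonants gives a lower score than averaging over both groups.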

LinguList commented 4 years ago

@tresoldi, this is what I would propose as a first step towards comparing two inventories. It is not very advanced, but the idea is probably clear: we also accept approximate matches, albeit in a greedy fashion here.

tresoldi commented 4 years ago

It is very good as a starting point. I was pondering other ways of comparing, but they would all need a baseline for which the two methods are perfect.

codecov-commenter commented 4 years ago

Codecov Report

Merging #15 into master will increase coverage by 1.36%. The diff coverage is 98.64%.

@@            Coverage Diff             @@
##           master      #15      +/-   ##
==========================================
+ Coverage   95.80%   97.17%   +1.36%     
==========================================
  Files          28       30       +2     
  Lines        1241     1418     +177     
==========================================
+ Hits         1189     1378     +189     
+ Misses         52       40      -12     
Impacted Files Coverage Δ
src/pyclts/inventories.py 98.27% <98.27%> (ø)
tests/test_inventories.py 100.00% <100.00%> (ø)
src/pyclts/models.py 100.00% <0.00%> (ø)
tests/test_ipachart.py 100.00% <0.00%> (ø)
src/pyclts/ipachart.py 98.96% <0.00%> (+8.76%) :arrow_up:

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update fafbe0f...2ad3ba8.

LinguList commented 4 years ago

What is missing is explicit handling of unknown sounds. The idea would be: if there is an unknown sound, we fall back to string identity as the measure of similarity. So if self and other are both unknown and their strings are identical, we still give 1.0, but we give 0 if one of self and other is known while the other is unknown, or if their strings are not the same. This helps to deal with datasets like phoible and nicholaev, accounting for non-mapped sounds that are not available in clts.
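The proposed fallback can be sketched as follows. This is an illustration only: `sound_similarity` is a hypothetical function, and marking an unknown sound with `None` instead of a feature set is an assumption, not how pyclts represents unknown sounds.

```python
def sound_similarity(sym_a, feats_a, sym_b, feats_b):
    """Similarity of two sounds; `feats` is a feature set, or None if unknown."""
    if feats_a is None or feats_b is None:
        # At least one sound is unknown: fall back to string identity,
        # which only counts when both sounds are unknown.
        if feats_a is None and feats_b is None and sym_a == sym_b:
            return 1.0
        return 0.0
    union = feats_a | feats_b
    if not union:
        return 1.0
    return len(feats_a & feats_b) / len(union)


stop_p = {"voiceless", "bilabial", "stop"}
stop_b = {"voiced", "bilabial", "stop"}

same_unknown = sound_similarity("X1", None, "X1", None)   # identical strings
diff_unknown = sound_similarity("X1", None, "X2", None)   # different strings
mixed = sound_similarity("p", stop_p, "p", None)          # known vs. unknown
known = sound_similarity("p", stop_p, "b", stop_b)        # ordinary Jaccard
```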

xrotwang commented 4 years ago

But wouldn't this handling of unknown sounds render CLTS somewhat pointless? Isn't it more transparent to regard anything not mapped to CLTS as too obscure to be of relevance in any analysis?

(Sent by email in reply to Johann-Mattis List's comment of Wed, 3 June 2020, 22:30.)

LinguList commented 4 years ago

So in this case, those unknown sounds should just be excluded? This is already handled by the code, as the selection by "aspects" currently does not capture unknown sounds. I just wonder whether it obscures the results if a given inventory has too many unknown sounds.

xrotwang commented 4 years ago

Too many unknown sounds would mean inventories are not amenable to comparison with CLTS methods. But a threshold for what's too many might be interesting. E.g. if an inventory has 30 known sounds and 30 unknown, I'd expect enough overlap for comparison?

(Sent by email in reply to Johann-Mattis List's comment of Thu, 4 June 2020, 07:07.)

xrotwang commented 4 years ago

One could maybe use the number or ratio of unknown sounds as a weight for comparison.
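One possible reading of the threshold and weighting ideas, as a sketch. Neither the cut-off value nor the weighting formula comes from this thread; both are arbitrary choices for illustration, as are the function names and the use of `None` to mark a sound that is not mapped to CLTS.

```python
def unknown_ratio(inventory):
    """Share of sounds in an inventory that have no feature description."""
    if not inventory:
        return 0.0
    unknown = sum(1 for feats in inventory.values() if feats is None)
    return unknown / len(inventory)


def weighted_similarity(score, inv_a, inv_b, max_unknown=0.5):
    """Scale `score` by coverage; return None if coverage is too poor."""
    worst = max(unknown_ratio(inv_a), unknown_ratio(inv_b))
    if worst > max_unknown:
        return None  # too many unknown sounds to compare at all
    return score * (1 - worst)


# 30 known and 30 unknown sounds, as in the example above.
half_known = {f"k{i}": {"feature"} for i in range(30)}
half_known.update({f"u{i}": None for i in range(30)})
fully_known = {f"k{i}": {"feature"} for i in range(30)}

ratio = unknown_ratio(half_known)
kept = weighted_similarity(0.8, half_known, fully_known)
dropped = weighted_similarity(0.8, half_known, fully_known, max_unknown=0.4)
```

With the default threshold the half-known inventory is still compared, but its score is halved; with a stricter threshold the pair is skipped.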

(Sent by email in reply to Robert Forkel's comment of Thu, 4 June 2020, 07:12.)

LinguList commented 4 years ago

Well, in some sense, the weight is what I originally thought could be done. A normal Jaccard distance can be handled like the Hamming distance if you align the two sets (order doesn't count) and identify the matches. This is in fact the logic I use for the similarity now (only with similarities, not with distances): unmatched parts are given a score of 0, similar to a gap in an alignment analysis (which is also a specific form of Hamming). So if one includes the unknown-sound category in the computation, one can treat it as it would have been treated in an original Jaccard-distance comparison of features: it yields 0. An unknown sound never scores against another unknown sound. As a result, it will raise the distance between inventories.
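The alignment view can be written out in a few lines. This is a sketch under assumptions: sounds are compared by exact symbol only, `aligned_similarity` is a hypothetical name, and unknown sounds are passed in as plain counts because they never match anything.

```python
def aligned_similarity(known_a, known_b, n_unknown_a=0, n_unknown_b=0):
    """Jaccard similarity computed slot by slot, as in an alignment."""
    known_a, known_b = set(known_a), set(known_b)
    # One slot per distinct known sound: 1 for a match, 0 for a gap.
    slot_scores = [
        1 if sound in known_a and sound in known_b else 0
        for sound in sorted(known_a | known_b)
    ]
    # Unknown sounds never match, so each one adds a slot that scores 0.
    slot_scores += [0] * (n_unknown_a + n_unknown_b)
    if not slot_scores:
        return 1.0
    return sum(slot_scores) / len(slot_scores)


a = {"p", "t", "k", "a"}
b = {"p", "t", "a", "i"}

without_unknown = aligned_similarity(a, b)               # 3 matches in 5 slots
with_unknown = aligned_similarity(a, b, n_unknown_a=1)   # 3 matches in 6 slots
```

The number of zero-scoring slots is the Hamming distance of the alignment, and adding an unknown sound adds one more such slot, which lowers the similarity and so raises the distance.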

The other idea, checking the strings of unknown sounds, can be dropped: it in fact counteracts the idea of clts. If we find many identical sound strings in phoible and another larger dataset, we'd of course hope that we can extend clts.

But there's another problem: in the current setting, unknown sounds cannot be included in any group (aspect), unless one uses a single group containing all sounds (which is not there now), since they don't have any features such as "consonant" or "vowel". To include unknown sounds, one could add them as an extra group, but alongside the vowel and consonant scores this would amount to 1/3 of the score, and a group of unknown sounds present in one of the two datasets always amounts to 0. Of course, we could ask the original datasets to tell us what sound class their sounds belong to. But this would get more and more complicated, so the more I think about it, the more I think we could just leave it as is for the time being and see where it leads us, tracking the number of unknown sounds for each pair of languages we compare. The fact is that unknown sounds are rare: we cover almost 90% of phoible now, and the rest is questionable anyway, occurring in only a few languages.
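The 1/3 effect is easy to check with a small worked example, assuming the per-group scores are averaged with equal weight. The scores below are invented for illustration and `mean_group_score` is a hypothetical helper.

```python
def mean_group_score(group_scores):
    """Equal-weight average over per-group similarity scores."""
    return sum(group_scores.values()) / len(group_scores)


# Invented scores for two inventories compared by consonants and vowels.
two_groups = mean_group_score({"consonant": 0.9, "vowel": 0.8})

# Adding an "unknown" group that always scores 0 takes a third of the weight.
three_groups = mean_group_score(
    {"consonant": 0.9, "vowel": 0.8, "unknown": 0.0}
)

# Even with perfect consonant and vowel matches, the total is capped at 2/3.
ceiling = mean_group_score({"consonant": 1.0, "vowel": 1.0, "unknown": 0.0})
```

A single unknown sound in one of the two inventories would thus cost as much as a complete mismatch of all vowels, which illustrates why treating unknown sounds as a separate group is problematic.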

LinguList commented 4 years ago

Ah, and regarding the number or ratio: that can indeed be done. But we could also first try it on concrete examples and, if it turns out to be useful, add it here? With @tresoldi having added the nicholaev data now, I am quite interested in seeing how well it compares with phoible.

LinguList commented 4 years ago

I propose I merge this now, and we can then see how we handle unknown sounds in particular, if needed.