digling / burmish

LingPy plugin for handling a specific dataset
GNU General Public License v2.0

Surprises in directed networks #110

Open nh36 opened 7 years ago

nh36 commented 7 years ago

In some cases the reconstruction given by the directed network is a surprise. In particular, ID 151 "the fog" is being reconstructed with a v- initial, whereas the sound change specification file says w > v, so it should be reconstructed with a *w-.

Another case is ID 380 "the fish", which should be reconstructed with *ts-, since I already specified ts > t as an initial change. So it is not clear what is happening.

A third case is ID 187 "horizontal", where I expect *ḭ: to be reconstructed by the directed networks, but it isn't, even though we do have "n ḭ: ḭ" in the 'change direction' file.

LinguList commented 7 years ago

We have two bad categories of consensus patterns:

Neither type of pattern helps in reconstruction, and both should be avoided or banned during the process; but to ban them completely, we can't do without manual refinement, I'm afraid.

nh36 commented 7 years ago

So the algo only works if the pattern is there for all 8 languages. I suppose the problem is that I do not sufficiently understand what the algo is doing. And maybe this will change when you start to use the networks for it. At the moment there are cases, like these, where I am not able to follow the thinking of the computer myself. I suppose you will be writing about this somewhere soon.

LinguList commented 7 years ago

The algorithm does an extremely simple thing, which takes about 10 lines of Python:

Whenever it returns False, it falls back to majority rule. This is easy to see now, as such cases are marked. There are two reasons for a False:

  1. the subgraph is not connected (your network doesn't cover the transition; a consequence of writing the networks not independently, but based on the output)
  2. the subgraph does not have a source
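The two conditions above can be sketched as follows. This is a hypothetical reconstruction of the check, not the actual plugin code (the function and variable names are my own): given the directed network of sound changes and the set of sounds attested in one correspondence pattern, the proto-sound is the unique source of the induced subgraph, and anything else triggers the majority-rule fallback.

```python
# Sketch of the source-finding step (assumed names, not the plugin's code).
import networkx as nx

def propose_proto(changes, pattern_sounds):
    """Return the proto-sound for a pattern, or False to signal
    a fallback to majority rule."""
    graph = nx.DiGraph(changes)  # edges like ("w", "v") for the change w > v
    sub = graph.subgraph(s for s in pattern_sounds if s in graph)
    # Reason 1: all sounds must be in the network, and the induced
    # subgraph must be connected (ignoring edge direction).
    if len(sub) != len(pattern_sounds) or not nx.is_weakly_connected(sub):
        return False
    # Reason 2: the subgraph must have exactly one source,
    # i.e. one node with no incoming edges.
    sources = [n for n, deg in sub.in_degree() if deg == 0]
    if len(sources) != 1:
        return False
    return sources[0]

# With w > v specified, a pattern showing w and v points back to *w:
print(propose_proto([("w", "v"), ("ts", "t")], {"w", "v"}))  # w
# Sounds the network does not link cause the fallback:
print(propose_proto([("w", "v")], {"w", "t"}))               # False
```

On this reading, the surprises above would come from patterns whose sounds are not all linked by the hand-written network, not from the algorithm ignoring the specified changes.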

I recommend, if you want to check this fully, that you look at EACH of ALL the patterns in the file and extract the necessary changes each time. This would be a lot of work, but it would at least yield an exhaustive graph. Ideally, in such a situation, you would also just assign your expected outcome to the patterns (this could even be straightforward: annotating 500 patterns should be doable in under one hour, I guess). Then we would have something to compare against.

Right now, you are disappointed or surprised because certain points don't run as expected, but I think you underestimate to what degree the decisions are strict, and most of the time it is just the data that fails.

LinguList commented 7 years ago

I recommend doing it like this:

  1. annotate all patterns with the expected outcome manually (I'd be keen on having this anyway), and please also annotate those cases where we have sounds which are NOT attested, as this is what we'll need anyway!
  2. discard patterns which occur only once (this will make it really fast).
  3. write a list of all potential transitions between sounds (in both directions), restricted to sounds which co-occur in at least one pattern (this will prevent vowels from matching with consonants).
  4. discard impossible sound changes from this exhaustive list.
  5. test this new list on the data.
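Steps 2 and 3 above can be sketched in a few lines. This is only an illustration under my own assumptions (the function name and the toy patterns are invented): singleton patterns are dropped first, then every ordered pair of sounds that co-occur in at least one surviving pattern becomes a candidate transition for the manual step 4.

```python
# Sketch of steps 2-3: drop singleton patterns, then enumerate every
# ordered pair of co-occurring sounds as a candidate transition.
from collections import Counter
from itertools import permutations

def candidate_transitions(patterns):
    # Step 2: discard patterns which occur only once.
    counts = Counter(tuple(sorted(p)) for p in patterns)
    kept = [p for p in counts if counts[p] > 1]
    # Step 3: both directions, but only between sounds that are
    # attested together in some pattern (so vowels never pair with
    # consonants unless a pattern actually mixes them).
    pairs = set()
    for pattern in kept:
        pairs.update(permutations(set(pattern), 2))
    return sorted(pairs)

patterns = [["w", "v", "v"], ["w", "v", "v"], ["ts", "t", "t"]]
print(candidate_transitions(patterns))  # [('v', 'w'), ('w', 'v')]
```

The ts/t pattern occurs only once in the toy data, so it is discarded in step 2 and contributes no candidate pairs; the output is the two directions of the w/v correspondence, from which step 4 would strike the impossible one.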

I can produce (but not this week) the list of possible sound changes. You could in the meantime start annotating all of the 500 (minus singletons) patterns, to give us a more objective way to test.