digling / burmish

LingPy plugin for handling a specific dataset
GNU General Public License v2.0

refactor code for clique optimization #67

Closed LinguList closed 7 years ago

LinguList commented 7 years ago

The goal is to assign all patterns to their LARGEST clique. This is greedy and may be wrong, but it will be corrected later on, when re-applying the reconstruction mechanism to resolve the patterns: we will reconstruct all possible sounds for a given pattern, following classical Reconstruction Engine style (Mazaudon and Lowe 199?).

Here's the basic idea in code for clique optimization:

import networkx as nx

# sort the maximal cliques by size, largest first
cliques = sorted(nx.find_cliques(graph), key=len, reverse=True)
visited = set()  # nodes that have already been assigned to a clique
idx = 1
while cliques:
    next_clique = cliques.pop(0)
    # assign the current clique id to every node not yet assigned
    for node in [n for n in next_clique if n not in visited]:
        graph.nodes[node]['clique'] = idx
        visited.add(node)
    idx += 1
    # re-rank the remaining cliques by their number of still unassigned nodes
    cliques = sorted(
        cliques,
        key=lambda x: len([y for y in x if y not in visited]),
        reverse=True)

This should do the trick, I hope: visited stores the nodes that have already been assigned, and the remaining cliques are re-ordered by re-counting their sizes with visited nodes excluded. Since visited nodes are skipped, no node is ever assigned to a clique a second time.

LinguList commented 7 years ago

It works! It takes quite a long time, as the repeated sorting is apparently expensive, but I think this is definitely what needs to be done here; at least it is completely accurate, and waiting 30 seconds does not really bother me.

LinguList commented 7 years ago

Okay, now it is important to consider the results and the caveats. It seems that we get some rather "empty" patterns, where we have only cognates in Burmese, Written Burmese, and Old Burmese, but none in the rest. These are of course still a clique and compatible, but less interesting. We need to think of ways to

  1. enhance the general clique algorithm (although it is the most consistent approach so far)
  2. evaluate patterns (search for these skewed patterns and try to re-assign them, potentially to some other clique)
  3. check for potential errors in alignments that might add up to huge problems in the output

There should be a "consistency score" for each clique, based on pervasiveness of regularity (how many identical columns, etc.).
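One way such a score could be computed, sketched here purely as an illustration: assuming every node stores its correspondence column as a tuple under a (hypothetical) 'pattern' attribute, the score is the share of columns in a clique that equal its most frequent column.

from collections import Counter

def consistency_score(graph, clique_id):
    # collect the columns of all nodes assigned to this clique;
    # the 'pattern' attribute is an assumption, not part of the plugin
    columns = [tuple(data['pattern'])
               for node, data in graph.nodes(data=True)
               if data.get('clique') == clique_id]
    if not columns:
        return 0.0
    # share of columns identical to the most frequent one
    most_common = Counter(columns).most_common(1)[0][1]
    return most_common / len(columns)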

LinguList commented 7 years ago

Marking this as solved now, and opening a further issue for clique evaluation.

LinguList commented 7 years ago

Problem detected: our min_tax criterion prevents existing cliques from being merged. We need a second run with the consensus of a given clique, during which we search for further compatibilities!
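A rough sketch of what such a second run might look like, under explicit assumptions: each clique's patterns are gathered in a (hypothetical) dict cliques_by_id, gaps are written as 'Ø', and two patterns count as compatible when they never disagree on a filled position. This is only an illustration, not the plugin's actual code.

def consensus(patterns, gap='Ø'):
    # column-wise consensus: the unique non-gap value if the column agrees,
    # otherwise a gap
    out = []
    for col in zip(*patterns):
        values = {v for v in col if v != gap}
        out.append(values.pop() if len(values) == 1 else gap)
    return tuple(out)

def compatible(pat_a, pat_b, gap='Ø'):
    # compatible = no conflicting values in any filled position
    return all(a == b or gap in (a, b) for a, b in zip(pat_a, pat_b))

# second run: merge cliques whose consensuses are mutually compatible
merged = []
for cid, patterns in sorted(cliques_by_id.items()):
    cons = consensus(patterns)
    for group in merged:
        if compatible(cons, group['consensus']):
            group['members'].append(cid)
            group['consensus'] = consensus([group['consensus'], cons])
            break
    else:
        merged.append({'members': [cid], 'consensus': cons})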

LinguList commented 7 years ago

Same procedure was mentioned here:

But they use an extra loop to cluster the missing nodes afterwards.
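A minimal sketch of what such an extra pass could look like here, assuming that any node left without a clique id should simply inherit the id that is most frequent among its already assigned neighbours (my interpretation, not the procedure from the linked source):

from collections import Counter

for node in graph.nodes():
    if node in visited:
        continue
    # count the clique ids of the already assigned neighbours
    neighbour_ids = Counter(graph.nodes[n]['clique']
                            for n in graph.neighbors(node) if n in visited)
    if neighbour_ids:
        graph.nodes[node]['clique'] = neighbour_ids.most_common(1)[0][0]
        visited.add(node)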

LinguList commented 7 years ago

Apparently, the problem can also be called "clique covering problem":
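For what it is worth, a small (not necessarily minimum) clique cover can be obtained with off-the-shelf networkx tools: a proper coloring of the complement graph partitions the nodes into independent sets of the complement, i.e. into cliques of the original graph. This is just an illustration of the equivalence, not what the plugin does:

import networkx as nx
from collections import defaultdict

def greedy_clique_cover(graph):
    # each colour class of the complement is an independent set there,
    # hence a clique in the original graph
    coloring = nx.coloring.greedy_color(nx.complement(graph),
                                        strategy='largest_first')
    cover = defaultdict(list)
    for node, color in coloring.items():
        cover[color].append(node)
    return list(cover.values())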

LinguList commented 7 years ago

Given that the problem seems to be the inverse of the independent set problem, it may be possible to take the complement of the graph and then search for the maximum independent set:
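In networkx terms the duality looks roughly like this; note that networkx only provides a maximal (not maximum) independent set, so this would at best approximate the idea:

import networkx as nx

complement = nx.complement(graph)
# a maximal independent set of the complement is a clique of the original graph
independent = nx.maximal_independent_set(complement)
assert all(graph.has_edge(u, v)
           for u in independent for v in independent if u != v)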

LinguList commented 7 years ago

Okay, I found the solution to the problem, and the algorithm was modified again:

The result is a partition of the graph into cliques whose consensuses are incompatible with each other. Currently, we find 165 patterns for consonants (including singletons); 86 contain no gaps; 51 occur 3 times or more; and 42 without a gap occur three times or more. We further find:

This tells us that parts of the data are probably wrongly annotated (about 200 of 700 cognate sets deserve to be rechecked), that the phonology is not yet sufficiently represented in all cases (see issue #72), and that we might lack the coverage in the data to find the examples.

I'll update the patterns in the DB, and then I'll try to make a reverse pattern analysis, using the patterns which @nh36 gave from the spreadsheet, to see how many words can be explained by them.

LinguList commented 7 years ago

For the vowels, there are even more patterns (269, including singletons!), but about the same number of patterns occurring 3 times or more as for the consonants. Vowels might be further simplified by excluding nasality (although nasality can also have an impact on change, so it may be useful to keep it).
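If we do try excluding nasality, one simple way, assuming nasal vowels are written with the combining tilde (U+0303) as in IPA (an assumption about the transcription, not necessarily how the data is encoded), would be to normalize the tokens before building the patterns:

import unicodedata

def strip_nasality(token):
    # decompose, drop the combining tilde (U+0303), recompose
    decomposed = unicodedata.normalize('NFD', token)
    stripped = ''.join(c for c in decomposed if c != '\u0303')
    return unicodedata.normalize('NFC', stripped)

assert strip_nasality('ã') == 'a'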

I'll close this issue for now and then add other issues for the next steps of this algorithm.