evolaemp / online_cognacy_ident

Fast and unsupervised methods for multilingual cognate clustering (Rama, Wahle, Sofroniev, and Jäger)
MIT License

clustering function has unused code #2

Open LinguList opened 6 years ago

LinguList commented 6 years ago

The function in clustering was originally taken from, and builds on, lingpy's InfoMap clustering (you might want to mention this in the function's description), and it contains some code that you never use:

First, you don't use the vertex_weights, so why not drop that part of the algorithm? Second, as of lingpy 2.6 we no longer use edge weights either, as it turned out that they don't make a difference, so all of them can be set to None now. In fact, I just spotted that lingpy still computes vertex weights, which I'll also change in our code. You should also note that the original idea of representing edge weights as 1 - cell in lingpy (given that one needs to convert from distance to similarity), which you retain there, is probably as unjustified as using the logarithm or anything else: 1 - cell is probably just too small to make any difference for InfoMap, as we're in floating-point range. The important aspect for all of these clustering algorithms is the threshold; if you feed them a complete graph with weights, they'll all return only one community.

For comparison, see the lingpy function in the algorithm/extra module.
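Roughly, the idea is just to build a graph with an edge wherever the distance falls below the threshold and then let InfoMap find the communities. A minimal sketch with python-igraph (illustrative only, not lingpy's actual code):

```python
import igraph

def infomap_cluster(threshold, dist_matrix, labels):
    """Toy sketch of thresholded InfoMap clustering (not lingpy's code).

    An edge is added wherever the pairwise distance falls below the
    threshold; InfoMap then runs without any edge or vertex weights.
    """
    graph = igraph.Graph()
    graph.add_vertices(len(labels))
    edges = [(i, j)
             for i in range(len(labels))
             for j in range(i + 1, len(labels))
             if dist_matrix[i][j] < threshold]
    graph.add_edges(edges)
    communities = graph.community_infomap()  # no weights passed at all
    return {labels[i]: communities.membership[i] for i in range(len(labels))}
```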

pavelsof commented 6 years ago

Thank you! The code is still very patchy and there is more work to be done before it even starts working properly, I am sorry for the mess :)

I think the right thing to do here is to add lingpy as a dependency and just use the infomap_clustering function directly off the shelf.
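Something along these lines, assuming I'm reading the lingpy 2.6 interface correctly (the exact import path and signature are worth double-checking against the docs):

```python
# Rough usage sketch; verify the exact signature against the pinned
# lingpy version, this is from memory and may be slightly off.
from lingpy.algorithm.extra import infomap_clustering

dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
words = ['Hand', 'hand', 'mano']

clusters = infomap_clustering(0.5, dist, taxa=words)
print(clusters)  # something like {1: ['Hand', 'hand'], 2: ['mano']}
```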

pavelsof commented 6 years ago

Although, on the other hand, it might be overkill to include the whole suite just to use a single function, albeit such an important one. The alternative would be to sync our igraph_clustering function with the original and, of course, be explicit in documenting where it came from. @erathorn, what do you think?

LinguList commented 6 years ago

I suppose you'll use sound classes anyway, right? Or do you only use data that was already converted to ASJP characters? If not, you'll have lingpy as a dependency anyway, unless you also extract the sound-class code from there.

pavelsof commented 6 years ago

Actually, using lingpy's implementation of Needleman-Wunsch makes the PMI algorithm run a lot faster, so we should probably use that rather than the code that is currently in the repo. However, neither talign.globalalign nor pairwise.nw_align seems to allow for setting a fixed gap extension penalty; I guess I am missing something, @LinguList?

pavelsof commented 6 years ago

With regards to the IPA to ASJP conversion, I am not sure whether the intention is to include that in the repo, but I assume that it would not hurt. @erathorn?

LinguList commented 6 years ago

> Actually, using lingpy's implementation of Needleman-Wunsch makes the PMI algorithm run a lot faster, so we should probably use that rather than the code that is currently in the repo. However, neither talign.globalalign nor pairwise.nw_align seems to allow for setting a fixed gap extension penalty; I guess I am missing something, @LinguList?

Yes, you're right: we use a gap-extension scale rather than a penalty. The function in lingpy you are looking for is algorithm.malign:malign.nw_align, but there I deliberately excluded the gap-extension penalty, as it is not part of the original NW algorithm (the GEP was introduced by Gotoh 1994, as you can see when reading Kondrak's thesis).

But having an extended nw_align in the same spirit is in fact easy to achieve. We may consider adding a function tailored to your specific needs to lingpy.algorithm.malign.
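As an illustrative sketch only (this is not lingpy code, and the parameter names and default penalties are made up), a Gotoh-style extension of NW with a fixed gap-extension penalty could look roughly like this:

```python
def nw_align_affine(seq_a, seq_b, score, gap_open=-2.5, gap_extend=-1.75):
    """Global alignment score with an affine (Gotoh-style) gap model.

    Sketch only: score(a, b) is any substitution scoring function
    (e.g. a PMI lookup); parameter names and defaults are illustrative.
    Plain lists are used on purpose instead of numpy arrays (see the
    speed discussion further down in this thread).
    """
    NEG = float("-inf")
    n, m = len(seq_a), len(seq_b)
    # three DP layers: M = match/mismatch, X = gap in seq_b, Y = gap in seq_a
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(seq_a[i - 1], seq_b[j - 1])
            M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + s
            X[i][j] = max(M[i-1][j] + gap_open,
                          X[i-1][j] + gap_extend,
                          Y[i-1][j] + gap_open)
            Y[i][j] = max(M[i][j-1] + gap_open,
                          Y[i][j-1] + gap_extend,
                          X[i][j-1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


# toy usage with a simple match/mismatch score
print(nw_align_affine("kalam", "kala", lambda a, b: 1.0 if a == b else -1.0))
```

The three DP layers track whether the previous step was a match or a gap in either sequence, which is exactly what makes the distinction between opening and extending a gap possible.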

> With regards to the IPA to ASJP conversion, I am not sure whether the intention is to include that in the repo, but I assume that it would not hurt. @erathorn?

Here it depends on what you plan to do with the package in general. Do you want this to be the package for the online PMI paper? If so, there was the general plan to include it with lingpy-2.7 (Gerhard also wanted his PMI to be available through lingpy-2.7), and in that case it would not hurt to use lingpy as a dependency. If it is planned as a standalone, you may prefer independence.

erathorn commented 6 years ago

Regarding the infomap function and sound classes: we should certainly credit lingpy as the source in the comments. Regarding the vertex weights, I would suggest leaving them in there and allowing people to use them, but setting the default to "no". We work directly on ASJP symbols, so there is no dependency on lingpy here.

Regarding the Needleman-Wunsch function and the IPA to ASJP conversion: this is PhyloStar's work as far as I know. We should include him in this discussion.

Regarding the lingpy integration: first of all, this code should complement the online PMI paper to allow people to reproduce our work. Therefore, I would prefer to have a minimal standalone package in the first place and think about integration into lingpy as a second step.

pavelsof commented 6 years ago

The function's docstring now states that it is taken from LingPy. In my opinion, if we pull in LingPy as a dependency, there is no reason not to replace the function with the original; after all, the paper only talks about InfoMap. But on its own, this is not a reason to include the suite: the function is just a wrapper around igraph with a couple of loops to translate the data structures.

The project already requires the user to create a virtual environment and pip-install the dependencies; increasing the latter by the dozen or so that LingPy comes with does not really make it more difficult for the end user, just a few seconds slower to set up.

With regards to the two functions in the align module: I would rather have these either unit tested or replaced, especially the unreadable and relatively slow needleman_wunsch. The PMI algorithm ran at least twice as fast when I replaced it with LingPy's implementation. Of course, we could also use some other package here, e.g. Biopython. @erathorn, shall we add PhyloStar to the repo?
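For the unit tests, something along these lines is what I have in mind (a sketch only; the import path and the (alignment, score) return convention are assumptions about the current interface, not the repo's actual API):

```python
# test_align.py -- sketch; names and return values are hypothetical
from online_pmi.align import needleman_wunsch  # hypothetical import path

def test_identical_sequences_align_without_gaps():
    alignment, score = needleman_wunsch('sonne', 'sonne')
    assert all(a == b for a, b in alignment)

def test_score_is_symmetric():
    _, score_ab = needleman_wunsch('sonne', 'zon')
    _, score_ba = needleman_wunsch('zon', 'sonne')
    assert score_ab == score_ba
```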

erathorn commented 6 years ago

Please add him to the repo.

If we pull in lingpy, we can certainly use their version. I am not sure if we should do it at this point, though. If we go for lingpy's NW algorithm, we can go for the dependency. But since the NW code is PhyloStar's work, I would wait for his input.

LinguList commented 6 years ago

I am actually glad to hear that I didn't mess the algorithms up too much, as I was more concerned with getting them Cython-compatible. It would be in my interest, also with respect to future use of lingpy or a collaboration between our projects, to offer a more complex NW function (although I wouldn't call it NW_align or similar, as we're talking about this specific extension by Gotoh, as I mentioned before), and I would be interested in having this better version in lingpy, so that other people can profit from it.

Adding the GEP is actually straightforward if you take my nw_align code as the basis, right? So what about the following deal: you add this extended nw_align to your package, and I later add it to lingpy, also citing your package?

If you add tests, you may want to have a look at the tests I wrote for nw_align and other functions in the lingpy test module. I think we are close to 100% coverage, which is not easy, given the multitude of possible alignments, especially those we never encounter in practice. If you add tests for this new, improved alignment algorithm, I'd also gladly re-use them in lingpy. BTW: our policy in lingpy also states that this would qualify you as "collaborators", so you'd be listed in the lingpy board of collaborators for version 2.7 and all updates to version 2.6 before that.

pavelsof commented 6 years ago

I invited PhyloStar as a collaborator to the repo.

I could indeed try to implement it in Cython and see how it goes.

LinguList commented 6 years ago

Check out this thread on lingpy: https://github.com/lingpy/lingpy/issues/295 There I discuss what I want to do in the future: using a pure-Python implementation that also allows for Cython. It is beautiful; I just didn't find the time to do it for all scripts, but I had a successful test for one of them. I plan to do this for lingpy 2.7.

LinguList commented 6 years ago

I just realized why your NW algorithm is SO slow. Just get rid of all the numpy-crap-code. What you need to know is: if you do one alignment of two very, very long strings, numpy arrays are probably fast, but if you do very, very many alignments of very short strings, you will re-create numpy arrays each and every time, and this costs enormously. Just replace the numpy arrays with normal lists, and it'll gain drastically in speed!
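To see the effect yourself, here is an illustrative micro-benchmark (not code from the repo): it fills a small DP matrix cell by cell, which is exactly the access pattern of NW on short words, and the fresh-allocation plus scalar-indexing overhead is where numpy loses.

```python
import timeit
import numpy as np

N = 8  # typical word length: we run very many alignments of short strings

def fill_numpy():
    m = np.zeros((N + 1, N + 1))            # a fresh array per alignment
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            m[i][j] = m[i - 1][j - 1] + 1   # scalar indexing, as in the DP loop
    return m

def fill_lists():
    m = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            m[i][j] = m[i - 1][j - 1] + 1
    return m

print("numpy:", timeit.timeit(fill_numpy, number=50_000))
print("lists:", timeit.timeit(fill_lists, number=50_000))
```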

LinguList commented 6 years ago

I originally had numpy arrays in lingpy. Testing on my benchmark always took about 30 minutes. Then, one day, I replaced all the arrays with lists, and that day I thought the code did not work, since it then required only 5 (!) minutes to run through all the alignments. Numpy is great for some tasks, but not if you create many of those arrays.

pavelsof commented 6 years ago

@LinguList, thank you :) I re-wrote the Needleman-Wunsch implementation and the code now runs much faster (down to 4 sec from 19 sec on my machine for the Japanese dataset).

LinguList commented 6 years ago

Nice to see that this was indeed the thing that slowed it down...