LinguList opened this issue 6 years ago
Thank you! The code is still very patchy and there is more work to be done before it even starts working; I am sorry for the mess :)
I think the right thing to do here is to add lingpy as a dependency and just use the `infomap_clustering` function directly off the shelf.
Although, on the other hand, it might be overkill to pull in the whole suite just to use a single function, albeit such an important one. The alternative would be to sync our `igraph_clustering` function with the original and, of course, be explicit in documenting where it came from. @erathorn, what do you think?
I suppose you'll use sound classes anyway, right? Or do you only use data that was already converted to ASJP characters? If not, you'll have lingpy as a dependency anyway, unless you also extract the sound-class code from there.
Actually, using lingpy's implementation of Needleman-Wunsch makes the PMI algorithm run a lot faster, so we should probably use that rather than the code that is currently in the repo. However, neither `talign.globalalign` nor `pairwise.nw_align` seems to allow for setting a fixed gap extension penalty; I guess I am missing something, @LinguList?
With regards to the IPA to ASJP conversion, I am not sure whether the intention is to include that in the repo, but I assume that it would not hurt. @erathorn?
> Actually, using lingpy's implementation of Needleman-Wunsch makes the PMI algorithm run a lot faster, so we should probably use that rather than the code that is currently in the repo. However, neither `talign.globalalign` nor `pairwise.nw_align` seems to allow for setting a fixed gap extension penalty; I guess I am missing something, @LinguList?
Yes, you're right: we use a gap-extension scale rather than a penalty. The function in lingpy you are looking for is `algorithm.malign:malign.nw_align`, but here I deliberately excluded the gap-extension penalty, as it is not part of the original NW algorithm (the GEP was introduced by Gotoh 1994, as you can see when reading Kondrak's thesis).
But having an extended `nw_align` in the same spirit is in fact easy to achieve. We may consider adding a function tailored to your specific needs to `lingpy.algorithm.malign`.
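To make that concrete, here is a minimal sketch of such a Needleman-Wunsch variant with affine gap costs, i.e. a fixed gap-opening penalty plus a separate gap-extension penalty (Gotoh-style). The name `nw_affine`, the scoring function, and the default penalties are illustrative assumptions, not lingpy's API:

```python
# Sketch of a Needleman-Wunsch variant with affine gap costs
# (Gotoh-style: separate gap-opening and gap-extension penalties).
# Names and default values are illustrative, not lingpy's API.

def nw_affine(seqA, seqB, score=lambda a, b: 1 if a == b else -1,
              gap_open=-2, gap_extend=-1):
    """Return the optimal global alignment score of seqA and seqB."""
    n, m = len(seqA), len(seqB)
    NEG = float('-inf')
    # M: best score ending in a match/mismatch;
    # X: ending in a gap in seqB; Y: ending in a gap in seqA.
    # Gap states open from M or extend themselves (a common simplification).
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(seqA[i - 1], seqB[j - 1])
            M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + s
            X[i][j] = max(M[i-1][j] + gap_open, X[i-1][j] + gap_extend)
            Y[i][j] = max(M[i][j-1] + gap_open, Y[i][j-1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])

print(nw_affine("tisa", "dize"))  # e.g. comparing two ASJP strings
```

Note that the DP matrices are plain lists rather than numpy arrays, which also matters for speed when aligning many short words (see further down in this thread).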
> With regards to the IPA to ASJP conversion, I am not sure whether the intention is to include that in the repo, but I assume that it would not hurt. @erathorn?
Here, it depends on what you plan for the package in general. Do you want this to be the package for the online PMI paper? If so, there was a general plan to include it in lingpy-2.7 (Gerhard also wanted his PMI to be available through lingpy-2.7), and in that case it would not hurt to use lingpy as a dependency. If it is planned as a standalone, you may prefer to stay independent.
Regarding the infomap function and sound classes: we should certainly credit lingpy as the source in the comments. Regarding the vertex weights, I would suggest leaving them in and allowing people to use them, but setting the default to "no". We work directly on ASJP symbols, so there is no dependency on lingpy here.
Regarding the Needleman-Wunsch function and the IPA to ASJP conversion: this is PhyloStar's work as far as I know. We should include him in this discussion.
Regarding the lingpy integration: first of all, this code should complement the online PMI paper to allow people to reproduce our work. Therefore, I would prefer to have a minimal standalone package in the first place and think about integration into lingpy as a second step.
The function's docstring now states that it is taken from LingPy. In my opinion, if we pull in LingPy as a dependency, there is no reason not to replace the function with the original; after all, the paper only talks about InfoMap. But on its own this is not a reason to include the suite: the function is just a wrapper around igraph with a couple of loops to translate the data structures.
The project already requires the user to create a virtual environment and pip-install the dependencies; increasing the latter by the dozen or so that LingPy comes with does not really make things more difficult for the end user, just a few seconds slower to set up.
With regards to the two functions in the align module: I would rather have these either unit tested or replaced, especially the unreadable and relatively slow `needleman_wunsch`. The PMI algorithm ran at least twice as fast when I replaced it with LingPy's implementation. Of course, we could also use some other package here, e.g. Biopython. @erathorn, shall we add PhyloStar to the repo?
Please add him to the repo.
If we pull in lingpy, we can surely use their version. I am not sure we should do it at this point, though. If we go for lingpy's NW algorithm, we can go for the dependency. But since the NW code is PhyloStar's work, I would wait for his input.
I am actually glad to hear that I didn't mess the algorithms up too much, as I was more concerned with getting them Cython-compatible. It would be in my interest, also with respect to future use of lingpy or a collaboration between our projects, to offer a more complex NW function (although I wouldn't call it `nw_align` or similar, as we're talking about this specific extension by Gotoh, as mentioned before), and I would be interested in having this better version in lingpy, so that other people can profit from it.
Adding the GEP is actually straightforward if you take my `nw_align` code as the basis, right? So what about the following deal: you add this extended `nw_align` to your package, and I later add it to lingpy, also citing your package? If you add tests, you may want to have a look at the tests I wrote for `nw_align` and other functions in the lingpy test module. I think we are close to 100% coverage, which is not easy, given the multitude of possible alignments, especially those we never encounter in practice. If you add tests to this new, improved alignment algorithm, I'd also gladly re-use them in lingpy. BTW, our policy in lingpy also states that this would qualify you as "collaborators", so you'd be listed on the lingpy board of collaborators for version 2.7 and all updates to version 2.6 before that.
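For illustration, tests against hand-computed scores are a natural starting point. The pytest-style cases below target the illustrative `nw_affine` sketch from earlier in this thread (match +1, mismatch -1, `gap_open=-2`, `gap_extend=-1`), not lingpy's actual test suite:

```python
# Pytest-style sanity checks against hand-computed alignment scores.
# These refer to the illustrative nw_affine sketch above, not to
# lingpy's tests.

def test_identical_sequences():
    # four matches at +1 each
    assert nw_affine("haus", "haus") == 4

def test_align_against_empty():
    # one gap opening plus two extensions: -2 + 2 * (-1)
    assert nw_affine("abc", "") == -4

def test_single_mismatch():
    # a single substitution scores -1
    assert nw_affine("a", "b") == -1
```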
I invited PhyloStar as collaborator to the repo.
I could indeed try to implement it in Cython and see how it goes.
Check out this thread on lingpy: https://github.com/lingpy/lingpy/issues/295 There, I discuss what I want to do in the future: using a pure-Python implementation that also allows for Cython compilation. It is beautiful; I just didn't find the time to do it for all scripts, but I had a successful test for one of them. I plan to do this for lingpy 2.7.
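As a minimal illustration of that approach (the file and function below are made up for the example): a plain Python module can be compiled unchanged into a C extension with Cython's `cythonize` tool, and importing code does not need to change at all.

```python
# align_core.py -- deliberately plain Python, no Cython-specific syntax.
# The same file can optionally be compiled to a C extension:
#
#     cythonize -i align_core.py
#
# Callers simply `import align_core` whether it is compiled or not.

def num_matches(seqA, seqB):
    """Count the matching positions of two equal-length sequences."""
    count = 0
    for a, b in zip(seqA, seqB):
        if a == b:
            count += 1
    return count
```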
I just realized why your NW algorithm is SO slow: just get rid of all that numpy code. What you need to know is: if you do one alignment of two very, very long strings, numpy arrays are probably fast, but if you do very, very many alignments of very short strings, you re-create numpy arrays each and every time, and that costs enormously. Just replace the np-arrays with normal lists, and it'll gain drastically in speed!
I originally had numpy arrays in lingpy. Testing on my benchmark always took about 30 minutes. Then, one day, I replaced all the arrays with lists, and that day I thought the code did not work, since it then required only 5 (!) minutes to run through all alignments. Numpy is great for some tasks, but not if you create many of those arrays.
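A quick way to see this effect yourself is to time just the matrix allocation for many short sequences; the absolute numbers will of course vary by machine, this is only a sketch:

```python
# Sketch: compare the cost of allocating many small DP matrices.
# For short words, repeated numpy allocation (plus element-wise
# indexing from pure Python) dominates the runtime.
import timeit

n, m = 6, 6  # typical word lengths

lists = timeit.timeit(
    "[[0] * (m + 1) for _ in range(n + 1)]",
    globals={"n": n, "m": m}, number=100000)
arrays = timeit.timeit(
    "np.zeros((n + 1, m + 1))",
    setup="import numpy as np",
    globals={"n": n, "m": m}, number=100000)

print("lists: %.3fs  numpy: %.3fs" % (lists, arrays))
```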
@LinguList, thank you :) I re-wrote the Needleman-Wunsch implementation and the code now runs much faster (down to 4 sec from 19 sec on my machine for the Japanese dataset).
Nice to see that this was indeed the thing that slowed it down...
The function in clustering is originally taken from and builds on lingpy's infomap clustering (you might want to mention this in the function description), and it has some code that you never use:
First, you don't use the `vertex_weights`, so why not kick that part of the algorithm out? Second, from lingpy 2.6 on, we also no longer use edge weights, as it turned out that they don't make a difference, so they can all be set to `None` now. In fact, I just spotted that lingpy still computes vertex weights, which I'll also change in our code. You should also note that the original idea of representing edge weights by `1 - cell` in lingpy (given that one needs to convert from distance to similarity), which you retain there, is probably as unjustified as using the logarithm or anything else: `1 - cell` is probably just too small to make any difference for infomap, since we're in floating-point range. The important aspect for all of these cluster algorithms is the threshold; if you feed them a complete graph with weights, they'll all return only one community.
Here's the lingpy function for comparison in the algorithm/extra module:
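A condensed sketch of what that kind of igraph wrapper looks like (assuming `python-igraph` is installed; this is an illustrative reconstruction, not lingpy's verbatim code):

```python
# Illustrative reconstruction of an igraph-based infomap clustering
# wrapper in the spirit of lingpy's version -- not a verbatim copy.
import igraph

def infomap_clustering(threshold, matrix, taxa):
    """Cluster taxa whose pairwise distances fall below the threshold."""
    graph = igraph.Graph(len(taxa))
    graph.vs["name"] = list(taxa)
    # Add an (unweighted) edge for each pair below the threshold;
    # as noted above, edge weights make little difference here.
    edges = [(i, j)
             for i in range(len(taxa))
             for j in range(i + 1, len(taxa))
             if matrix[i][j] <= threshold]
    graph.add_edges(edges)
    clusters = {}
    for cid, community in enumerate(graph.community_infomap()):
        clusters[cid] = [taxa[v] for v in community]
    return clusters
```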