Closed yxtay closed 8 years ago
I've had similar thoughts and agree it would be good to remove scikit-bio
. The original reason why I used scikit-bio
because it had PCoA (PCoA != PCA) and the original LDAvis
used PCoA and I wanted to retain as much parity as possible. Are you proposing to implement PCoA or to switch to PCA?
Hmm, I am not familiar with what is involved in PCoA. But based on issues 572 and 579 on scikit-bio
repository, the sklearn MDS
can be perhaps be used instead. It also takes in a distance matrix as input.
Looking at the R implementation of LDAvis, the cmdscale()
function is used. Given that the name of the function is "Classical (Metric) Multidimensional Scaling", I suppose it should be alright to use the MDS
from scikit-learn
as a stand in for PCoA
from scikit-bio
.
Try it out and let me know how it compares. You can compare visualizations made with both and also compare the actual coordinates computed to see how similar they are. Let me know if you have any questions. Thanks!
On Jun 15, 2016, at 7:56 PM, YuXuan Tay notifications@github.com wrote:
Looking at the R implementation of LDAvis, the cmdscale() function is used. Given that the name of the function is "Classical (Metric) Multidimensional Scaling", I suppose it should be alright to use the MDS from scikit-learn as a stand in for PCoA from scikit-bio.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
Hmm, from my understanding, there is some randomness involved in the MDS fitting. Not too sure exactly how we are going to evaluate the comparison at the moment, but I would like to point out in advance that the coordinates will move on different runs of MDS.
You should be able to pass in a random state to get determinism.
If it is the same MDS as R (and scikit-bio) then PCA will be used and the resulting basis will be the same. The components (eigenvectors) may have opposite signs due to implementation details which was the case with R and scikit-bio. That is why in my test that compares the R output pyLDAvis output I run the basis through abs
as a quick workaround.
With the test I linked to above you don't need to actually write any additional tests. Just run that one to see if it still lines up. If it doesn't but the visualization still looks correct then we can decide what to do then.
Here's the result on my attempt. The coordinates look, shall I say, quite different. I'm not too sure where to go from here. @bmabey, let me know what you think.
Does PCoA()
produce almost exactly the same coordinates as cmdscale()
in R?
Hmm, just went through the github for sklearn.
The current MDS implementation only includes SMACOF algorithm for metric MDS and there isn't an implementation of SVD for classical MDS, which is the one used by LDAvis in R. Referencing issue 1453 and pull request 4485, looks like this is being looked into. Not sure how long it will take as it has been stuck since Apr 2015, unfortunately.
Option now is we hold out, or smack on a quick implementation following something like this: http://www.nervouscomputer.com/hfs/cmdscale-in-python/ while we wait.
Yes, the R and scikit-bio implementation line up almost exactly (see my earlier comments about the test).
On Jun 17, 2016, at 6:11 AM, YuXuan Tay notifications@github.com wrote:
Here's the result on my attempt. The coordinates look, shall I say, quite different. I'm not too sure where to go from here. @bmabey, let me know what you think.
Does PCoA() produce almost exactly the same coordinates as cmdscale() in R?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
I like the idea of including our own implementation. Perhaps we could just lift the relevant code and modify it from scikit-bio (retaining license and attribution info of course).
On Jun 18, 2016, at 2:35 AM, YuXuan Tay notifications@github.com wrote:
Hmm, just went through the github for sklearn.
The current MDS implementation only includes SMACOF algorithm for metric MDS and there isn't an implementation of SVD for classical MDS, which is the one used by LDAvis in R. Referencing issue 1453 and pull request 4485, looks like this is being looked into. Not sure how long it will take as it has been stuck since Apr 2015, unfortunately.
Option now is we hold out, or smack on a quick implementation following something like this: (http://www.nervouscomputer.com/hfs/cmdscale-in-python/) while we wait.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
So, I attempted to implement classical multidimensional scaling following the codes in scikit-bio and the sklearn PR. Works, but it's not thoroughly tested.
http://nbviewer.jupyter.org/github/yxtay/pyLDAvis/blob/mds-implementation/notebooks/pyLDAvis_mds.ipynb Under the heading CMDS. Looks ok right now.
I propose to remove the dependencies on scikit-bio.
scikit-bio has recently undergo incompatible changes to the API, especially with regards to the
pcoa()
function. In addition, related to issue 57, it is still incompatible with Windows machines.After going through the codes, I see that only the
pcoa()
andDistanceMatrix()
functions are used from the scikit-bio package. These can be reimplemented with functions from scikit-learn only. Given the maturity of the sklearn package, it should be a good idea.I can try to implement these portions if necessary.