Closed: NikTuzov closed this issue 6 years ago.
We fixed this. See the updated code. Thanks!
Hi Nik,
We now avoid recomputing the powered operator (A^t) after picking the optimal t: instead, we iterate from t = 1 to t_max and stop once we have reached convergence.
Instead of computing the correlation between steps t and t+1, we now compute a Procrustes alignment (the optimal linear transformation) between t and t+1 and use the transformation error (the difference between t and t+1 after the transformation) as the convergence measure. The reason for this is that we now run fast, scalable MAGIC by doing the imputation on the PCA-reduced data, so at each t we compute the Procrustes disparity between the PCA at t and at t+1.
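For concreteness, a minimal sketch of that scheme (not the actual MAGIC implementation; `diff_op`, `data_pca`, `t_max`, and `tol` are illustrative names) might look like this:

```python
# Hedged sketch (assumptions: `diff_op` is the n x n Markov diffusion operator,
# `data_pca` is the n x d PCA-reduced data; parameter names are illustrative).
import numpy as np
from scipy.spatial import procrustes

def pick_optimal_t(diff_op, data_pca, t_max=15, tol=1e-3):
    """Stop at the first step where the Procrustes disparity between the
    smoothed PCA at consecutive steps falls below `tol`."""
    data_prev = data_pca
    for t in range(1, t_max + 1):
        data_curr = diff_op @ data_prev  # one more diffusion step
        # procrustes() finds the optimal translation/scaling/rotation aligning
        # the two embeddings and returns the residual disparity afterwards.
        _, _, disparity = procrustes(data_prev, data_curr)
        if disparity < tol:
            return t, data_curr
        data_prev = data_curr
    return t_max, data_prev
```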
Your bias-variance trade-off point is interesting, and it is something we have thought about but have not been able to put into practice. The way we pick the "optimal" t is based on the idea that the first diffusion steps remove (Gaussian/dropout) noise, and only at later t do we start removing actual signal. This produces different behavior at different t, so by looking at the convergence, i.e. the change between consecutive steps, we can find the point where most of the initial noise has been removed.
Hello:
Could you please answer a question about how the automatic diffusion time is determined? As I understand it (please correct me if I am wrong), the idea is that once the R^2 between the values imputed at t-1 and t reaches 95%, that t is picked as the optimal time. It looks like this is done to avoid unnecessary computation that would not change the result much anyway.
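In code, I imagine that criterion looks roughly like the sketch below (purely illustrative on my part; `diff_op`, `data`, and the function name are placeholders I made up):

```python
# Illustrative only: stop at the first t where the R^2 between the data
# imputed at t-1 and at t exceeds 0.95. `diff_op` and `data` are placeholders.
import numpy as np

def pick_t_by_r2(diff_op, data, t_max=15, threshold=0.95):
    prev = diff_op @ data                      # imputation at t = 1
    for t in range(2, t_max + 1):
        curr = diff_op @ prev                  # imputation at t
        r = np.corrcoef(prev.ravel(), curr.ravel())[0, 1]
        if r ** 2 >= threshold:                # R^2 between consecutive steps
            return t
        prev = curr
    return t_max
```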
However, when the user chooses the auto option, t is capped at 15. Using the "square and multiply" method, it shouldn't take much time to compute A^15 as A^8 · A^4 · A^2 · A; only 6 multiplications are required. Therefore, you might as well have set the diffusion time to 15 if the user doesn't want to bother with that parameter. What you did would make sense if t_max were set to 100, but not for 15. Does this sound right to you, or am I missing something?
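For reference, here is a minimal binary-exponentiation sketch showing that multiplication count (numpy already provides this via `np.linalg.matrix_power`; the function below is only for illustration):

```python
import numpy as np

def matrix_power_by_squaring(A, t):
    """Compute A^t (t >= 1) with O(log t) matrix multiplications.
    For t = 15: the squarings give A^2, A^4, A^8 (3 multiplications),
    and combining A, A^2, A^4, A^8 takes 3 more, i.e. 6 in total."""
    result = None
    base = A
    while t > 0:
        if t & 1:                              # this bit of t is set
            result = base if result is None else result @ base
        t >>= 1
        if t:                                  # square only while bits remain
            base = base @ base
    return result
```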
Secondly, that way of setting t is aimed at computational "optimality", but it does nothing to address the fact that t is a smoothing parameter (along with k and ka). Properly speaking, the "optimal" value of t should guarantee that the amount of smoothing applied to the data is neither "too much" nor "too little". In that case, one has to do something like cross-validation and then pick the value of t that results in the best bias-variance trade-off. This is more of a theoretical remark on my part because it probably requires too much computation to implement CV in this case.
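To make the idea concrete, a toy masking-based cross-validation over t could look like the sketch below (purely a sketch of the idea, not something MAGIC does; all names are made up, and it ignores the computational cost issue I mentioned):

```python
# Sketch of the CV idea only: hide a random subset of observed entries,
# run the diffusion imputation for each candidate t, and pick the t with
# the lowest held-out error. `diff_op` and `data` are placeholders.
import numpy as np

def pick_t_by_cv(diff_op, data, t_candidates=(1, 3, 6, 9, 12, 15),
                 mask_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    observed = np.argwhere(data != 0)                 # entries we can hold out
    held_out = observed[rng.random(len(observed)) < mask_frac]
    rows, cols = held_out[:, 0], held_out[:, 1]
    masked = data.copy()
    masked[rows, cols] = 0                            # pretend these dropped out
    errors = {}
    smoothed = masked
    for t in range(1, max(t_candidates) + 1):
        smoothed = diff_op @ smoothed                 # one more diffusion step
        if t in t_candidates:
            errors[t] = np.mean((smoothed[rows, cols] - data[rows, cols]) ** 2)
    return min(errors, key=errors.get), errors        # t with lowest held-out MSE
```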
Regards, Nik Tuzov