emreg00 / toolbox

Toolbox - generic utilities for data processing (e.g., parsing, proximity, guild scoring, etc...)

Running speed of calculating the proximity of a drug-disease pair #3

Open gllspeed opened 1 year ago

gllspeed commented 1 year ago

Hi Emre, I used your code to reproduce the proximity between drugs and diseases from your paper. Calculating the mean and standard deviation of the distance for a single drug-disease pair takes 1000 randomizations, so the code runs very slowly. Is there any way to speed it up? I look forward to your answer, thank you!

emreg00 commented 1 month ago

Hi Gllspeed,

Sorry for my delayed response.

In principle, you can reduce the number of randomizations (e.g., to 200 or even 100) at the cost of weaker statistical confidence in the z-scores. You can then re-run the pairs that look promising (e.g., those with z < -0.5 or z < -1, depending on the application and the scale-freeness of the input network) with a higher number of randomizations (e.g., 1000) to get more reliable z-scores.
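The two-pass idea above could be sketched roughly as follows. This is only an illustration, not the toolbox's own code: `z_score`, `two_stage_screen`, and the two callbacks `observed_fn` and `sample_null_fn` are hypothetical names, and the cutoff and randomization counts are just the example values mentioned above.

```python
import statistics


def z_score(observed, null_samples):
    """Standardize the observed distance against the randomized distances."""
    mean = statistics.mean(null_samples)
    std = statistics.pstdev(null_samples)  # population std; an assumption
    if std == 0:
        return 0.0  # no spread in the null distribution, z is undefined
    return (observed - mean) / std


def two_stage_screen(pairs, observed_fn, sample_null_fn,
                     n_coarse=100, n_fine=1000, z_cutoff=-0.5):
    """Cheap pass on every pair, full pass only on promising pairs.

    observed_fn(pair)    -> observed distance (hypothetical callback)
    sample_null_fn(pair) -> one randomized distance (hypothetical callback)
    """
    results = {}
    for pair in pairs:
        observed = observed_fn(pair)
        # First pass: few randomizations, so the z-score is noisy.
        coarse = [sample_null_fn(pair) for _ in range(n_coarse)]
        z = z_score(observed, coarse)
        if z < z_cutoff:
            # Second pass: re-estimate with the full number of randomizations.
            fine = [sample_null_fn(pair) for _ in range(n_fine)]
            z = z_score(observed, fine)
        results[pair] = z
    return results
```

Pairs that do not pass the cutoff keep their noisy first-pass z-score, so only the re-run pairs should be treated as reliable.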

Another way to reduce the running time is to load all pairwise shortest paths into memory instead of calculating them for each pair separately, provided that you have sufficient memory. It will take a while to calculate or load all the paths, but once that is done the randomizations will be much faster, since each shortest path distance becomes a lookup in the shortest paths dictionary.
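As a rough sketch of the lookup idea, assuming an unweighted, connected network stored as a plain adjacency dictionary (the toolbox itself works on its own network objects, so the names `bfs_lengths`, `all_pairs_lengths`, and `closest_distance` here are hypothetical):

```python
from collections import deque


def bfs_lengths(adjacency, source):
    """Unweighted shortest path lengths from source to every reachable node."""
    lengths = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in lengths:
                lengths[neighbor] = lengths[node] + 1
                queue.append(neighbor)
    return lengths


def all_pairs_lengths(adjacency):
    """Precompute lengths[u][v] once; memory grows with the square of the node count."""
    return {node: bfs_lengths(adjacency, node) for node in adjacency}


def closest_distance(lengths, nodes_from, nodes_to):
    """Mean over nodes_from of the distance to the nearest node in nodes_to.

    Uses dictionary lookups only, so it is cheap to call once per randomization.
    Assumes every target is reachable (connected network); otherwise the
    lookup raises KeyError. The "closest" measure is an assumed example.
    """
    total = 0.0
    for source in nodes_from:
        total += min(lengths[source][target] for target in nodes_to)
    return total / len(nodes_from)
```

With this in place, `all_pairs_lengths` is called once up front, and each randomized node set is scored with `closest_distance` without any further graph traversal. If the network is a networkx graph, `networkx.all_pairs_shortest_path_length` produces a comparable per-node dictionary.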

Hope this helps.