AkiNikolaidis / BaggingSubtyping


Parallelization #6

Open AkiNikolaidis opened 4 years ago

AkiNikolaidis commented 4 years ago

Jake & Paul - I've added a jupyter notebook showing how to use the multithreading package in python.

I recommend the outline of the functions look something like this:

1) Splitting function (parallelized) --> 2) Louvain clustering (saves a cluster label .npy file to disk) --> 3) Analysis of reproducibility and predictive value of bagged vs. standard clustering (pulls everything from disk).

In other words, the parallelization happens and all the data are created; once that's finished, a function is run to collect all the outputs and compute the essential reproducibility metrics and phenotypic comparisons.

As you can see, you need to define what you want to parallelize and then define the parallelization function. You might need to do something like create a hash, or use a random number generator to save unique .npy cluster label files or something like that.
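To make the unique-filename idea concrete, here is a minimal sketch of the proposed pipeline. All names (`cluster_one_split`, `collect_results`, the output directory) are hypothetical, and a trivial stand-in replaces the actual Louvain clustering; the point is that each parallel worker writes its labels to a uniquely named .npy file (via `uuid`), and a serial step pulls everything back from disk afterward.

```python
# Sketch only: hypothetical function names; a modulo stand-in replaces
# Louvain clustering. Each worker saves labels under a unique filename.
import os
import uuid
import tempfile
import multiprocessing

import numpy as np

OUTDIR = tempfile.mkdtemp()

def cluster_one_split(seed):
    """Worker: draw a bootstrap split, cluster it, save labels to disk."""
    rng = np.random.default_rng(seed)
    sample = rng.integers(0, 100, size=50)  # stand-in for a data split
    labels = sample % 3                     # stand-in for Louvain labels
    # uuid4 gives each worker a unique filename, so parallel writes
    # never collide (a hash of the split would work the same way)
    path = os.path.join(OUTDIR, 'labels_{}.npy'.format(uuid.uuid4().hex))
    np.save(path, labels)
    return path

def collect_results(outdir):
    """Serial step: pull every saved label file back from disk."""
    return [np.load(os.path.join(outdir, f))
            for f in sorted(os.listdir(outdir)) if f.endswith('.npy')]

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        paths = pool.map(cluster_one_split, range(10))
    all_labels = collect_results(OUTDIR)
    print(len(all_labels))
```

The reproducibility analysis in step 3 would then operate on `all_labels` once, after the pool has finished.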

Define the function to parallelize:

```python
# Multithreading with Pool
import time
import multiprocessing

def basic_func(x):
    if x == 0:
        return 'zero'
    elif x % 2 == 0:
        return 'even'
    else:
        return 'odd'
```

Define the multiprocessing function:

```python
def multiprocessing_func(x):
    y = x * x
    time.sleep(2)
    print('{} squared results in a/an {} number'.format(x, basic_func(y)))
    return y
```

Run multithreaded function:

```python
if __name__ == '__main__':
    starttime = time.time()
    pool = multiprocessing.Pool()
    pool.map(multiprocessing_func, range(0, 10))
    pool.close()
    print('That took {} seconds'.format(time.time() - starttime))
```
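If helpful, the same run can use the context-manager form of `Pool`, which tears the pool down automatically on exit; this is just an alternative idiom, not a change to the approach above.

```python
import time
import multiprocessing

def multiprocessing_func(x):
    y = x * x
    time.sleep(2)
    return y

if __name__ == '__main__':
    starttime = time.time()
    # The with-block terminates the pool automatically when it exits
    with multiprocessing.Pool() as pool:
        results = pool.map(multiprocessing_func, range(0, 10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    print('That took {} seconds'.format(time.time() - starttime))
```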
AkiNikolaidis commented 4 years ago

Formatting of the code didn't work; check the Pool section of the Python notebook for the correct code.

pab2163 commented 4 years ago

@AkiNikolaidis thanks! looks very useful, i'll give this a try soon

pab2163 commented 4 years ago

gave this a try see #8