Implement parallelisation

ssb22 commented 3 years ago

PrimerPooler is supposed to be fast, and part of its speed comes from parallelisation. The Java port is currently not parallelised, so will be slower than the C version (and I'd like users to know if the C version is many times faster☺)

Java does not have OpenMP, but we may be able to port our use of it into standard Java threads.

For dGandScoreCounts and dGprintBonds, in each case the #pragma omp parallel simply means "find out how many cores the CPU has (I think it's Runtime.getRuntime().availableProcessors() in Java) and start that number of threads, each of them running a copy of this block". So to do that in Java we'd have to put the block into a public void run() method of a class that 'extends Thread, then create N instances of that class, call.start()on each (so they all start), then call.join()` on each (so we wait till they've all finished).

Each thread will want to know what its thread number is, so we'd best pass the loop counter in to the constructor and save it somewhere so we can pass it in to the call to Triangle.t_iBounds(np) (which needs to know thread number and total number of threads, like "we are thread number 3 of 4", so it can calculate a sensible share of the run for itself: the code to do this is already ported, we just need to set its nThreads and tNum to something other than 1 and 0).

And then there is the matter of the #pragma omp critical section, which is equivalent to synchronized in Java. So that block would have to go into a method of AllPrimers (or some other class of which there is only one instance), declared synchronized so that only one thread at a time can run it (with parameters bucket and score), and the counts, minScore and maxScore arrays will also have to be looked after by that object. Similarly for dGprintBonds which needs to synchronize access to sr in the same way.

poolsplit_thread in Splitter.java can also be parallelised in this way, but beware it has multiple critical sections (and each thread needs its own copy of all parameters, rather than having them all shared in the Splitter object). Note also the use of ThreadRand which must return a different random sequence to each thread (otherwise there's no point in parallelising here); this is also why the C version does not parallelise here if seedless is set.

The most difficult part will be the parallelisation of the genome scan. In this case, we cannot rely on "share the chromosomes equally between the threads" being a good strategy (we don't know these chromosomes are the same lengths, they might be very different). So the C version uses #pragma omp parallel for schedule(dynamic) which basically means "create N threads (one per core), give one chromosome to each thread, and then whenever any thread finishes, give it another chromosome, until there are no chromosomes left, then wait for all threads to finish".

ssb22 commented 3 years ago

Something like this might help with the implementation of #pragma omp parallel for schedule(dynamic) in the genome scan: https://stackoverflow.com/a/31885029 that would allow us to wait for the first of N threads to finish, which would then allow us to start a new thread when one of the N has finished (a little more overhead than assigning new work to an already-running thread, but negligible in this case).

ahmedihafez commented 3 years ago

Without doubts, The C version is faster :), even when we will implement parallelisation in The Java version. I would like to see this option implemented for sure, I guess you already covered most of the details on your comments. I will start to make this implementation and hope to have it soon. Many thanks for your help.

ahmedihafez / primers-java

Implement parallelisation #8