chrisarg / bio-seqalignment-examples-enhancing-edlib

Parallelizing the Edlib sequencing library using MCE and OpenMP in Perl

Perl MCE tips #1

Closed marioroy closed 3 months ago

marioroy commented 3 months ago

Hi Christos,

I want to share a helpful clarification regarding MCE chunking.

Re: https://github.com/chrisarg/bio-seqalignment-examples-enhancing-edlib/blob/main/scripts/timings_chunk.png

Q. Why does greater chunk_size run slower?

A. This can be seen by monitoring top/htop. Near the end of the job, one worker finishes, then another, and so on. The Perl examples benefit from chunk_size => 1 because seq_align is CPU-intensive. Increasing chunk_size leaves idle workers at the end of the job waiting longer for the remaining workers to finish processing.

The following is a small demonstration measuring MCE's inter-process communication. For your examples, there is little reason to increase chunk_size unless you are processing more than 100,000 elements or each element involves only a fast computation.

use v5.030;
use MCE;
use MCE::Candy;
use Time::HiRes qw(time);

my @workload = ( 1 .. 2000 );
my @results;

my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => 1,
    gather      => MCE::Candy::out_iter_array( \@results ),
    posix_exit  => 1,
    user_func   => sub {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        my @chunk_results;
        for my $chunk ( @{$chunk_ref} )  {
            push @chunk_results, $chunk * 2;
        }
        $mce->gather( $chunk_id, @chunk_results );
    }
);

my $start = time;
$mce->process(\@workload);
my $stop = time;
$mce->shutdown;

printf "process time: %.3f seconds\n", $stop - $start;
printf "results size: %d\n", scalar @results;

Time:

@workload = ( 1 .. 2000 );
process time: 0.030 seconds  # chunk_size => 1
results size: 2000

@workload = ( 1 .. 1_000_000 );
process time: 7.209 seconds  # chunk_size => 1
results size: 1000000

process time: 1.708 seconds  # chunk_size => 5
results size: 1000000

process time: 0.963 seconds  # chunk_size => 10
results size: 1000000

process time: 0.485 seconds  # chunk_size => 20
results size: 1000000
chrisarg commented 3 months ago

Mario, thank you for reaching out and for the kind words! If I understand your suggestion correctly, I should probably be calling shutdown explicitly to avoid the persistence of the workers inflating the execution time. I will rerun the examples with explicit reaping of the workers and get the timing information into the presentation.

MCE is amazing; it fits a very nice niche between OpenMP and MPI for High Performance Computing. Frankly, I can see one using MPI for internode communication, combined with an oversubscription strategy of MCE+OpenMP to squeeze performance out of a single node.

There is some other material I will not have time to cover in the talk, e.g. the very promising experiments with inlined SIMD assembly code under the control of MCE. That will be released later in the year :)

On Thu, Jun 13, 2024 at 12:15 AM Mario Roy @.***> wrote:

The other day, I noticed your module on metacpan and took a look at how MCE is used.

Re: SeqMapping/Dataflow modules https://github.com/chrisarg/bio-seqalignment-components-seqmapping/tree/main/lib/Bio/SeqAlignment/Components/SeqMapping/Dataflow

Your MCE demonstrations are amazing. I want to share a tip: omitting shutdown causes idle MCE workers to linger until the script completes or terminates.

Q. Does MCE reap workers automatically before leaving the scope?
A. Yes, since MCE v1.896. MCE saves the reference internally, used by various class methods, i.e. MCE->print, MCE->printf, MCE->say, MCE->wid, et cetera.

$mce->run( 1, { input_data => ... } );  # includes reaping workers, shutdown
$mce->run( 0, { input_data => ... } );  # workers persist, same as $mce->process
$mce->process( ... );                   # workers persist
$mce->shutdown();                       # necessary to reap workers

I released MCE v1.896 to reap workers automatically upon leaving the scope, in the event not calling $mce->shutdown internally i.e. run(0), or explicitly.


marioroy commented 3 months ago

Hi Christos,

If I understand your suggestion correctly, I should probably be calling shutdown explicitly...

Yes, preferably. The recent MCE v1.896 release weakens the internal MCE reference so that workers are reaped immediately upon leaving the scope, in the event shutdown is not called explicitly.

Thanks for the kind words.

chrisarg commented 3 months ago

Thanks, Mario, for flagging this; I will take a look. It is likely that the library-building function (the one in which you had to disable OpenMP) forgets to reset something within OMP to allow the threads to be used. (I should also check on an i7, just to make sure that the dual Xeon is not special somehow.) The general thinking from HPC books is that OpenMP is a low-barrier, moderate enhancer of performance (one usually peaks around 2-4 threads, depending on the code). If you have the time, can you run a scenario with 16 workers and 4 threads, and 32 workers and 2 threads? It may be informative.

On Thu, Jun 13, 2024 at 2:42 PM Mario Roy @.***> wrote:

I tried your testMapMCE_openMP.pl example without success i.e. segfault calling C_INDEX => make_C_index( @.*** ).

Searching, I found the actual C function inside lib/Bio/SeqAlignment/Components/Libraries/edlib/OpenMP.pm. I ended up disabling OpenMP for just this function.

SV* _make_C_index(AV* sequences) {
    int i;
    int n = av_len(sequences) + 1;
    // _ENV_set_num_threads();
    size_t buffer_size = n * sizeof(Seq); // how much space do we need?
    NEW_SV_BUFFER(retval, buf, buffer_size);
    Seq* RefDB = (Seq*)buf;
    // #pragma omp parallel
    // {
    /* pid_t pid = getpid();
       printf("Process ID: %d, Threads %d out of %d\n",
              pid, omp_get_thread_num(), omp_get_num_threads()); */

    int nthreads = omp_get_num_threads();
    size_t thread_id = omp_get_thread_num();
    size_t tbegin = thread_id * n / nthreads;
    size_t tend = (thread_id + 1) * n / nthreads;
    for (size_t i = tbegin; i < tend; i++) {
        SV** elem = av_fetch_simple(sequences, i, 0); // perl 5.36 and above
        STRLEN len = SvCUR(*elem);
        RefDB[i].seq_address = (uintptr_t)SvPVbyte_nolen(*elem);
        RefDB[i].seq_len = len;
    }

    // }
    return retval;
}

The above change allowed me to run the OpenMP demonstration. Interesting, indeed. The results were captured on an AMD Ryzen Threadripper 3970X machine, DDR4 3600 MHz, Ubuntu Linux 24.04.

testMapMCE_openMP.pl

my @workers = (1, 2, 4, 8); my @threads = (4, 8);

Max workers: 1, Num threads: 4, took 47.23 seconds
Max workers: 1, Num threads: 8, took 35.48 seconds
Max workers: 2, Num threads: 4, took 24.14 seconds
Max workers: 2, Num threads: 8, took 18.02 seconds
Max workers: 4, Num threads: 4, took 12.24 seconds
Max workers: 4, Num threads: 8, took 9.32 seconds
Max workers: 8, Num threads: 4, took 6.54 seconds
Max workers: 8, Num threads: 8, took 5.70 seconds

my @workers = (32); my @threads = (2);

Max workers: 32, Num threads: 2, took 4.87 seconds

testMapMCE_overC.pl

my @workers = ( 4, 8, 16, 32, 64 ); my @chunk_sizes = ( 1 );

Workers  Time             Chunk_Size
4        42.029340982437  1
8        21.610571146011  1
16       11.502110958099  1
32       7.0929610729218  1
64       5.4639630317688  1


chrisarg commented 3 months ago

Mario, I tracked down the issue. The original code would throw a segmentation fault on systems that did not define the environment variable OMP_NUM_THREADS. On such systems the segfault arose when the reference sequence database was being indexed using an OpenMP parallel region: upon encountering the parallel region, the code would attempt to fork the tasks to ZERO threads, leading to the segfault. Including an explicit initialization of the number of OMP threads from within the script averts the segmentation fault. This line was included in the testMapMCE_openMP.pl file, and the CPAN and GitHub distros were updated. Let me know if the parallel database creation works OK for you now. I also updated the SeqMapping modules to explicitly reap the workers.

###############################################################################
my $openmp_env = OpenMP::Environment->new;
$openmp_env->omp_num_threads(1);  ## at least one thread for systems w/o OMP env
###############################################################################
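The same defensive idea can be sketched on the C side. This is an illustrative fragment, not the module's code, and `safe_block_end` is a hypothetical name: if the effective thread count ever comes back as zero, the blockwise begin/end arithmetic divides by zero, so clamping to at least one thread keeps the partition well-defined.

```c
#include <stddef.h>

/* Hypothetical guard illustrating the failure described above: with an
 * effective thread count of zero, the blockwise partition arithmetic
 * divides by zero. Clamping to at least one thread keeps the computed
 * iteration range well-defined. */
size_t safe_block_end(size_t tid, int nthreads, size_t n) {
    if (nthreads < 1) nthreads = 1;  /* never divide by zero */
    return (tid + 1) * n / (size_t)nthreads;
}
```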


marioroy commented 3 months ago

I tracked down the issue: The original code would throw a segmentation fault in systems that did not define the environmental variable OMP_NUM_THREADS.

That's great! You're ready for the talk. Everything works.

If you have the time, can you run a scenario with 16 workers and 4 threads and 32 workers and 2 threads? It may be informative.

I find this interesting.

Alien::SeqAlignment::edlib v0.10
Bio::SeqAlignment::Components::Libraries::edlib v0.02
Bio::SeqAlignment::Components::SeqMapping v0.02
Bio::SeqAlignment::Examples::EnhancingEdlib v0.02

# testMapMCE_openMP.pl

Max workers:  8, Num threads: 8, took 5.71 seconds
Max workers: 16, Num threads: 4, took 4.72 seconds
Max workers: 32, Num threads: 2, took 4.96 seconds
Max workers: 64, Num threads: 1, took 5.26 seconds

# testMapMCE_overC.pl

64      5.83489513397217        1
marioroy commented 3 months ago

Another tip! MCE workers may exit faster simply by giving the posix_exit option, which causes workers to skip all END and destructor processing.

my $mce = MCE->new(
    max_workers => $max_workers,
    chunk_size  => $chunk_size,
    gather      => MCE::Candy::out_iter_array( \@results ),
    posix_exit  => 1,
    user_func   => sub {
        ...
    }
);
# testMapMCE_overC.pl

64      5.83489513397217        1
64      5.42461490631104        1   posix_exit => 1
marioroy commented 3 months ago

Hi Christos,

I meant to share one more tip, this one for improving OpenMP performance. Dividing the workload evenly by the number of OpenMP threads results in idle cores at the end of the job: an OpenMP thread may complete its share early, only to sit idle waiting for the other threads to finish. The behavior is similar to running MCE over C with a chunk size greater than 1.

I'm traveling at the moment. The results were captured on an Intel-based macOS - Turbo Boost disabled and 42 watts power limit (via Volta).

Max workers: 4, Num threads: 2, took 43.14 seconds (before change)
Max workers: 4, Num threads: 2, took 35.41 seconds (better CPU utilization)

Change made to .../edlib/OpenMP.pm (_edlib_align function).

@@ -359,14 +359,10 @@
     omp_get_num_threads());
     */

-    int nthreads = omp_get_num_threads();
-    size_t thread_id = omp_get_thread_num();
-    size_t tbegin = thread_id * n_of_seqs / nthreads;
-    size_t tend = (thread_id + 1) * n_of_seqs / nthreads;
-
     int thread_min_val = INT_MAX;
     int thread_min_idx = -1;
-    for (size_t i = tbegin; i < tend; i++) {
+#pragma omp for schedule(static, 1)
+    for (size_t i = 0; i < n_of_seqs; i++) {
       EdlibAlignResult align =
           edlibAlign(query_seq, query_len, (char *)RefDB[i].seq_address,
                      RefDB[i].seq_len, alignconfig);
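As a rough sketch of why schedule(static, 1) helps here (illustrative code only, not part of the distribution): the removed tbegin/tend arithmetic gives each thread one contiguous block of indices, while a chunk size of 1 deals indices out round-robin, so uneven per-item costs are spread across threads rather than concentrated in one thread's block.

```c
#include <stddef.h>

/* Blockwise partition: thread `tid` owns the contiguous range
 * [tid*n/nthreads, (tid+1)*n/nthreads), as in the removed code. */
size_t block_begin(size_t tid, size_t nthreads, size_t n) {
    return tid * n / nthreads;
}
size_t block_end(size_t tid, size_t nthreads, size_t n) {
    return (tid + 1) * n / nthreads;
}

/* Round-robin: item i goes to thread i % nthreads, which is effectively
 * what schedule(static, 1) does with a chunk size of 1. */
size_t rr_owner(size_t i, size_t nthreads) {
    return i % nthreads;
}
```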
chrisarg commented 3 months ago

Thanks, Mario, for the tip. I didn't get around to doing scheduling, mostly because I was waiting for Brett Estrade's presentation on OpenMP at TPRC. See example 3 in https://metacpan.org/pod/OpenMP::Environment for how one can set this up from Perl; on the C side, one needs to run the parallel region with runtime scheduling. As the alignment tasks will vary in compute load (the reference sequences all have different lengths), playing with the schedule is paramount (or at least so I think).

christos
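A minimal sketch of the runtime-scheduling idea (illustrative only; `sum_costs` and the cost array are stand-ins, not the distribution's code): with schedule(runtime) the loop schedule is read from the OMP_SCHEDULE environment variable, e.g. OMP_SCHEDULE="dynamic,1", which OpenMP::Environment can set from Perl without recompiling the C side.

```c
#include <stddef.h>

/* With schedule(runtime), the loop schedule comes from OMP_SCHEDULE at
 * run time (e.g. "dynamic,1"), so Perl can select it via
 * OpenMP::Environment. Compiled without OpenMP, the pragma is ignored
 * and the loop runs serially with the same result. */
double sum_costs(const double *cost, size_t n) {
    double total = 0.0;
    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (size_t i = 0; i < n; i++)
        total += cost[i];
    return total;
}
```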


chrisarg commented 3 months ago

Mario, this is a good experiment. I refrained from using #pragma omp for because a) I am not comfortable writing my own reductions YET, and b) one cannot provide custom reduction functions at runtime (so I ended up using a critical region, which can at least be modified to call a function pointer). If you have time, can you modify your code to run it as #pragma omp for schedule(dynamic, 1)? On my Xeon, dynamic scheduling gives better performance than static.

Christos


marioroy commented 3 months ago

This is a good experiment. If you have time, can you modify your code to run it as #pragma omp for schedule(dynamic, 1) ?

Yes, indeed. And yes, dynamic runs ~0.2 seconds faster on my laptop. I also added the MCE option posix_exit => 1 to the SeqMapping/Dataflow modules.

# Intel MacBook - Turbo Boost Disabled, Power Limit 42 watts

Max workers: 8, Num threads:  2, took 30.10 seconds - static
Max workers: 8, Num threads:  2, took 29.90 seconds - dynamic
Max workers: 4, Num threads:  4, took 29.98 seconds - dynamic
Max workers: 2, Num threads:  8, took 30.18 seconds - dynamic
Max workers: 1, Num threads: 16, took 30.96 seconds - dynamic

# Intel MacBook - Turbo Boost Enabled, Power Limit not limited

Max workers: 8, Num threads:  2, took 20.32 seconds - static
Max workers: 8, Num threads:  2, took 20.11 seconds - dynamic
Max workers: 4, Num threads:  4, took 20.16 seconds - dynamic
Max workers: 2, Num threads:  8, took 20.38 seconds - dynamic
Max workers: 1, Num threads: 16, took 20.72 seconds - dynamic

so ended up using a critical region

The critical region is fine. There is one more tip I missed mentioning earlier: add nowait to the omp for pragma. This allows completed threads to enter the critical block sooner, saving more time versus waiting for all threads to complete the for block.

#pragma omp for schedule(dynamic, 1) nowait
Max workers: 1, Num threads: 16, took 20.72 seconds
Max workers: 1, Num threads: 16, took 20.52 seconds - nowait
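The nowait pattern above can be sketched as follows (illustrative code; `find_min_index` and the integer cost array stand in for the edlibAlign scoring loop): each thread reduces into a thread-local minimum, skips the implicit barrier after the for loop, and merges its result in the critical block as soon as it finishes its share.

```c
#include <limits.h>
#include <stddef.h>

/* Sketch of the dynamic + nowait + critical pattern discussed above.
 * Each thread reduces into thread-local (t_min_val, t_min_idx); `nowait`
 * drops the barrier after the loop, so a finished thread proceeds
 * directly to the critical merge. Serial compilation (pragmas ignored)
 * yields the same answer. */
int find_min_index(const int *cost, size_t n) {
    int min_val = INT_MAX;
    int min_idx = -1;
    #pragma omp parallel
    {
        int t_min_val = INT_MAX;
        int t_min_idx = -1;
        #pragma omp for schedule(dynamic, 1) nowait
        for (size_t i = 0; i < n; i++) {
            if (cost[i] < t_min_val) {
                t_min_val = cost[i];
                t_min_idx = (int)i;
            }
        }
        #pragma omp critical
        {
            if (t_min_val < min_val) {
                min_val = t_min_val;
                min_idx = t_min_idx;
            }
        }
    }
    return min_idx;
}
```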
marioroy commented 3 months ago

Pinning MCE workers and OpenMP threads to a NUMA node may be helpful. The following is a tip for NUMA architectures with two or more CPU sockets. Obtain NUMA cpu lists at the top of the SeqMapping/Dataflow modules.

my @numa_nodes;

BEGIN {
    if ( $^O eq 'linux' ) {
        @numa_nodes = qx{ lscpu | awk '/^NUMA node[0-9]/ { print \$NF }' };
        chomp @numa_nodes;
    }
}

Add the user_begin option to MCE.

user_begin  => sub {
    my ( $mce ) = @_;
    if ( $^O eq 'linux' && scalar(@numa_nodes) > 1 ) {
        my $index = ($mce->wid - 1) % scalar(@numa_nodes);
        my $cpu_list = $numa_nodes[ $index ];
        my $taskset_cmd = "taskset --cpu-list --pid $cpu_list $$";
        my $taskset_result = qx( $taskset_cmd 2>/dev/null );
    }
},

An extra check is needed to do this for testMapMCE_openMP.pl only.

I will try this when I get back home in a week.

chrisarg commented 3 months ago

Implemented some of these changes. First, I compared the MCE mappers with posix_exit => 0 vs posix_exit => 1: [image: timings_MCE_Exit.png] The change does not make much difference, but it seems a good optimization theoretically, so I will retain it.

Then I compared the old OMP scenario vs. static/nowait and dynamic/nowait, while changing the parallel code to use #pragma omp for (rather than the explicit calculation of start/stop iterations). The scenario for one worker and up to 72 threads is below.

[image: timings_OMP_Code.png] for/dynamic/nowait seems to make a difference, so I moved on to examine the combination of MCE with 2-72 workers and 2-8 threads.

[image: timings_OMP_WorkersXThreads_Code.png] Regression analysis shows that for/static/nowait is 4% slower than for/dynamic/nowait, and the old method (a parallel region without #pragma omp for) is 12.3% slower. v0.03 was updated to for/dynamic/nowait. I refrained from using the taskset and numactl suggestions, as I have to think about how best to integrate them into the OO design of the packages; I suspect that the affinity of the OMP cores will also play a big role there. My back-of-the-envelope calculation shows that with 72 workers/threads and without oversubscribing, the optimal performance is ~2.5-3 sec on my machine. If one cannot overcome the memory speed issues created by NUMA, we will end up with something like 5.3 sec on this machine. The best performance on these datasets is 5.7 sec, so we are really close to that optimum. If numactl and taskset can be integrated in a manner that does not disrupt the interface (it may not be possible with the "Generic" modules, but may be possible with the final application versions), we will end up improving performance by another 100%. I may take the C versions of the program through the Intel profiling tools to see what the roofline diagram of the code looks like.


marioroy commented 3 months ago

Tuning: 8 workers / 8 threads runs best on the AMD box. I applied the various optimizations progressively; thus, the posix_exit row also includes for/dynamic/nowait.

AMD Ryzen Threadripper 3970X 32-Cores, 64-Threads
3600 MHz DDR4-CL16

workers:  8, threads:  8

before..............................:  5.31 seconds
omp for schedule(static,1) nowait...:  3.86 seconds
omp for schedule(dynamic,1) nowait..:  3.66 seconds
posix_exit => 1.....................:  3.62 seconds