miladsade96 opened this issue 4 years ago
OpenMP is specified using directives, which in Fortran are just comments, so if a file is compiled without OpenMP enabled, they are simply ignored. So I think it's fine to allow OpenMP in our code base: if users do not want stdlib to be parallelized using OpenMP, they can just compile the stdlib files without OpenMP enabled.
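For instance, here is a minimal sketch (the program is illustrative, not from stdlib): compiled with gfortran -fopenmp the loop is threaded, while without the flag the directive is treated as an ordinary comment and the program runs serially.

program omp_is_optional
  implicit none
  integer :: i
  real :: s
  s = 0.0
  ! To a compiler invoked without OpenMP support, the !$omp sentinel below
  ! is an ordinary Fortran comment, so the loop then runs serially.
  !$omp parallel do reduction(+:s)
  do i = 1, 1000
    s = s + real(i)
  end do
  !$omp end parallel do
  print *, 's =', s
end program omp_is_optional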
However, some others think that stdlib should not use openmp, and rather users should parallelize themselves:
https://github.com/fortran-lang/stdlib/pull/189/files#r426173077
However, it's not clear to me how to do that in the context of the csr_matvec routine: if it is to run in parallel, it needs to be parallelized from the inside. Perhaps we could provide two versions of the subroutine, one serial and one parallel.
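For concreteness, a parallel version could look roughly like this (a minimal sketch; the interface and argument names are assumptions, not the actual code from PR #189):

subroutine csr_matvec_parallel(row_ptr, col_idx, val, x, y)
  ! Hypothetical CSR matrix-vector product, parallelized from the inside:
  ! the routine opens its own parallel region.
  integer, intent(in) :: row_ptr(:), col_idx(:)
  real, intent(in) :: val(:), x(:)
  real, intent(out) :: y(:)
  integer :: i, j
  !$omp parallel do private(j)
  do i = 1, size(y)
    y(i) = 0.0
    do j = row_ptr(i), row_ptr(i+1) - 1
      y(i) = y(i) + val(j)*x(col_idx(j))
    end do
  end do
  !$omp end parallel do
end subroutine csr_matvec_parallel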
@zerothi, do you want to discuss it more here?
If OpenMP is used in stdlib, I think it should be with orphaned procedures. The user can then control the parallelization (e.g. to call a stdlib procedure inside or outside a parallel region). Using parallel regions inside stdlib procedures will limit their utility.
Can you give an example? What is an orphaned procedure?
Here is an example with an orphaned subroutine (the example is contrived, but it illustrates an OpenMP orphaned procedure):
program test
  !$ use omp_lib
  implicit none
  integer :: i
  real :: a(5)
  a = [(i, i=1, 5)]
  print *, ' Outside a parallel region'
  call printa(a)
  print *, ' Inside a parallel region'
  !$omp parallel
  call printa(a)
  !$omp end parallel
contains
  ! The worksharing construct below is "orphaned": it appears in a procedure
  ! that does not open a parallel region itself, and binds to whatever
  ! parallel region (if any) is active in the caller.
  subroutine printa(a)
    real, intent(in) :: a(:)
    integer :: i, rang
    rang = -1
    !$omp do
    do i = 1, size(a)
      !$ rang = omp_get_thread_num()
      print *, 'value: ', a(i), ' at thread ', rang
    end do
    !$omp end do
  end subroutine
end program
Compiled without OpenMP, the output is:
Outside a parallel region
value: 1.00000000 at thread -1
value: 2.00000000 at thread -1
value: 3.00000000 at thread -1
value: 4.00000000 at thread -1
value: 5.00000000 at thread -1
Inside a parallel region
value: 1.00000000 at thread -1
value: 2.00000000 at thread -1
value: 3.00000000 at thread -1
value: 4.00000000 at thread -1
value: 5.00000000 at thread -1
Compiled with OpenMP (and run with 3 threads):
Outside a parallel region
value: 1.00000000 at thread 0
value: 2.00000000 at thread 0
value: 3.00000000 at thread 0
value: 4.00000000 at thread 0
value: 5.00000000 at thread 0
Inside a parallel region
value: 3.00000000 at thread 1
value: 4.00000000 at thread 1
value: 1.00000000 at thread 0
value: 2.00000000 at thread 0
value: 5.00000000 at thread 2
Ah I see, that's what @zerothi meant in the comment at https://github.com/fortran-lang/stdlib/pull/189/files#r426173077. The idea is to use OpenMP, but never to use the omp parallel pragma inside stdlib, and to always compile with OpenMP. That way, if users just call stdlib, it will run in serial. But if they introduce a parallel region themselves, then stdlib will run in parallel out of the box. I like this approach a lot.
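In user code this would look something like the following sketch, where stdlib_routine is a placeholder name for any stdlib procedure that contains only orphaned worksharing constructs:

! Serial: no parallel region is active at the call site.
call stdlib_routine(a)

! Parallel: the user owns the region, and the orphaned !$omp do inside
! stdlib_routine binds to it.
!$omp parallel
call stdlib_routine(a)
!$omp end parallel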
That's exactly what I meant: our stdlib subroutines would have parallel features, but without opening parallel regions themselves, so that parallelism stays under the user's control.
Agreed, this was my idea: let the user decide how parallelism is enabled :)
What happens when a user calls a function (which they want to parallelize and control the parallelization of) which then calls a stdlib function?
program test
  !$ use omp_lib
  implicit none
  integer :: i
  real :: a(5)
  a = [(i, i=1, 5)]
  call expensive_sub()
contains
  subroutine expensive_sub()
    integer :: i
    !$omp parallel
    do i = 1, 100
      ! Other expensive calculations
      call printa(a)
    end do
    !$omp end parallel
  end subroutine expensive_sub
  subroutine printa(a)
    real, intent(in) :: a(:)
    integer :: i, rang
    rang = -1
    !$omp do
    do i = 1, size(a)
      !$ rang = omp_get_thread_num()
      print *, 'value: ', a(i), ' at thread ', rang
    end do
    !$omp end do
  end subroutine
end program
Do you not end up with nested parallelization? Which I doubt is what people expect.
What happens when a user calls a function (which they want to parallelize and control the parallelization of) which then calls a stdlib function? ...
Do you not end up with nested parallelization? Which I doubt is what people expect.
Your example works exactly as intended (only 1 level of parallelism is used).
EDIT: oh sorry, there was an error (you had only parallel, not parallel do), so I didn't see it at first.
However, if you do:
program test
  !$ use omp_lib
  implicit none
  integer :: i
  real :: a(5)
  a = [(i, i=1, 5)]
  !$omp parallel
  call expensive_sub()
  !$omp end parallel
contains
  subroutine expensive_sub()
    integer :: i
    !$omp parallel do
    do i = 1, 100
      ! Other expensive calculations
      call printa(a)
    end do
    !$omp end parallel do
  end subroutine expensive_sub
  subroutine printa(a)
    real, intent(in) :: a(:)
    integer :: i, rang
    rang = -1
    !$omp do
    do i = 1, size(a)
      !$ rang = omp_get_thread_num()
      print *, 'value: ', a(i), ' at thread ', rang
    end do
    !$omp end do
  end subroutine
end program
you'll get nesting. Either we should implement OpenMP in an orphaned way (as proposed), or supply threaded variants of the methods with some common suffix.
EDIT: one can control this with omp parallel do if(...) clauses if we want to disallow too many nested levels, but that may also come as a surprise.
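A minimal sketch of that idea (assuming the if(...) clause is placed on the parallel construct, where the OpenMP standard allows it): the inner region is made inactive whenever we are already inside an active parallel region, so no real nesting occurs.

program nesting_guard
  !$ use omp_lib
  implicit none
  !$omp parallel
  call work()
  !$omp end parallel
contains
  subroutine work()
    integer :: i, lvl
    lvl = 0
    !$ lvl = omp_get_active_level()
    ! The if(...) clause deactivates the inner parallel region (a team of
    ! one thread) when we are already at an active level.
    !$omp parallel do if(lvl == 0)
    do i = 1, 4
      print *, 'iteration ', i, ' at level ', lvl
    end do
    !$omp end parallel do
  end subroutine work
end program nesting_guard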
EDIT: oh sorry, there was an error (you had only parallel, not parallel do), so I didn't see it at first.
Sorry, that was my mistake; I meant parallel do.
Ok, so this can still easily be mitigated.
program test
  !$ use omp_lib
  implicit none
  logical :: nested
  integer :: imax_nested, it, il
  !$omp parallel default(shared), private(it,il,nested,imax_nested)
  nested = omp_get_nested()
  imax_nested = omp_get_max_active_levels()
  it = omp_get_thread_num()
  il = omp_get_active_level()
  !$omp master
  print *, "Allow nested: ", nested, imax_nested
  !$omp end master
  !$omp barrier
  call sub_a(it, il)
  !$omp barrier
  !$omp single
  print *, ''
  flush(6)
  !$omp end single
  call sub_a_limit(it, il)
  !$omp end parallel
contains
  ! Opens a nested parallel region without any limit on nesting.
  subroutine sub_a(ot, ol)
    integer, intent(in) :: ot, ol
    integer :: it, il
    !$omp parallel default(shared), private(it,il)
    it = omp_get_thread_num()
    il = omp_get_active_level()
    call sub_b(ot, ol, it, il)
    !$omp end parallel
  end subroutine sub_a
  ! Same, but caps the number of active levels at the current one, so the
  ! parallel region opened in sub_b stays inactive.
  subroutine sub_a_limit(ot, ol)
    integer, intent(in) :: ot, ol
    integer :: it, il
    !$omp parallel default(shared), private(it,il)
    it = omp_get_thread_num()
    il = omp_get_active_level()
    call omp_set_max_active_levels(il)
    call sub_b(ot, ol, it, il)
    !$omp end parallel
  end subroutine sub_a_limit
  subroutine sub_b(ot, ol, it, il)
    integer, intent(in) :: ot, ol, it, il
    integer :: ct, cl
    !$omp parallel default(shared), private(ct,cl)
    ct = omp_get_thread_num()
    cl = omp_get_active_level()
    print '(a,tr2,5(tr2,i0,"/",i0))', 'sub_b: ', ot, ol, it, il, ct, cl
    !$omp end parallel
  end subroutine sub_b
end program test
I.e., users can call omp_set_max_active_levels(...) to control the number of nested levels.
Perhaps I should clarify the runtime parameters.
Pre OpenMP 5, one should run:
OMP_NUM_THREADS=2 OMP_MAX_ACTIVE_LEVELS=3 OMP_NESTED=true ./a.out
where OMP_MAX_ACTIVE_LEVELS controls the overall number of levels.
In OpenMP 5, OMP_NESTED is deprecated and only OMP_MAX_ACTIVE_LEVELS is needed.
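So, if I read the OpenMP 5 spec correctly, the equivalent run command would then be just:
OMP_NUM_THREADS=2 OMP_MAX_ACTIVE_LEVELS=3 ./a.out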
Hi there, I started learning the OpenMP library a couple of weeks ago and would like to help parallelize and speed up the Fortran standard library codebase.