fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License
1.09k stars 168 forks

ENH: Using OpenMP #213

Open miladsade96 opened 4 years ago

miladsade96 commented 4 years ago

Hi there, I started learning the OpenMP library a couple of weeks ago, and I would like to parallelize and speed up the Fortran standard library codebase.

certik commented 4 years ago

OpenMP is specified using pragmas, which are comments, so if the file is compiled without OpenMP enabled, they will just be ignored. So I think it's fine if we allow OpenMP in our code base: if users do not want stdlib to be parallelized using OpenMP, they can simply compile the stdlib files without OpenMP enabled.
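
For readers unfamiliar with that mechanism, here is a minimal sketch (not stdlib code, just an illustration): the !$ and !$omp sentinel lines below are ordinary Fortran comments unless the file is compiled with OpenMP support (e.g. gfortran -fopenmp):

program sentinel_demo
  !$ use omp_lib
  implicit none
  integer :: nthreads

  nthreads = 1                          ! serial default
  !$ nthreads = omp_get_max_threads()   ! only compiled when OpenMP is enabled
  print *, 'max threads: ', nthreads
end program sentinel_demo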

However, some others think that stdlib should not use OpenMP, and that users should rather do the parallelization themselves:

https://github.com/fortran-lang/stdlib/pull/189/files#r426173077

It's not clear to me, though, how to do that in the context of the csr_matvec routine: if it is to run in parallel, it needs to be parallelized from the inside. Perhaps we could provide two versions of the subroutine, one serial and one parallel.
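
For concreteness, a minimal sketch of the "parallelized from inside" variant (the csr_matvec argument list below is hypothetical, not the interface from PR #189):

subroutine csr_matvec_parallel(n, ia, ja, a, x, y)
  ! Hypothetical CSR sparse matrix-vector product y = A*x, with the
  ! parallel region opened inside the routine itself.
  implicit none
  integer, intent(in) :: n        ! number of rows
  integer, intent(in) :: ia(:)    ! row pointers, size n+1
  integer, intent(in) :: ja(:)    ! column indices of the nonzeros
  real, intent(in)    :: a(:)     ! nonzero values
  real, intent(in)    :: x(:)
  real, intent(out)   :: y(:)
  integer :: i, j

  !$omp parallel do private(j)
  do i = 1, n
    y(i) = 0.0
    do j = ia(i), ia(i+1) - 1
      y(i) = y(i) + a(j) * x(ja(j))
    end do
  end do
  !$omp end parallel do
end subroutine csr_matvec_parallel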

@zerothi, do you want to discuss it more here?

jvdp1 commented 4 years ago

If OpenMP is used in stdlib, I think it should be with orphaned procedures. The user can then control the parallelization (e.g. to call a stdlib procedure inside or outside a parallel region). Using parallel regions inside stdlib procedures will limit their utility.

certik commented 4 years ago

Can you give an example? What is an orphaned procedure?

jvdp1 commented 4 years ago

Here is an example with an orphaned subroutine (the example is contrived, but it illustrates an OpenMP orphaned procedure):

program test
  !$ use omp_lib
  implicit none
  integer :: i
  real :: a(5)

  a = [(i, i=1, 5)]

  print*,' Outside a parallel region'
  call printa(a)

  print*,' Inside a parallel region'
  !$omp parallel
  call printa(a)
  !$omp end parallel

contains

subroutine printa(a)
  real, intent(in) :: a(:) 

  integer :: i, rang

  rang = -1

  !$omp do
  do i=1,size(a)
    !$ rang =  omp_get_thread_num()
    print*,'value: ',a(i),' at thread ', rang
  end do
  !$omp end do

end subroutine

end program

Compiled without OpenMP, the output is:

 Outside a parallel region
 value:    1.00000000      at thread           -1
 value:    2.00000000      at thread           -1
 value:    3.00000000      at thread           -1
 value:    4.00000000      at thread           -1
 value:    5.00000000      at thread           -1
  Inside a parallel region
 value:    1.00000000      at thread           -1
 value:    2.00000000      at thread           -1
 value:    3.00000000      at thread           -1
 value:    4.00000000      at thread           -1
 value:    5.00000000      at thread           -1

Compiled with OpenMP (and run with 3 threads):

 Outside a parallel region
 value:    1.00000000      at thread            0
 value:    2.00000000      at thread            0
 value:    3.00000000      at thread            0
 value:    4.00000000      at thread            0
 value:    5.00000000      at thread            0
  Inside a parallel region
 value:    3.00000000      at thread            1
 value:    4.00000000      at thread            1
 value:    1.00000000      at thread            0
 value:    2.00000000      at thread            0
 value:    5.00000000      at thread            2

certik commented 4 years ago

Ah I see, that's what @zerothi meant in the comment at https://github.com/fortran-lang/stdlib/pull/189/files#r426173077. The idea is to use OpenMP, but never open an omp parallel region inside stdlib, and always compile stdlib with OpenMP enabled. That way, if users just call stdlib, it will run serially. But if they introduce a parallel region themselves, then stdlib will run in parallel out of the box. I like this approach a lot.
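
To make the approach concrete, here is a sketch of the same hypothetical csr_matvec written as an orphaned construct (again, the interface is illustrative, not the one from PR #189):

subroutine csr_matvec(n, ia, ja, a, x, y)
  ! Orphaned variant: the !$omp do binds to whatever parallel region (if
  ! any) encloses the call site, so a plain call runs serially.
  implicit none
  integer, intent(in) :: n, ia(:), ja(:)
  real, intent(in)    :: a(:), x(:)
  real, intent(out)   :: y(:)
  integer :: i, j

  !$omp do private(j)
  do i = 1, n
    y(i) = 0.0
    do j = ia(i), ia(i+1) - 1
      y(i) = y(i) + a(j) * x(ja(j))
    end do
  end do
  !$omp end do
end subroutine csr_matvec

A caller who wants it to run in parallel simply wraps the call:

!$omp parallel
call csr_matvec(n, ia, ja, a, x, y)
!$omp end parallel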

miladsade96 commented 4 years ago

Here is an example with an orphaned subroutine ...

That's exactly what I meant: our stdlib subroutines would have parallel features, but without opening parallel regions themselves, so that the user stays in control of the parallelism.

zerothi commented 4 years ago

Agreed, this was my idea, let the user decide how parallelism is enabled :)

rjfarmer commented 4 years ago

What happens when a user calls a function (which they want to parallelize and control the parallelization of) which then calls a stdlib function?

program test
  !$ use omp_lib
  implicit none
  integer :: i
  real :: a(5)

  call expensive_sub()

contains

subroutine expensive_sub()
  integer :: i

  !$omp parallel
  do i=1,100
  ! Other expensive calculations
  call printa(a)
  end do
  !$omp end do

end subroutine expensive_sub

subroutine printa(a)
  real, intent(in) :: a(:) 

  integer :: i, rang

  rang = -1

  !$omp do
  do i=1,size(a)
    !$ rang =  omp_get_thread_num()
    print*,'value: ',a(i),' at thread ', rang
  end do
  !$omp end do

end subroutine

end program

Do you not end up with nested parallelization, which I doubt is what people expect?

zerothi commented 4 years ago

What happens when a user calls a function (which they want to parallelize and control the parallelization of) which then calls a stdlib function? ...

Do you not end up with nested parallelization, which I doubt is what people expect?

Your example works exactly as intended (only one level of parallelism is used). EDIT: oh sorry, there was an error: you didn't have parallel do, only parallel, so I didn't see it.

However, if you do:

program test
  !$ use omp_lib
  implicit none
  integer :: i
  real :: a(5)

  a = [(i, i=1, 5)]

  !$omp parallel
  call expensive_sub()
  !$omp end parallel
contains

subroutine expensive_sub()
  integer :: i

  !$omp parallel do
  do i=1,100
  ! Other expensive calculations
  call printa(a)
  end do
  !$omp end parallel do

end subroutine expensive_sub

subroutine printa(a)
  real, intent(in) :: a(:) 

  integer :: i, rang

  rang = -1

  !$omp do
  do i=1,size(a)
    !$ rang =  omp_get_thread_num()
    print*,'value: ',a(i),' at thread ', rang
  end do
  !$omp end do

end subroutine

end program

you'll get nesting. Either we should implement OpenMP in an orphaned way (as proposed), or supply threaded variants of the procedures with some common suffix.
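
A minimal sketch of the "threaded variants with a common suffix" alternative (vector_scale / vector_scale_omp are made-up names, not proposed stdlib API):

module demo_threaded_variants
  implicit none
contains

  ! Serial variant: contains no OpenMP directives at all.
  subroutine vector_scale(x, s)
    real, intent(inout) :: x(:)
    real, intent(in)    :: s
    x = s * x
  end subroutine vector_scale

  ! Threaded variant, distinguished by a common suffix: it opens its own
  ! parallel region, so the caller does not need one.
  subroutine vector_scale_omp(x, s)
    real, intent(inout) :: x(:)
    real, intent(in)    :: s
    integer :: i

    !$omp parallel do
    do i = 1, size(x)
      x(i) = s * x(i)
    end do
    !$omp end parallel do
  end subroutine vector_scale_omp

end module demo_threaded_variants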

EDIT: an if(...) clause (e.g. on omp parallel do) could be used to disallow too many nested levels, but that may also come as a surprise.
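
To illustrate that if(...) idea (the clause belongs on a parallel construct such as parallel do; the one-active-level threshold below is purely an example):

subroutine expensive_sub_guarded(a)
  !$ use omp_lib
  implicit none
  real, intent(in) :: a(:)
  integer :: i

  ! Spawn a new team only when no parallel region is already active;
  ! otherwise the loop runs on the calling thread.
  !$omp parallel do if(omp_get_active_level() < 1)
  do i = 1, size(a)
    print *, 'value: ', a(i)
  end do
  !$omp end parallel do
end subroutine expensive_sub_guarded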

rjfarmer commented 4 years ago

EDIT: oh sorry, there was an error: you didn't have parallel do, only parallel, so I didn't see it.

Sorry, that was my mistake; I meant parallel do.

zerothi commented 4 years ago

Ok, so this can still easily be mitigated.

program test
!$ use omp_lib
  implicit none

  logical :: nested
  integer :: imax_nested, it, il

!$omp parallel default(shared), private(it,il,nested,imax_nested)

  nested = omp_get_nested()
  imax_nested = omp_get_max_active_levels()

  it = omp_get_thread_num()
  il = omp_get_active_level()

!$omp master
  print *, "Allow nested: ",nested, imax_nested
!$omp end master

!$omp barrier

  call sub_a(it,il)

!$omp barrier
!$omp single
  print *,''
  flush(6)
!$omp end single

  call sub_a_limit(it,il)

!$omp end parallel

contains

  subroutine sub_a(ot, ol)
    integer, intent(in) :: ot, ol
    integer :: it, il

!$omp parallel default(shared), private(it,il)
    it = omp_get_thread_num()
    il = omp_get_active_level()
    call sub_b(ot, ol, it, il)
!$omp end parallel

  end subroutine sub_a

  subroutine sub_a_limit(ot, ol)
    integer, intent(in) :: ot, ol
    integer :: it, il

!$omp parallel default(shared), private(it,il)
    it = omp_get_thread_num()
    il = omp_get_active_level()
    call omp_set_max_active_levels(il)
    call sub_b(ot, ol, it, il)
!$omp end parallel

  end subroutine sub_a_limit

  subroutine sub_b(ot, ol, it, il)
    integer, intent(in) :: ot, ol, it, il
    integer :: ct, cl

!$omp parallel default(shared), private(ct,cl)
    ct = omp_get_thread_num()
    cl = omp_get_active_level()
    print '(a,tr2,5(tr2,i0,"/",i0))','sub_b: ', ot, ol, it, il, ct, cl
!$omp end parallel

  end subroutine sub_b

end program test

I.e. users can call omp_set_max_active_levels(...) to control the number of nested levels.

Perhaps I should clarify the runtime parameters used.

Pre OpenMP 5, one should run:

OMP_NUM_THREADS=2 OMP_MAX_ACTIVE_LEVELS=3 OMP_NESTED=true ./a.out

where OMP_MAX_ACTIVE_LEVELS controls the overall number of levels. In OpenMP 5, OMP_NESTED is deprecated and only OMP_MAX_ACTIVE_LEVELS is needed.
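
With an OpenMP 5 runtime, the equivalent should therefore simply be:

OMP_NUM_THREADS=2 OMP_MAX_ACTIVE_LEVELS=3 ./a.out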