TangLaoya / parafem

Automatically exported from code.google.com/p/parafem

Seg fault with OpenMPI 1.4.1 and gcc 4.4.0 #8

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

To reproduce the problem:
1. Compile parafem problem p121 (svn revision 424) with openmpi 1.4.1 and gcc 4.4.0 gfortran
2. Run with ed4-small-c3d2 data (input generated by running sg12mg ed4-small-c3d20.mg)

Parafem starts to execute then seg faults:

Output file from SGE
--------------------

PE no:        1 nels_pp:     2000
PE no:        3 nels_pp:     2000PE no:        2 nels_pp:     2000

PE no:        4 nels_pp:     2000
 HEADER =*ELEMENTS                                         
PE no:        1 neq_pp:    24590
PE no:        3 neq_pp:    24590
PE no:        2 neq_pp:    24590
PE no:        4 neq_pp:    24590
No. of elements of PE    2 required by PE    1:     2110
No. of elements of PE    3 required by PE    4:     2110
No. of elements of PE    1 required by PE    2:     1530
No. of elements of PE    2 required by PE    3:     1820
No. of elements of PE    3 required by PE    2:     1820
No. of elements of PE    1 required by PE    1:    24590
 Total number of unique elements required
 i.e. length of pl_pp required:        26700No. of elements of PE    4 required 
by PE    3:     1530

PE:    1 Number of remote PEs required:        1
PE:    1 Accesses - remote, local, remote/local:   2110 24590    0.09 From      1 PEs
No. of elements of PE    4 required by PE    4:    24590
 Total number of unique elements required
 i.e. length of pl_pp required:        26700
PE:    4 Number of remote PEs required:        1
PE:    4 Accesses - remote, local, remote/local:   2110 24590    0.09 From      1 PEs
No. of elements of PE    2 required by PE    2:    24590
 Total number of unique elements required
 i.e. length of pl_pp required:        27940
PE:    2 Number of remote PEs required:        2
No. of elements of PE    3 required by PE    3:    24590
 Total number of unique elements required
 i.e. length of pl_pp required:        27940
PE:    3 Number of remote PEs required:        2
PE:    2 Accesses - remote, local, remote/local:   3350 24590    0.14 From      2 PEs
PE:    3 Accesses - remote, local, remote/local:   3350 24590    0.14 From      2 PEs
Average accesses ratio - remote/local:     0.11
Total remote accesses                :    10920
Average remote accesses per PE       :  1820.00
PE:    1 request sent to PE:     2
PE:    2 request sent to PE:     1
PE:    2 request sent to PE:     3
PE:    4 request sent to PE:     3PE:    3 request sent to PE:     2

PE:    3 request sent to PE:     4
Number of elements of PE    4 required by PE    3:     1530Number of elements of
 PE    1 required by PE    2:     1530Number of elements of PE    2 required by 
PE    1:     2110
Number of elements of PE    2 required by PE    3:     1820
PE:    4 Number of PEs to send data to:    1

PE:    2 Number of PEs to send data to:    2

PE:    1 Number of PEs to send data to:    1
Number of elements of PE    3 required by PE    2:     1820
Number of elements of PE    3 required by PE    4:     2110
PE:    3 Number of PEs to send data to:    2

Error:

[cat4:29356] *** Process received signal ***
[cat4:29356] Signal: Segmentation fault (11)
[cat4:29356] Signal code: Address not mapped (1)
[cat4:29356] Failing at address: (nil)
[cat4:29356] [ 0] /lib64/libpthread.so.0 [0x3c3620eb10]
[cat4:29356] [ 1] ./p121(MAIN__+0x321a) [0x4063ea]
[cat4:29356] [ 2] ./p121(main+0x2a) [0x428aaa]
[cat4:29356] [ 3] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c35a1d994]
[cat4:29356] [ 4] ./p121 [0x403109]
[cat4:29356] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 29356 on node cat4 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Output from uname -a
--------------------

[mcbicjb2@redqueen parafem_tests]$ uname -a
Linux redqueen.rcs.manchester.ac.uk 2.6.18-194.11.4.el5 #1 SMP Tue Sep 21 06:46:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Original issue reported on code.google.com by Jonathan3145 on 18 Mar 2011 at 1:03

GoogleCodeExporter commented 9 years ago
Thank you for posting this problem. I've had various issues with the gfortran 
compiler in the past and abandoned using it. gfortran appeared to be very 
buggy. I'll have another look. 

Original comment by drmarge...@gmail.com on 18 Mar 2011 at 1:09

GoogleCodeExporter commented 9 years ago
Hi, this might be related to the p121 call of find_g3() which caused seg faults 
on HECToR too. This has since been fixed so please try again with the latest 
revision.

Regards,
Louise

Original comment by louise.m.lever@gmail.com on 13 Apr 2011 at 4:06

GoogleCodeExporter commented 9 years ago
This might be related to issue 12 found by Seid at NCSA. Worth trying the 
suggested temporary fix and reporting back.

Original comment by drmarge...@gmail.com on 25 May 2011 at 12:04

GoogleCodeExporter commented 9 years ago
Thanks, and spot on. Adding 

IF(fixed_freedoms == 0) THEN
   fixed_freedoms_pp = 0
END IF

as suggested solved the problem.

Original comment by Jonathan3145 on 25 May 2011 at 3:33
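
One plausible reading of why this guard helps, sketched as a standalone program: if fixed_freedoms_pp is only assigned when fixed_freedoms is greater than zero, then a later test on it reads an undefined value, and the code may go on to touch an array that was never allocated. The program below is an illustration under that assumption, not an excerpt from p121. The array no_pp and the overall control flow are hypothetical, and only the names fixed_freedoms and fixed_freedoms_pp come from this thread.

```fortran
PROGRAM fixed_freedoms_guard
  ! Hypothetical, self-contained sketch; not taken from the ParaFEM source.
  IMPLICIT NONE
  INTEGER :: fixed_freedoms          ! global count, as read from the input
  INTEGER :: fixed_freedoms_pp       ! per-process count
  INTEGER, ALLOCATABLE :: no_pp(:)   ! hypothetical per-process work array

  fixed_freedoms = 0                 ! this test case has no fixed freedoms

  ! Assumed structure: the per-process count is only set, and the array
  ! only allocated, when fixed freedoms exist.
  IF (fixed_freedoms > 0) THEN
    fixed_freedoms_pp = fixed_freedoms   ! stand-in for the real partitioning
    ALLOCATE(no_pp(fixed_freedoms_pp))
    no_pp = 0
  END IF

  ! The fix from this thread: give the per-process count a defined value.
  IF (fixed_freedoms == 0) THEN
    fixed_freedoms_pp = 0
  END IF

  ! Without the fix, this test would read an undefined variable and could
  ! fall into the branch that writes to the unallocated array.
  IF (fixed_freedoms_pp > 0) THEN
    no_pp(1) = 1
    PRINT *, "Applying", fixed_freedoms_pp, "fixed freedoms"
  ELSE
    PRINT *, "No fixed freedoms on this process"
  END IF
END PROGRAM fixed_freedoms_guard
```

The content of an undefined variable depends on the compiler and on whatever was left in memory, which would be consistent with the same source running cleanly under one compiler and seg faulting under another.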

GoogleCodeExporter commented 9 years ago
Marvelous! When the change is incorporated into the version on SVN, I'll close 
these issues.

Original comment by drmarge...@gmail.com on 26 May 2011 at 6:29