IntelLabs / HPAT.jl

High Performance Analytics Toolkit (HPAT) is a Julia-based framework for big data analytics on clusters.

Errors in Logistic Regression example #21

Open · samuel100 opened this issue 7 years ago

samuel100 commented 7 years ago

When running the logistic regression example, a result is returned, but there are also some errors:

--------------------------------------------------------------------------
[[1031,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: juliabox

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Distributed-memory MPI mode.
OpenMP is not used.
/home/samkemp/.julia/v0.5/ParallelAccelerator/src/../deps/generated/cgen_output0.cpp:3:18: fatal error: hdf5.h: No such file or directory
compilation terminated.
OptFramework failed to optimize function ##logistic_regression#271 in optimization pass ParallelAccelerator.Driver.toCGen with error ErrorException("failed process: Process(`mpic++ -O3 -std=c++11 -g -fpic -c -o /home/samkemp/.julia/v0.5/ParallelAccelerator/src/../deps/generated/cgen_output0.o /home/samkemp/.julia/v0.5/ParallelAccelerator/src/../deps/generated/cgen_output0.cpp`, ProcessExited(1)) [1]")
result = Float32[141.851 138.706 140.859 139.793 137.871 133.967 138.595 132.593 143.106 139.658]

It looks like hdf5.h cannot be found (I have added its location to my PATH, but it still cannot be found). There also appears to be something going awry in ParallelAccelerator.
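
Note that PATH is only searched for executables, so adding the header location there does not affect the compile step; mpic++ is a wrapper around the system C++ compiler, which looks for headers on its include search path instead. A rough sketch of what might work, assuming the Ubuntu hdf5/openmpi package layout (untested; exact paths may differ):

# Find where the HDF5 header actually lives (the location varies by distro/package):
find /usr/include -name hdf5.h

# Add that directory to the compiler's include search path; CPATH is honoured by g++
# and therefore by the mpic++ wrapper that ParallelAccelerator invokes:
export CPATH=/usr/include/hdf5/openmpi:$CPATH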

When running on 2 processes, there is a segmentation fault (there should not be a memory issue here, since I was watching top while the process was running and had plenty of spare memory). Error below:

[[1235,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: juliabox

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
[juliabox:12714] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[juliabox:12714] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Distributed-memory MPI mode.
OpenMP is not used.
/home/samkemp/.julia/v0.5/ParallelAccelerator/src/../deps/generated/cgen_output0.cpp:3:18: fatal error: hdf5.h: No such file or directory
compilation terminated.
OptFramework failed to optimize function ##logistic_regression#271 in optimization pass ParallelAccelerator.Driver.toCGen with error ErrorException("failed process: Process(`mpic++ -O3 -std=c++11 -g -fpic -c -o /home/samkemp/.julia/v0.5/ParallelAccelerator/src/../deps/generated/cgen_output0.o /home/samkemp/.julia/v0.5/ParallelAccelerator/src/../deps/generated/cgen_output0.cpp`, ProcessExited(1)) [1]")
result = Float32[96.872 93.1305 95.2714 95.0097 98.0419 99.3909 94.9121 95.041 96.7065 91.6394]
j2c_array_new called with invalid key 1julia: /home/samkemp/.julia/v0.5/ParallelAccelerator/src/../deps/generated/cgen_output0.cpp:177: void* j2c_array_new(int, void*, unsigned int, int64_t*): Assertion `false' failed.

signal (6): Aborted
while loading /home/samkemp/.julia/v0.5/HPAT/examples/logistic_regression.jl, in expression starting on line 78
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7fb08f082bd6)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
j2c_array_new at /home/samkemp/.julia/v0.5/ParallelAccelerator/src/../deps/generated/cgen_output0.cpp:177
#9 at /home/samkemp/.julia/v0.5/ParallelAccelerator/src/driver.jl:309
to_j2c_array at /home/samkemp/.julia/v0.5/ParallelAccelerator/src/j2c-array.jl:146
to_j2c_array at /home/samkemp/.julia/v0.5/ParallelAccelerator/src/j2c-array.jl:121 [inlined]
##_pplogistic_regressionp271_j2c_proxy#293 at /home/samkemp/.julia/v0.5/ParallelAccelerator/src/driver.jl:374
unknown function (ip: 0x7fae6888a649)
jl_call_method_internal at /build/julia-Fy046j/julia-0.5.0/src/julia_internal.h:189 [inlined]
jl_apply_generic at /build/julia-Fy046j/julia-0.5.0/src/gf.c:1942
logistic_regression at /home/samkemp/.julia/v0.5/CompilerTools/src/OptFramework.jl:598
unknown function (ip: 0x7fae7ff5ffd9)
jl_call_method_internal at /build/julia-Fy046j/julia-0.5.0/src/julia_internal.h:189 [inlined]
jl_apply_generic at /build/julia-Fy046j/julia-0.5.0/src/gf.c:1942
main at /home/samkemp/.julia/v0.5/HPAT/examples/logistic_regression.jl:73
unknown function (ip: 0x7fae7ff383cf)
jl_call_method_internal at /build/julia-Fy046j/julia-0.5.0/src/julia_internal.h:189 [inlined]
jl_apply_generic at /build/julia-Fy046j/julia-0.5.0/src/gf.c:1942
do_call at /build/julia-Fy046j/julia-0.5.0/src/interpreter.c:66
eval at /build/julia-Fy046j/julia-0.5.0/src/interpreter.c:190
jl_toplevel_eval_flex at /build/julia-Fy046j/julia-0.5.0/src/toplevel.c:558
jl_parse_eval_all at /build/julia-Fy046j/julia-0.5.0/src/ast.c:717
jl_load at /build/julia-Fy046j/julia-0.5.0/src/toplevel.c:596
jl_load_ at /build/julia-Fy046j/julia-0.5.0/src/toplevel.c:605
include_from_node1 at ./loading.jl:488
unknown function (ip: 0x7fb08a7bad5b)
jl_call_method_internal at /build/julia-Fy046j/julia-0.5.0/src/julia_internal.h:189 [inlined]
jl_apply_generic at /build/julia-Fy046j/julia-0.5.0/src/gf.c:1942
process_options at ./client.jl:262
_start at ./client.jl:318
unknown function (ip: 0x7fb08a7e0378)
jl_call_method_internal at /build/julia-Fy046j/julia-0.5.0/src/julia_internal.h:189 [inlined]
jl_apply_generic at /build/julia-Fy046j/julia-0.5.0/src/gf.c:1942
unknown function (ip: 0x40185c)
unknown function (ip: 0x4012f6)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401348)
Allocations: 52424920 (Pool: 52419236; Big: 5684); GC: 96

signal (15): Terminated
while loading no file, in expression starting on line 0
nanosleep at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
usleep at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
ompi_mpi_finalize at /usr/lib/libmpi.so.12 (unknown line)
pmpi_finalize__ at /usr/lib/libmpi_mpifh.so.12 (unknown line)
HPAT_finalize at /home/samkemp/.julia/v0.5/HPAT/src/HPAT.jl:305
unknown function (ip: 0x7f971cca837f)
jl_call_method_internal at /build/julia-Fy046j/julia-0.5.0/src/julia_internal.h:189 [inlined]
jl_apply_generic at /build/julia-Fy046j/julia-0.5.0/src/gf.c:1942
_atexit at ./initdefs.jl:114
unknown function (ip: 0x7f993ecc4c78)
jl_call_method_internal at /build/julia-Fy046j/julia-0.5.0/src/julia_internal.h:189 [inlined]
jl_apply_generic at /build/julia-Fy046j/julia-0.5.0/src/gf.c:1942
jl_apply at /build/julia-Fy046j/julia-0.5.0/src/julia.h:1392 [inlined]
jl_atexit_hook at /build/julia-Fy046j/julia-0.5.0/src/init.c:244
unknown function (ip: 0x4012ff)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401348)
unknown function (ip: 0xffffffffffffffff)
Allocations: 52777806 (Pool: 52771943; Big: 5863); GC: 97

signal (11): Segmentation fault
while loading no file, in expression starting on line 0
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 12717 on node juliabox exited on signal 6 (Aborted).
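
For reference, the two-process run above corresponds to an mpirun launch along these lines (the exact command may differ on other setups):

mpirun -np 2 julia /home/samkemp/.julia/v0.5/HPAT/examples/logistic_regression.jl
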
samuel100 commented 7 years ago

To solve both errors outlined above, I took the following steps:

  1. Copy the libhdf5.so file into /usr/lib, i.e.

sudo cp /usr/lib/x86_64-linux-gnu/hdf5/openmpi/libhdf5.so /usr/lib/

  2. Copy all the header files from the hdf5/openmpi include directory into the ParallelAccelerator src/../deps/generated folder, i.e.

cp /usr/include/hdf5/openmpi/*.h /home/samkemp/.julia/v0.5/ParallelAccelerator/src/../deps/generated/
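
A sketch of an alternative that avoids copying files around (untested, and assuming the same Ubuntu package layout as the paths above) is to point the compiler and dynamic loader at the packaged HDF5 locations via environment variables before running the example:

# Headers (instead of copying the *.h files into deps/generated, step 2):
export CPATH=/usr/include/hdf5/openmpi:$CPATH

# Library search paths at link time and run time (instead of copying libhdf5.so into /usr/lib, step 1):
export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/hdf5/openmpi:$LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/hdf5/openmpi:$LD_LIBRARY_PATH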

ehsantn commented 7 years ago

Thanks for posting your solutions, @samuel100. Installing MPI and HDF5 and making sure HPAT picks them up properly in a portable way seems challenging. Maybe we need CMake.