GeoscienceAustralia / anuga_core

AnuGA for the simulation of the shallow water equation
https://anuga.anu.edu.au

Segmentation fault on tests #132

Closed JamesRamm closed 7 years ago

JamesRamm commented 7 years ago

Hi, I'm getting a segmentation fault when running the tests. I am running on Ubuntu, using a conda environment. My exact installation process was as follows:

conda create -n anuga python=2
source activate anuga
conda install nose numpy scipy matplotlib netcdf4
conda install -c pingucarsti gdal
git clone https://github.com/GeoscienceAustralia/anuga_core.git
cd anuga_core/
python setup.py build
python setup.py install

Then running python runtests.py gives the following output:

$ python runtests.py 
Building, see build.log...
Build OK
Running unit tests for anuga
NumPy version 1.13.1
NumPy relaxed strides checking option: True
NumPy is installed in /home/james/miniconda3/envs/anuga/lib/python2.7/site-packages/numpy
Python version 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:09:15) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
nose version 1.3.7
...............................................................................................Segmentation fault (core dumped)

stoiver commented 7 years ago

@JamesRamm, could you rerun the tests with the -v flag, i.e. python runtests.py -v?

That should at least give us an idea of where the error is.

Which version of Ubuntu are you using?

Then I will try to replicate the error.
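
If the verbose output alone does not pin the crash down, a Python-level traceback at the moment of the fault can help. The following is a minimal sketch using the faulthandler module, which is in the standard library from Python 3.3; on Python 2.7, as used in this thread, it would have to come from the separately installed faulthandler backport, which is an assumption about the environment. The helper name enable_crash_tracebacks is hypothetical and not part of ANUGA.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Ask faulthandler to dump the Python stack on fatal signals such as SIGSEGV.

    `stream` must be a real file (it needs a file descriptor); it defaults to
    sys.stderr. Returns True once the handler is active.
    """
    if stream is None:
        stream = sys.stderr
    if not faulthandler.is_enabled():
        # all_threads=True also prints the stacks of threads other than the
        # one that received the signal.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

Calling enable_crash_tracebacks() near the top of the script that launches the tests should make a segmentation fault print the Python frames that led into the C extension, which complements the test name reported by -v.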

JamesRamm commented 7 years ago

Hi, Ubuntu version 16.10 (yakkety), Linux kernel version 4.8.0-52-generic.

verbose output:

Building, see build.log...
Build OK
Running unit tests for anuga
NumPy version 1.13.1
NumPy relaxed strides checking option: True
NumPy is installed in /home/james/miniconda3/envs/anuga/lib/python2.7/site-packages/numpy
Python version 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:09:15) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
nose version 1.3.7
test_basic_single_line_grid (anuga.abstract_2d_finite_volumes.tests.test_ermapper.Test_ERMapper) ... ok
test_basic_single_line_grid_default_format (anuga.abstract_2d_finite_volumes.tests.test_ermapper.Test_ERMapper) ... ok
test_header_creation (anuga.abstract_2d_finite_volumes.tests.test_ermapper.Test_ERMapper) ... ok
test_write_default_header (anuga.abstract_2d_finite_volumes.tests.test_ermapper.Test_ERMapper) ... ok
test_write_grid (anuga.abstract_2d_finite_volumes.tests.test_ermapper.Test_ERMapper) ... ok
test_write_non_default_header (anuga.abstract_2d_finite_volumes.tests.test_ermapper.Test_ERMapper) ... ok
Most of this test was copied from test_interpolate ... ok
Check sww2csv timeseries at centroid. ... ok
test_sww2csv_gauge_point_off_mesh (anuga.abstract_2d_finite_volumes.tests.test_gauge.Test_Gauge) ... ok
test_sww2csv_gauges1 (anuga.abstract_2d_finite_volumes.tests.test_gauge.Test_Gauge) ... ok
Most of this test was copied from test_interpolate ... ok
This is testing the sww2csv_gauges function, by creating multiple ... ok
Check sww2csv timeseries at centroid, then output the centroid coordinates. ... ok
test_areas (anuga.abstract_2d_finite_volumes.tests.test_general_mesh.Test_General_Mesh) ... ok
test_assert_index_in_nodes - ... ok
test_get_edge_midpoint_coordinates (anuga.abstract_2d_finite_volumes.tests.test_general_mesh.Test_General_Mesh) ... ok
test_get_vertex_coordinates_triangle_id ... ok
test_get_edge_midpoint_coordinates_with_geo_ref (anuga.abstract_2d_finite_volumes.tests.test_general_mesh.Test_General_Mesh) ... ok
test_get_triangles_and_vertices_per_node - ... ok
test_get_triangles_and_vertices_per_node - ... ok
get unique_vertex based on triangle lists. ... ok
test_get_vertex_coordinates (anuga.abstract_2d_finite_volumes.tests.test_general_mesh.Test_General_Mesh) ... ok
test_get_vertex_coordinates_triangle_id ... ok
test_get_vertex_coordinates_with_geo_ref (anuga.abstract_2d_finite_volumes.tests.test_general_mesh.Test_General_Mesh) ... ok
Get connectivity based on triangle lists. ... ok
test_one_degenerate_triangles (anuga.abstract_2d_finite_volumes.tests.test_general_mesh.Test_General_Mesh) ... ok
test_two_degenerate_triangles (anuga.abstract_2d_finite_volumes.tests.test_general_mesh.Test_General_Mesh) ... ok
Check that structures are correct. ... ok
test_dirichlet (anuga.abstract_2d_finite_volumes.tests.test_generic_boundary_conditions.Test_Generic_Boundary_Conditions) ... ok
test_dirichlet_empty (anuga.abstract_2d_finite_volumes.tests.test_generic_boundary_conditions.Test_Generic_Boundary_Conditions) ... ok
Test that boundary object complains if number of ... ok
test_generic (anuga.abstract_2d_finite_volumes.tests.test_generic_boundary_conditions.Test_Generic_Boundary_Conditions) ... ok
test_time (anuga.abstract_2d_finite_volumes.tests.test_generic_boundary_conditions.Test_Generic_Boundary_Conditions) ... ok
test_time_space_boundary (anuga.abstract_2d_finite_volumes.tests.test_generic_boundary_conditions.Test_Generic_Boundary_Conditions) ... ok
test_transmissive (anuga.abstract_2d_finite_volumes.tests.test_generic_boundary_conditions.Test_Generic_Boundary_Conditions) ... ok
test_CFL (anuga.abstract_2d_finite_volumes.tests.test_generic_domain.Test_Domain) ... ok
Test that quantities already set can be added to using ... ok
test_boundary_conditions (anuga.abstract_2d_finite_volumes.tests.test_generic_domain.Test_Domain) ... ok
test_boundary_indices (anuga.abstract_2d_finite_volumes.tests.test_generic_domain.Test_Domain) ... ok
test_conserved_evolved_boundary_conditions (anuga.abstract_2d_finite_volumes.tests.test_generic_domain.Test_Domain) ... ok
test_conserved_quantities (anuga.abstract_2d_finite_volumes.tests.test_generic_domain.Test_Domain) ... ok
Quantity created from other quantities using arbitrary expression ... ok
Domain implements a default first order gradient limiter ... ok
test_rectangular_periodic_and_ghosts (anuga.abstract_2d_finite_volumes.tests.test_generic_domain.Test_Domain) ... ok
test_set_quanitities_to_be_monitored ... ok
Quantity set using arbitrary expression ... ok
Set quantities for sub region ... ok
test_setting_timestepping_method ... ok
test_simple (anuga.abstract_2d_finite_volumes.tests.test_generic_domain.Test_Domain) ... ok
test_update_conserved_quantities (anuga.abstract_2d_finite_volumes.tests.test_generic_domain.Test_Domain) ... ok
test_simple (anuga.abstract_2d_finite_volumes.tests.test_ghost.Test_Domain) ... ok
test_basic_triangle (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_boundary_inputs (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_boundary_inputs_using_all_defaults (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_boundary_inputs_using_one_default (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_boundary_polygon (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_boundary_polygon_II (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
Same as II but vertices ordered differently ... ok
test_boundary_polygon_IIIa - Check pathological situation where ... ok
Reproduce test test_spatio_temporal_file_function_time ... ok
Create a discontinuous mesh (duplicate vertices) ... ok
test_boundary_polygon_VI(self) ... ok
test_boundary_tags (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_build_neighbour_structure_duplicates (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_general_triangle (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_get_intersecting_segments(self): ... ok
test_get_intersecting_segments(self): ... ok
test_get_intersecting_segments(self): ... ok
test_get_intersecting_segments(self): ... ok
test_get_intersecting_segments(self): ... ok
test_get_intersecting_segments(self): ... ok
test_get_intersecting_segments(self): ... ok
test_get_intersecting_segments_coinciding(self): ... ok
test_get_intersecting_segments_partially_coinciding(self): ... ok
test_get_triangle_containing_point (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_get_triangle_neighbours (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_inputs (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test that the radius is calculated correctly by mesh in the case of an equilateral triangle ... ok
test that the radius is calculated correctly by mesh in the case of a right-angled triangle ... ok
get values based on triangle lists. ... ok
test_lone_vertices (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_mesh_and_neighbours (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_mesh_get_boundary_polygon_with_georeferencing ... ok
test_more_triangles (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_rectangular_mesh (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_rectangular_mesh2 (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_rectangular_mesh_basic (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_surrogate_neighbours (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_triangle_inputs (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_two_triangles (anuga.abstract_2d_finite_volumes.tests.test_neighbour_mesh.Test_Mesh) ... ok
test_pmesh2Domain (anuga.abstract_2d_finite_volumes.tests.test_pmesh2domain.Test_pmesh2domain) ... ok
test_pmesh2Domain_instance (anuga.abstract_2d_finite_volumes.tests.test_pmesh2domain.Test_pmesh2domain) ... ok
test_backup_saxpy_centroid_values (anuga.abstract_2d_finite_volumes.tests.test_quantity.Test_Quantity) ... ok
test_both_updates (anuga.abstract_2d_finite_volumes.tests.test_quantity.Test_Quantity) ... ok
test_boundary_allocation (anuga.abstract_2d_finite_volumes.tests.test_quantity.Test_Quantity) ... ok
test_cache_test_set_values_from_file (anuga.abstract_2d_finite_volumes.tests.test_quantity.Test_Quantity) ... Segmentation fault (core dumped)

JamesRamm commented 7 years ago

I've managed to track the seg fault to the following code:

quantity.set_values(filename=ptsfile,
                    attribute_name=att,
                    alpha=0,
                    use_cache=True,
                    verbose=False)

This is at line 1095 of test_quantity.py. I'll do a bit more poking around.

JamesRamm commented 7 years ago

Ok, so I got the debugger out and followed the call stack of that failing test way down to a function called cg_solve_c_precon, which is called by conjugate_gradient.

This is a C extension (cg_ext.c), which I'm not set up to debug, but hopefully this will be of help to you!

EDIT: a little more info. Running the tests with gdb gives the following output:

(gdb) run runtests.py 
Starting program: /home/james/miniconda3/envs/anuga/bin/python runtests.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Building, see build.log...
Build OK
Running unit tests for anuga
NumPy version 1.13.1
NumPy relaxed strides checking option: True
NumPy is installed in /home/james/miniconda3/envs/anuga/lib/python2.7/site-packages/numpy
Python version 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:09:15) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
nose version 1.3.7
...............................................................................................[New Thread 0x7fffd8d62780 (LWP 26691)]
[New Thread 0x7fffd8961800 (LWP 26692)]
[New Thread 0x7fffd8560880 (LWP 26693)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff39060b2 in dcopy_ ()
   from /home/james/miniconda3/envs/anuga/lib/python2.7/site-packages/numpy/core/../../../../libmkl_intel_lp64.so

JamesRamm commented 7 years ago

And the first 10 lines of the stack trace from that seg fault:

#0  0x00007ffff39060b2 in dcopy_ ()
   from /home/james/miniconda3/envs/anuga/lib/python2.7/site-packages/numpy/core/../../../../libmkl_intel_lp64.so
#1  0x00007fffec1645ab in _cg_solve_c_precon (data=0x17eb5a0, colind=0x17ee290, row_ptr=0x178db00, b=0x17ee360, 
    x=0x17ee3e0, imax=1012, tol=1e-08, a_tol=1e-14, M=6, precon=0x17ee3a0) at anuga/utilities/cg_ext.c:321
#2  0x00007fffec164892 in cg_solve_c_precon (self=<optimised out>, args=<optimised out>)
    at anuga/utilities/cg_ext.c:546
#3  0x00007ffff7ad91e5 in call_function (oparg=<optimised out>, pp_stack=0x7fffffff6928) at Python/ceval.c:4352
#4  PyEval_EvalFrameEx (f=<optimised out>, throwflag=<optimised out>) at Python/ceval.c:2989
#5  0x00007ffff7adac3e in PyEval_EvalCodeEx (co=0x7fffec56f0b0, globals=<optimised out>, locals=<optimised out>, 
    args=<optimised out>, argcount=3, kws=0x17edc58, kwcount=3, defs=0x7fffec56d608, defcount=8, closure=0x0)
    at Python/ceval.c:3584
#6  0x00007ffff7ada1f7 in fast_function (nk=<optimised out>, na=3, n=<optimised out>, pp_stack=0x7fffffff6b48, 
    func=0x7fffec9d7c08) at Python/ceval.c:4447
#7  call_function (oparg=<optimised out>, pp_stack=0x7fffffff6b48) at Python/ceval.c:4372
#8  PyEval_EvalFrameEx (f=<optimised out>, throwflag=<optimised out>) at Python/ceval.c:2989
#9  0x00007ffff7adac3e in PyEval_EvalCodeEx (co=0x7fffec9e7a30, globals=<optimised out>, locals=<optimised out>, 
    args=<optimised out>, argcount=3, kws=0x17ec718, kwcount=4, defs=0x7fffec9e4a90, defcount=6, closure=0x0)
    at Python/ceval.c:3584
#10 0x00007ffff7ada1f7 in fast_function (nk=<optimised out>, na=3, n=<optimised out>, pp_stack=0x7fffffff6d68, 
    func=0x7fffec5767d0) at Python/ceval.c:4447

stoiver commented 7 years ago

@JamesRamm great work. The problem seems to be that cg_ext.c defines a function called dcopy, but a function of the same name also exists in /home/james/miniconda3/envs/anuga/lib/python2.7/site-packages/numpy/core/../../../../libmkl_intel_lp64.so. Probably the easiest way out of this is to rename the functions in cg_ext.c that have LAPACK-style names to something more unique. It is still a bit strange that the local functions are not the ones being linked.

JamesRamm commented 7 years ago

That is strange. Running setup.py build produced a bunch of output, including some warnings. I could build again to take a closer look at this.

Will setup.py clean remove the current build?

Changing the name to something more unique seems the easiest way out. cg_ext contains a numpy include, #include "numpy/arrayobject.h", and I wonder if this is perhaps pulling in dcopy somewhere in its references, which then overrides the local one? Although if that were the case, I would expect this issue to crop up on everyone's build.

stoiver commented 7 years ago

I guess the problem is that conda's numpy is linked against libmkl_intel_lp64.so, which contains the LAPACK procedures such as dcopy. The MKL version takes a few extra calling arguments, which no doubt caused the segmentation fault.
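
One way to check this on a given machine is to see which MKL libraries end up mapped into the process once numpy has been imported. The following is a minimal sketch, assuming Linux (it reads /proc/self/maps); the helper names mkl_libraries and report_loaded_mkl are hypothetical and not part of ANUGA or numpy.

```python
def mkl_libraries(maps_text):
    """Return the sorted, de-duplicated paths in maps_text that name an MKL library.

    maps_text is expected to have the layout of /proc/<pid>/maps:
    address perms offset dev inode [pathname]
    """
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # Anonymous mappings have no pathname field, so they are skipped.
        if len(fields) == 6 and "libmkl" in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def report_loaded_mkl():
    """Import numpy, then print every MKL library mapped into this process."""
    import numpy  # noqa: F401  (imported only so its BLAS backend gets loaded)

    with open("/proc/self/maps") as handle:
        for path in mkl_libraries(handle.read()):
            print(path)
```

If report_loaded_mkl() prints a path such as the libmkl_intel_lp64.so seen in the gdb output, numpy in that environment is MKL-backed, and a dcopy symbol from MKL could plausibly be picked up in place of the one in cg_ext.c. An empty result would suggest a non-MKL build.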

stoiver commented 7 years ago

To rebuild, use python setup.py build --force

JamesRamm commented 7 years ago

Ok, I managed to get it to work by removing the MKL optimisations:

conda remove mkl
conda install nomkl

Then numpy, scipy, matplotlib, netcdf4 and gdal need to be reinstalled. The tests will now run through (I do get 26 failures though!).

I imagine this means that the conda instructions (and maybe install_conda.sh?) need updating to account for this. On a fresh install, conda remove mkl is not necessary; you just need to install nomkl before installing numpy.

However, perhaps it is desirable to support the MKL extensions; they may bring about some performance improvements?

Anaconda docs on the optimisations and how to uninstall are here: https://docs.continuum.io/mkl-optimizations/

stoiver commented 7 years ago

Fixed in PR #140