The commit 052ef518cf18c9d411add3768f0093b2b03a48b9, right before my first ever change, is also failing in the same way (in Python 2; in fact I couldn't get tests to pass there with 2 processors either). @eecsu have you ever run into this before by any chance? (failing nosetests with more than 2 processors)
@smattis is this something I shouldn't worry about?
I wonder if this is related to #289
is there anything I can do to help diagnose?
I am just seeing some bizarre behavior overall in parallel stuff with this build. It could be somewhat sensitive to your processor types and MPI distribution. I will play around with it a bit.
I think it is related to #289
I sat down at another computer and started developing locally instead of in my cloud env, and kept running into parallel loading failures even with -n 2 processors, in the sampling module specifically. @lcgraham
The fact that test files aren't cleaned up after tests complete makes me wonder whether it's related. If I run serial tests after some parallel ones fail, the serial ones will continue to fail until I clean up all the 1tox, testfilex, etc. files.
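For what it's worth, here's a minimal cleanup sketch; the glob patterns are assumptions based on the names above, not an exhaustive list of what the tests actually write out:

```python
# Hypothetical helper: delete stale test output before rerunning the suite.
import glob
import os

# assumed patterns, based on the "1tox" / "testfilex" names mentioned above
for pattern in ("1to*", "testfile*"):
    for path in glob.glob(pattern):
        os.remove(path)  # remove leftover files so serial tests start clean
```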
@smattis I'm still getting unbelievably unstable behavior with parallel saving/loading (seemingly different results every time I try), and it's making debugging build errors in parallel very difficult.
For example, I can never get the adaptiveSampling tests to pass locally on any of my dev machines/environments, but they always pass in Travis; meanwhile, Travis will have parallel tests break that pass on my machine (oddly, and seemingly unrelated to loading). Like, I'll get some nonsense like
======================================================================
ERROR: Test saving and loading of discretization
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mpilosov/Dropbox/Coding/academic/BET/test/test_sample.py", line 882, in test_save_load_discretization
    sample.save_discretization(self.disc, file_name, "TEST", globalize)
  File "/home/mpilosov/Dropbox/Coding/academic/BET/bet/sample.py", line 1513, in save_discretization
    discretization_name + attrname, globalize)
  File "/home/mpilosov/Dropbox/Coding/academic/BET/bet/sample.py", line 89, in save_sample_set
    new_mdat = sio.loadmat(local_file_name)
  File "/home/mpilosov/anaconda3/envs/py37/lib/python3.7/site-packages/scipy-1.2.1-py3.7-linux-x86_64.egg/scipy/io/matlab/mio.py", line 208, in loadmat
    matfile_dict = MR.get_variables(variable_names)
  File "/home/mpilosov/anaconda3/envs/py37/lib/python3.7/site-packages/scipy-1.2.1-py3.7-linux-x86_64.egg/scipy/io/matlab/mio5.py", line 272, in get_variables
    hdr, next_position = self.read_var_header()
  File "/home/mpilosov/anaconda3/envs/py37/lib/python3.7/site-packages/scipy-1.2.1-py3.7-linux-x86_64.egg/scipy/io/matlab/mio5.py", line 231, in read_var_header
    raise TypeError('Expecting miMATRIX type here, got %d' % mdtype)
TypeError: Expecting miMATRIX type here, got 0
----------------------------------------------------------------------
Ran 209 tests in 8.660s
FAILED (errors=1)
Then I run the same test again and it passes. In other words, there is some weird async behavior that I do not understand.
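The loadmat error above looks like it could be a read racing a partially written file across ranks. Here is a minimal sketch of the kind of write-then-barrier guard I have in mind, assuming mpi4py; the file name and data are placeholders, not the actual bet.sample code:

```python
# Sketch only: one rank writes, everyone synchronizes, then all ranks read.
from mpi4py import MPI
import scipy.io as sio

comm = MPI.COMM_WORLD

if comm.rank == 0:
    sio.savemat("testfile.mat", {"data": [1, 2, 3]})  # single writer
comm.Barrier()  # ensure the file is fully on disk before anyone reads it
mdat = sio.loadmat("testfile.mat")  # now safe for every rank to load
```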
I've merged @lcgraham's master branch, resolved conflicts, then tinkered for hours.
Good news is that I have -n 3 working now. Bad news is that our tests still don't pass for -n 4, but... that error is a new one, not in adaptive sampling.
FAIL: Check save_sample_set and load_sample_set.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mpilosov/Dropbox/Coding/academic/BET/test/test_sample.py", line 1378, in test_save_load
    curr_attr)
  File "/home/mpilosov/anaconda3/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 904, in assert_array_equal
    verbose=verbose, header='Arrays are not equal')
  File "/home/mpilosov/anaconda3/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 752, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not equal
(shapes (0,), (0, 0) mismatch)
x: array([], dtype=float64)
y: array([], shape=(0, 0), dtype=float64)
that seems fixable...
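To illustrate what I think is going on (an assumed cause, not the exact BET code): round-tripping an empty array can leave it with shape (0, 0) instead of (0,), and assert_array_equal treats that as a mismatch. Flattening both sides before comparing would sidestep it:

```python
import numpy as np
from numpy.testing import assert_array_equal

x = np.array([])        # shape (0,), the in-memory attribute
y = np.empty((0, 0))    # shape (0, 0), e.g. what comes back after save/load

try:
    assert_array_equal(x, y)
except AssertionError as err:
    print(err)          # "(shapes (0,), (0, 0) mismatch)"

# comparing the flattened arrays ignores the spurious shape difference
assert_array_equal(x.ravel(), y.ravel())
```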
Then once that is fixed, I need to see if tests still pass when I enforce lines 182-186 in test_adaptiveSampling.
4 seems to fail because that particular test has num = 3, so that now makes sense...
It is the rectangle set that is causing this. This caps our tests at 3 processors.
I've extended the test to handle a user-specified maximum number of processors; I'm choosing 8. The rectangles will be generated on the diagonal, with exactly one per processor. The nearest-neighbor test (query) will query a point close to the corner of each rectangle in order, so that the indices are 0, 1, 2, etc. (or flipped, due to how the rectangle set is instantiated).
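Roughly the layout I mean, as a stand-alone sketch in plain numpy (not the BET sample set or query API):

```python
import numpy as np

nprocs = 8  # assumed user-specified cap on the number of processors

# one unit rectangle per processor, placed along the diagonal
centers = np.array([[i + 0.5, i + 0.5] for i in range(nprocs)], dtype=float)
# query points just inside a corner of each rectangle, in order
queries = np.array([[i + 0.1, i + 0.1] for i in range(nprocs)], dtype=float)

# brute-force nearest-neighbor lookup: distance from each query to each center
dists = np.linalg.norm(queries[:, None, :] - centers[None, :, :], axis=2)
indices = dists.argmin(axis=1)

# the indices come back as 0, 1, 2, ... (or reversed, depending on how the
# rectangle set orders its cells)
assert (indices == np.arange(nprocs)).all()
```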
Had to fix the ball sample set as well as the rectangle one. Now both work for n > 4.
test_sample now passes up to 8 processors with the settings I put in. Seems like this limit can be made as high as desired by tweaking self.nprocs, which I added to the cartesian/ball/rectangle sample set tests.
Having merged #289 into a dev branch of my own, I'm getting closer to fixing this...
addressed by https://github.com/UT-CHG/BET/pull/355, closing now
Are the nosetests expected to pass for more than 2 processors? Our Travis file only tests this example, and I recently started trying other numbers with mixed success. Are some of the tests too small to split across that many processors? @smattis
I've tried this in a number of environments, always with the same story: fresh clone, tests pass for n=1,2, and error from 3 onward (1 error). For 8 processors, I started to get failures=3.
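In case it helps anyone reproduce this, here is the sort of loop I've been running; this is a hypothetical helper, not part of the repo, and it assumes mpirun and nosetests are on your PATH:

```python
import subprocess

# run the test suite under an increasing number of MPI processes
for n in (1, 2, 3, 4, 8):
    print(f"=== mpirun -n {n} nosetests ===")
    subprocess.run(["mpirun", "-n", str(n), "nosetests"], check=False)
```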
Tried checking out b8b558137fbcf6cfa9eb2f312e65761ef9a1bd46, before any of my commits started, and it was happening there as well.
All the way back to 937da494dab6ebe03b6ce2faa762fb23d7e5da5f, back when Python 2 was the only supported version, it errored out as well.
I'm entering the testing stages of my modules and stumbled across this when I accidentally hit a 3 instead of a 2.
Since this error appears to be persistent throughout, I will continue to test with -n 2.