UT-CHG / BET

Python package for data-consistent stochastic inverse and forward problems.
http://ut-chg.github.io/BET

mpirun -n 3 nosetests #333

Closed: mathematicalmichael closed this issue 5 years ago

mathematicalmichael commented 5 years ago

Are the nosetests expected to pass with more than 2 processors? Our Travis file only tests this one example, and I recently started trying other processor counts with mixed success. Are some of the tests too small to split across that many processors? @smattis

I've tried this in a number of environments, and it's always the same story: fresh clone, tests pass for n=1 and 2, error from 3 onwards (1 error). With 8 processors, I started to get failures=3.

I tried checking out b8b558137fbcf6cfa9eb2f312e65761ef9a1bd46, from before any of my commits, and it was happening there as well.

All the way back at 937da494dab6ebe03b6ce2faa762fb23d7e5da5f, when Python 2 was the only supported version, it errored out as well.

I'm entering the testing stage of my modules and stumbled across this when I accidentally typed a 3 instead of a 2.

Since this error appears to be persistent throughout, I will continue to test with -n 2.
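
For reference, the kind of guard that "too small to split" would call for looks roughly like this. This is a minimal sketch, assuming mpi4py is available; it is not taken from BET's actual test suite, and the class and sample sizes are made up:

```python
# Hypothetical sketch: skip a parallel test when there are more MPI
# ranks than samples to distribute.
import unittest
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

class TestSmallSampleSet(unittest.TestCase):
    def setUp(self):
        # a deliberately small sample set: 2 samples in 1D
        self.samples = np.linspace(0.0, 1.0, 2).reshape(-1, 1)

    def test_split_across_ranks(self):
        if comm.size > self.samples.shape[0]:
            self.skipTest("fewer samples than MPI ranks; nothing to scatter")
        # np.array_split never errors, but empty chunks would land on the
        # extra ranks if we did not skip above
        local = np.array_split(self.samples, comm.size)[comm.rank]
        self.assertGreater(local.shape[0], 0)
```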

mathematicalmichael commented 5 years ago

The commit 052ef518cf18c9d411add3768f0093b2b03a48b9, right before my first ever change, is also failing in the same way (in Python 2; in fact, I couldn't get the tests to pass there with 2 processors either). @eecsu, have you ever run into this before by any chance (failing nosetests with more than 2 processors)?

mathematicalmichael commented 5 years ago

@smattis is this something I shouldn't worry about?

smattis commented 5 years ago

I wonder if this is related to #289

mathematicalmichael commented 5 years ago

Is there anything I can do to help diagnose this?

smattis commented 5 years ago

I am just seeing some bizarre behavior overall in the parallel code with this build. It could be somewhat sensitive to your processor types and MPI distribution. I will play around with it a bit.

mathematicalmichael commented 5 years ago

I think it is related to #289. I sat down at another computer and started developing locally instead of in my cloud environment, and kept running into parallel loading failures even with -n 2 processors; the sampling module, specifically. @lcgraham

mathematicalmichael commented 5 years ago

I wonder whether the fact that test files aren't cleaned up after tests complete could be related. If I run serial tests after some parallel ones fail, the serial ones will continue to fail until I clean up all the 1tox, testfilex, etc. files.
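
A rough cleanup sketch for that, where the glob patterns are guesses based on the file names mentioned above:

```python
# Remove leftover test artifacts between runs; the "1to*" / "testfile*"
# patterns are assumptions based on the names observed above.
import glob
import os

def clean_test_artifacts(directory="."):
    for pattern in ("1to*", "testfile*"):
        for path in glob.glob(os.path.join(directory, pattern)):
            try:
                os.remove(path)
            except OSError:
                pass  # another process may have removed it already

if __name__ == "__main__":
    clean_test_artifacts()
```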

mathematicalmichael commented 5 years ago

@smattis I'm still getting unbelievably unstable behavior with parallel saving/loading (seemingly different results every time I try), and it's making debugging build errors in parallel very difficult.

For example, I can never get the adaptiveSampling tests to pass locally on any of my dev machines/environments, yet they always pass in Travis; meanwhile, Travis will have parallel tests break that pass on my machine (oddly, those seem unrelated to loading). I'll get some nonsense like:

======================================================================
ERROR: Test saving and loading of discretization
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mpilosov/Dropbox/Coding/academic/BET/test/test_sample.py", line 882, in test_save_load_discretization
    sample.save_discretization(self.disc, file_name, "TEST", globalize)
  File "/home/mpilosov/Dropbox/Coding/academic/BET/bet/sample.py", line 1513, in save_discretization
    discretization_name + attrname, globalize)
  File "/home/mpilosov/Dropbox/Coding/academic/BET/bet/sample.py", line 89, in save_sample_set
    new_mdat = sio.loadmat(local_file_name)
  File "/home/mpilosov/anaconda3/envs/py37/lib/python3.7/site-packages/scipy-1.2.1-py3.7-linux-x86_64.egg/scipy/io/matlab/mio.py", line 208, in loadmat
    matfile_dict = MR.get_variables(variable_names)
  File "/home/mpilosov/anaconda3/envs/py37/lib/python3.7/site-packages/scipy-1.2.1-py3.7-linux-x86_64.egg/scipy/io/matlab/mio5.py", line 272, in get_variables
    hdr, next_position = self.read_var_header()
  File "/home/mpilosov/anaconda3/envs/py37/lib/python3.7/site-packages/scipy-1.2.1-py3.7-linux-x86_64.egg/scipy/io/matlab/mio5.py", line 231, in read_var_header
    raise TypeError('Expecting miMATRIX type here, got %d' % mdtype)
TypeError: Expecting miMATRIX type here, got 0

----------------------------------------------------------------------
Ran 209 tests in 8.660s

FAILED (errors=1)

Then I run the same test again and it passes. In other words, there is some weird asynchronous behavior that I do not understand.
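
One generic pattern that can reduce this kind of file race is to make every rank finish writing before any rank reads files written by the others. This is purely a sketch of that pattern, not what bet.sample actually does, and the file names here are made up:

```python
# Write-then-read barrier sketch with per-rank .mat files.
import numpy as np
import scipy.io as sio
from mpi4py import MPI

comm = MPI.COMM_WORLD

# hypothetical per-rank file name, just for illustration
local_file = "chunk_rank{}.mat".format(comm.rank)
sio.savemat(local_file, {"values": np.full(3, comm.rank, dtype=float)})

# Without this barrier, rank 0 can try to load files that other ranks are
# still writing, which is one way to end up with truncated/garbled .mat
# reads like the "Expecting miMATRIX type" error above.
comm.Barrier()

if comm.rank == 0:
    gathered = [sio.loadmat("chunk_rank{}.mat".format(r))["values"]
                for r in range(comm.size)]
    print(np.concatenate([g.ravel() for g in gathered]))
```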

mathematicalmichael commented 5 years ago

I've merged @lcgraham's master branch, resolved conflicts, and then tinkered for hours. The good news is that I have -n 3 working now. The bad news is that our tests still don't pass for -n 4, but that error is a new one, and not in adaptive sampling.

FAIL: Check save_sample_set and load_sample_set.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mpilosov/Dropbox/Coding/academic/BET/test/test_sample.py", line 1378, in test_save_load
    curr_attr)
  File "/home/mpilosov/anaconda3/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 904, in assert_array_equal
    verbose=verbose, header='Arrays are not equal')
  File "/home/mpilosov/anaconda3/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 752, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not equal

(shapes (0,), (0, 0) mismatch)
 x: array([], dtype=float64)
 y: array([], shape=(0, 0), dtype=float64)

That seems fixable...
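
For reference, the mismatch is just numpy's assert_array_equal refusing to treat empty arrays of different shapes as equal. A tiny illustration (the variable names and the normalization at the end are assumptions, not necessarily the fix BET ended up with):

```python
import numpy as np
from numpy.testing import assert_array_equal

saved = np.empty((0,))       # e.g. what came back from disk on this run
original = np.empty((0, 0))  # what the test expected

try:
    assert_array_equal(saved, original)
except AssertionError as err:
    print(err)  # reports the (0,) vs (0, 0) shape mismatch

# One possible normalization: treat two empty arrays as equivalent
# regardless of shape, and only compare shapes when data is present.
if not (saved.size == 0 and original.size == 0):
    assert_array_equal(saved, original)
```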

Then once that is fixed, I need to see if the tests still pass when I enforce lines 182-186 in test_adaptiveSampling.

mathematicalmichael commented 5 years ago

-n 4 seems to fail because that particular test has num = 3, so that now makes sense...

It is the rectangle set that is causing this. This caps our tests at 3 processors.

I've extended the test to handle a user-specified maximum number of processors; I'm choosing 8. The rectangles will be generated on the diagonal, with exactly one per processor. The nearest-neighbor test (query) will query a point close to the corner of each rectangle in order, so that the indices come out as 0, 1, 2, etc. (or flipped, due to how the rectangle set is instantiated).
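
A rough sketch of that geometry, with illustrative names and coordinates rather than the actual test code:

```python
# One unit rectangle per processor along the diagonal, with a query point
# just inside each rectangle's lower-left corner so nearest-neighbor
# indices come back in order 0, 1, 2, ...
import numpy as np
from scipy.spatial import cKDTree

nprocs = 8  # user-specified maximum number of processors for the test

# rectangle centers at (i + 0.5, i + 0.5), i.e. unit boxes on the diagonal
centers = np.array([[i + 0.5, i + 0.5] for i in range(nprocs)], dtype=float)

# query points near the lower-left corner of each rectangle
queries = np.array([[i + 0.1, i + 0.1] for i in range(nprocs)], dtype=float)

_, indices = cKDTree(centers).query(queries)
assert np.array_equal(indices, np.arange(nprocs))
```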

mathematicalmichael commented 5 years ago

I had to fix the ball sample set as well as the rectangle one; now both work for n > 4.

mathematicalmichael commented 5 years ago

test_sample now passes with up to 8 processors with the settings I put in. It seems like this limit can be made as high as desired by tweaking self.nprocs, which I added to the Cartesian/ball/rectangle sample set tests. Having merged #289 into a dev branch of my own, I'm getting closer to fixing this...

mathematicalmichael commented 5 years ago

Addressed by https://github.com/UT-CHG/BET/pull/355; closing now.