CovertLab / wcEcoli

Whole Cell Model of E. coli

the parallel Fitter does not produce results equivalent to the serial Fitter #196

Closed 1fish2 closed 5 years ago

1fish2 commented 6 years ago

This comment on #187 points out that 1-core and 8-core Fitter runs created kb/simData_Fit_1.cPickle files of different sizes. This is dubious.

Are the results equivalent? Does it just mean some data fields are in a different order and data-dependent encoding (compression or variable-length offsets or whatever) accounts for the different sizes?

  1. Make these objects support an == test so it's straightforward to test that the cPickle contents are functionally equivalent (see the sketch after this list).
  2. Write a test program. It will be slow, although we could speed it up by using the Fitter's debug feature if we make that feature select a consistent transcription factor rather than an arbitrary one.
  3. Run the test in the nightly build, or the PR build, or maybe less often.
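
For point 1, a minimal sketch of what the == support could look like (not the actual wcEcoli implementation; a real version would also need to recurse into nested objects and Unum values):

import numpy as np

def _values_equal(a, b):
    # numpy arrays need array_equal; everything else falls back to ==.
    if isinstance(a, np.ndarray) or isinstance(b, np.ndarray):
        return np.array_equal(a, b)
    return a == b

class ComparableMixin(object):
    # Hypothetical mixin giving sim_data-style objects a field-by-field ==.
    def __eq__(self, other):
        return (type(self) is type(other)
                and set(vars(self)) == set(vars(other))
                and all(_values_equal(value, vars(other)[key])
                        for key, value in vars(self).items()))

    def __ne__(self, other):  # Python 2 needs __ne__ defined explicitly
        return not self.__eq__(other)
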
jmason42 commented 6 years ago

Agreed on all three points. Nightly build would probably be frequent enough.

I'm not sure how practical it is (@tahorst would have a much better idea), but functionalizing these operations would make it easier to avoid serial/parallel differences and would likely cut down on parallelization overhead (right now a huge chunk of memory has to be copied in/out). If functionalization isn't practical, then maybe we need to trim sim_data down. Ideally sim_data would strictly be model parameters and raw_data would be, as the name implies, data taken directly from primary sources and loaded into Python objects. (A third piece which often gets associated with both objects is 'structural' data like the reaction stoichiometries, which aren't parameters in the normal sense.)

Now that I think about it, maybe raw_data should be a module instead of a class/instance, since it is (or should be) static. Then it's much easier to load piecemeal (which is great for e.g. parallel operations that don't need to see most of the raw_data). In contrast, one big sim_data object makes more sense because those parameter values all belong together and need to be passed simultaneously to initialize a simulation. Just some food for thought: raw_data and sim_data were originally the same object.
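
A sketch of the module-style raw_data idea (the file names and layout here are hypothetical): each table loads lazily and is cached at module level, so a parallel worker only reads the files it actually touches.

import csv
import os

_FLAT_DIR = os.path.join('reconstruction', 'ecoli', 'flat')  # hypothetical location
_cache = {}

def load_table(name):
    # Lazily load one TSV table and memoize it at module level.
    if name not in _cache:
        with open(os.path.join(_FLAT_DIR, name + '.tsv')) as f:
            _cache[name] = list(csv.DictReader(f, delimiter='\t'))
    return _cache[name]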

tahorst commented 6 years ago

I think supporting == and running the comparison as part of the build that gets run any time a new commit is added to master would be good. That build already runs a serial and a parallel fitter task, so it would be a simple comparison. I think they should be equivalent, but it's possible something gets set within the function, so functionalizing everything would make things clearer.

1fish2 commented 6 years ago

As @jmason42 points out elsewhere, seeding the random number generators differently is one potential cause for different parallel/serial Fitter results.
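
A sketch of the seeding discipline that avoids this (the function and task names are hypothetical): derive each task's RandomState from an explicit per-task seed, so the random stream is identical whether the tasks run serially or in a worker pool.

import numpy as np
from multiprocessing import Pool

def fit_one_tf(args):
    tf_index, base_seed = args
    # Each task builds its own RandomState from a deterministic per-task seed
    # instead of sharing a generator whose call order depends on scheduling.
    random_state = np.random.RandomState(base_seed + tf_index)
    return random_state.random_sample(3)

if __name__ == '__main__':
    tasks = [(i, 42) for i in range(8)]
    serial = map(fit_one_tf, tasks)
    parallel = Pool(4).map(fit_one_tf, tasks)
    assert all(np.array_equal(s, p) for s, p in zip(serial, parallel))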

1fish2 commented 5 years ago

The parallel Fitter does not produce equivalent output, at least not on macOS, judging by the output from diff_simouts.py.

Below, out/manual/ has the output from a parallel Fitter run with 12 worker processes followed by one simulation generation, while out/x4/ has the output from a serial Fitter run followed by one simulation generation.

Parallel Fitter run: 3m 44s. Serial Fitter run: 15m 55s. Speedup: 4.3x.

Comparing: ('out/manual/wildtype_000000/000000/generation_000000/000000/simOut/', 'out/x4/wildtype_000000/000000/generation_000000/000000/simOut/')

{'BulkMolecules/counts': Arrays are not equal (mismatch 7.04940792673e-06%) x: array([[0, 0, 0, ..., 1, 1, 1], [0, 0, 0, ...,...,
 'EnzymeKinetics/actualFluxes': Arrays are not equal (mismatch 6.95512548153%) x: array([[0.000000e+00, 0.000000e+00, 0.000000e+00, ...,
 'EnzymeKinetics/countsToMolar': Arrays are not equal (mismatch 0.271923861319%) x: array([[0.000000e+00], [1.364496e-06], [1.364496e...,
 'EnzymeKinetics/metaboliteConcentrations': Arrays are not equal (mismatch 0.271923861319%) x: array([[0.000000e+00, 0.000000e+00, 0.000000e+00,...,
 'EnzymeKinetics/metaboliteCountsFinal': Arrays are not equal (mismatch 0.00192853802353%) x: array([[0.000000e+00, 0.000000e+00, 0.000000e+0...,
 'EnzymeKinetics/metaboliteCountsInit': Arrays are not equal (mismatch 0.00192853802353%) x: array([[0.000000e+00, 0.000000e+00, 0.000000e+0...,
 'EnzymeKinetics/targetFluxes': Arrays are not equal (mismatch 0.263426240653%) x: array([[0.000000e+00, 0.000000e+00, 0.000000e+00,...,
 'FBAResults/deltaMetabolites': Arrays are not equal (mismatch 0.000723201758831%) x: array([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ....,
 'FBAResults/externalExchangeFluxes': Arrays are not equal (mismatch 15.9860330017%) x: array([[ 0.000000e+00, 0.000000e+00, 0.000000e+00,...,
 'FBAResults/homeostaticObjectiveValues': Arrays are not equal (mismatch 0.00289280703531%) x: array([[0. , 0. , 0. , ..., 0. , 0. , 0. ], [0....,
 'FBAResults/kineticObjectiveValues': Arrays are not equal (mismatch 6.20166978246%) x: array([[ 0. , 0. , 0. , ..., 0. , 0. , 0. ], [ 0. ...,
 'FBAResults/objectiveValue': Arrays are not equal (mismatch 99.9660095173%) x: array([[0. ], [0.021562], [0.016488],... y: array(...,
 'FBAResults/reactionFluxes': Arrays are not equal (mismatch 7.96058783244%) x: array([[0. , 0. , 0. , ..., 0. , 0. , 0. ], [0. , ...,
 'FBAResults/reducedCosts': Arrays are not equal (mismatch 22.7123313665%) x: array([[0.000000e+00, 0.000000e+00, 0.000000e+00, ...,
 'FBAResults/shadowPrices': Arrays are not equal (mismatch 98.7348790566%) x: array([[ 0.000000e+00, 0.000000e+00, 0.000000e+00,...,
 'GrowthLimits/aaRequestSize': Arrays are not equal (mismatch 0.249263539542%) x: array([[0.000000e+00, 0.000000e+00, 0.000000e+00,...,
 'Main@endTime': (u'2018-11-28 14:09:16', u'2018-11-27 23:53:58'),
 'Main@startTime': (u'2018-11-28 14:00:03', u'2018-11-27 23:45:34'),
 'Mass/cellMass': Arrays are not equal (mismatch 0.271923861319%) x: array([[1339.133243], [1339.133239], [1339.133235...,
 'Mass/cellVolume': Arrays are not equal (mismatch 0.271923861319%) x: array([[1.217394], [1.217394], [1.217394],... y: ...,
 'Mass/dryMass': Arrays are not equal (mismatch 0.271923861319%) x: array([[403.237634], [403.32835 ], [403.391711],....,
 'Mass/growth': Arrays are not equal (mismatch 0.238014280857%) x: array([0.090716, 0.063361, 0.043779, ..., 0.16633...,
 'Mass/instantaniousGrowthRate': Arrays are not equal (mismatch 0.306018361102%) x: array([0.001125, 0.000785, 0.000543, ..., 0.00023...,
 'Mass/processMassDifferences': Arrays are not equal (mismatch 0.0183025675888%) x: array([[ 0.000000e+00, 0.000000e+00, 0.000000e+0...,
 'Mass/relProcessMassDifferences': Arrays are not equal (mismatch 0.0235318726141%) x: array([[ 0.000000e+00, 0.000000e+00, 0.000000e+0...,
 'Mass/waterMass': Arrays are not equal (mismatch 0.271923861319%) x: array([[ 935.89561 ], [ 935.804889], [ 935.741524...,
 'ReplicationData/criticalMassPerOriC': Arrays are not equal (mismatch 0.271923861319%) x: array([[0. ], [0.686735], [0.686735],... y: array...,
 'RnaDegradationListener/DiffRelativeFirstOrderDecay': Arrays are not equal (mismatch 20.4962610469%) x: array([[0. ], [0.094397], [0.098627],... y: array(...,
 'RnaDegradationListener/FractionActiveEndoRNases': Arrays are not equal (mismatch 0.271923861319%) x: array([[0. ], [0.050895], [0.050949],... y: array...,
 'RnaSynthProb/rnaSynthProb': Arrays are not equal (mismatch 86.7990600192%) x: array([[0. , 0. , 0. , ..., 0. , 0. , 0. ], [0. , ...}
tahorst commented 5 years ago

Hmm, looks like the major issue is rnaSynthProb. I'd expect that difference to eventually trickle down to some of the other processes. Any ideas how to troubleshoot? I would guess that one of the parallelized functions in fit_sim_data_1 updates something in place in addition to returning values. It might be difficult to track down exactly where it's different without a tool to compare sim_data objects.

1fish2 commented 5 years ago

Yes, to troubleshoot we need testable hypotheses.

prismofeverything commented 5 years ago

Through analysis we can definitely narrow down the search for where the differences arise. If we want to compare sim_data then https://github.com/CovertLab/wcEcoli/pull/282 is already most of the way there. I can put together a quick tree diff for that; it should tell us where the problem is.

I do know that a function operating in parallel cannot reach back and modify anything in the parent process. At least, that is how multiprocessing works (any necessary communication is mediated entirely through its Queue class), but that library is dead to me now. So that should cut out a whole class of hypotheses.

1fish2 commented 5 years ago

Yes, diff trees would be useful, esp. if computed at intermediate stages.

prismofeverything commented 5 years ago

> Yes, diff trees would be useful, esp. if computed at intermediate stages.

Okay, I am diffing sim_data objects in https://github.com/CovertLab/wcEcoli/pull/395 and I compared a serial fitter output to the parallel fitter output and got no differences. As I mention in the PR, that may be because I am not comparing Unums correctly; I am eliminating that possibility now.

I believe it is this library: https://github.com/trzemecki/Unum. It does not document comparison, but Python == fails with these values, so I am currently using np.array_equal until I find something better.

prismofeverything commented 5 years ago

Ah, found it: you can call number() on them to get the numpy array out. Okay, testing again...
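
For reference, the comparison I'm using now is roughly (a sketch; number() is the accessor mentioned above, and unit checking is omitted):

import numpy as np

def unum_arrays_equal(a, b):
    # number() unwraps the Unum to its underlying numpy array, which
    # np.array_equal can compare element-wise (it also handles shape mismatches).
    return np.array_equal(a.number(), b.number())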

jmason42 commented 5 years ago

Stupid question: are we sure that two runs of the serial fitter are identical? Likewise for the parallel fitter.

1fish2 commented 5 years ago

That's an excellent question!

prismofeverything commented 5 years ago

I have verified that the fitter output (sim_data) for both serial and parallel is identical, at least as far as diff_trees is able to discern.

jmason42 commented 5 years ago

> I have verified that the fitter output (sim_data) for both serial and parallel is identical, at least as far as diff_trees is able to discern.

Huh, that's interesting. I would have expected the parallel fitter to exhibit arbitrary behavior, given these issues. I suppose there is a good chance that the execution order might be the same between two parallel runs. Maybe it would change if we reordered the parallel execution calls (scramble the order, or just reverse it - either would almost guarantee a different execution order).

1fish2 commented 5 years ago

Of 3 serial and 3 parallel Fitter runs on Sherlock, one run produced different results while the other 5 produced equivalent results (as reported by the latest compareFitter.py).

Of 5 serial runs and 5 parallel runs on my MacBook, none of them seemed to produce equivalent results, although I mostly only compared adjacent runs.

Hypothesis: Are these differences at least partly due to a run-time environment sensitivity?

Hypothesis: Is something initializing random number generators differently?

Some Fitter run times

Some Fitter diffs

tahorst commented 5 years ago

It's very strange that some runs are consistent but others aren't. Those execution times in parallel seem way too short; I've never seen anything less than ~8 min or so. Did you check the actual timestamps or just the script total time that gets printed at the end? I've seen some weird stats on that at times.

> Hypothesis: Is something initializing random number generators differently?

Along those lines, are we sure everything is using the proper seed? If there is one thing that isn't using our seeded object, then it could throw everything off downstream. That might be less likely given how many runs matched, but if it comes down to a single stochastically rounded number, it could easily be the cause.
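
For context, a stochastic round is just a coin flip on the fractional part, so a single unseeded (or differently seeded) generator is enough to flip one count and diverge downstream. A sketch (not the wholecell implementation):

import numpy as np

def stochastic_round(values, random_state):
    # Round each value down, then add 1 with probability equal to its
    # fractional part. Two runs agree only if random_state is seeded the same.
    floors = np.floor(values)
    return floors + (random_state.random_sample(values.shape) < values - floors)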

jmason42 commented 5 years ago

It's good to see that the error is (always?) in the last few bits of the mantissa, but troubling that there is any error at all. The MBP-specific issues make me think this is a library thing - maybe a SciPy dependency? I can't make any sense of that Sherlock diff.

@tahorst Good point on the seeding. I had thought there was more random number generation, but it looks like the only RNG is in the calls to the Cythonized wholecell.utils.mc_complexation.mccFormComplexesWithPrebuiltMatrices, which ought to be deterministic (and hopefully platform-independent) given @1fish2's rewrite to use a numpy.random.RandomState object in the Cython code. That said, maybe the Cython code would be a good thing to check - do we have a determinism test for that code?
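
A determinism test could follow a generic pattern like this (fn here is a stand-in that closes over the prebuilt matrices and other arguments, since I'm not spelling out the real call signature):

import numpy as np

def assert_deterministic(fn, seed=0):
    # Call the routine twice with identically seeded RandomState objects and
    # require bit-identical output; any run- or platform-dependent difference fails.
    first = fn(np.random.RandomState(seed))
    second = fn(np.random.RandomState(seed))
    np.testing.assert_array_equal(first, second)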

prismofeverything commented 5 years ago

Hey @1fish2, here is the result of running python runscripts/debug/summarize_environment.py on the MacBook:

multiprocessing 0.70a1
multiprocessing.cpu_count(): 8

numpy 1.14.5
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE

os 
os.environ:
  'BOOST_NUMPY_LIB': --
  'HOME': '/Users/rspangler'
  'LIBRARY_PATH': --
  'PI_HOME': --
  'PYENV_ROOT': '/Users/rspangler/.pyenv'
  'PYTHONPATH': '/Users/rspangler/Code/wcEcoli'
  'SHERLOCK': --
os.getcwd(): /Users/rspangler/Code/wcEcoli
os.uname(): ('Darwin', 'omniomnibus', '17.7.0', 'Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64', 'x86_64')

scipy 1.0.1
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
openblas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

sys 
sys.platform: darwin
sys.prefix: /Users/rspangler/.pyenv/versions/wcEcoli2
sys.version: 2.7.14 (default, Jul 17 2018, 09:59:22) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
sys.api_version: 1013
sys.version_info: sys.version_info(major=2, minor=7, micro=14, releaselevel='final', serial=0)

So it looks like I am still using the openblas build for numpy but Accelerate for scipy, which could account for the differences we are seeing.

I have yet to create any differences caught by diff_trees on either ubuntu or mac now, but cmp does show a difference:

(wcEcoli2) rspangler@omniomnibus:~/Code/wcEcoli$ cmp out/serial/kb/simData_Fit_1.cPickle out/serial2/kb/simData_Fit_1.cPickle 
out/serial/kb/simData_Fit_1.cPickle out/serial2/kb/simData_Fit_1.cPickle differ: char 147089, line 287

This could be dictionary hashing order? Or something else? The contents do not differ at least.
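
A byte-level cmp is stricter than we need; comparing the unpickled contents (assuming the objects support ==, per the earlier discussion) sidesteps any ordering differences in the pickle stream:

import cPickle

def pickles_equivalent(path1, path2):
    # Two pickle files can differ byte-for-byte (e.g. dict iteration order)
    # yet load to equal objects; compare the loaded contents instead.
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        return cPickle.load(f1) == cPickle.load(f2)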

I will run a few sims with these, but I am starting to suspect the Accelerate framework for numpy since that is the only tangible difference between our systems I can think of.

1fish2 commented 5 years ago

> Those execution times in parallel seem way too short. I've never seen anything less than ~8 min or so. Did you check the actual timestamps or just the script total time that gets printed at the end? I've seen some weird stats on that at times.

Indeed, the printed timestamps and the iTerm2 timestamps are larger than those numbers. In the code, time.clock() measures "processor time" -- per the docs, "this is the function to use for benchmarking Python or timing algorithms" -- except when the work happens in subprocesses. (LOL) I'm fixing that.

This is better:

$ python runscripts/manual/runFitter.py p8 -c8
Mon Dec  3 14:45:06 2018: RunFitter at /Users/jerry/dev/wcEcoli/out/p8
{'Arguments': {'cached': False,
               'cpus': 8,
               'debug': False,
               'ribosome_fitting': True,
               'rnapoly_fitting': True,
               'sim_outdir': 'p8',
               'sim_path': '/Users/jerry/dev/wcEcoli/out/p8',
               'verbose': False}}
...
Mon Dec  3 14:52:32 2018: Elapsed time 445.9 secs (0:07:25.926408); 300.9 secs (0:05:00.947440) in process

Hmm, I'll simplify it:

Mon Dec  3 15:24:13 2018: Elapsed time 700.81 secs (0:11:40.811176); 298.89 in process

-c8 took 445.9 secs (300.9 secs in process) with no cached km.cPickle. -c2 took 700.81 secs (298.89 secs in process) with no cached km.cPickle.
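
For illustration, a sketch of why time.clock() misleads here (the workload is a hypothetical stand-in): work done in child processes never counts toward the parent's processor time, so wall-clock time is the number to report for a parallel run.

import time
from multiprocessing import Pool

def heavy(n):  # hypothetical stand-in workload
    return sum(i * i for i in xrange(n))

if __name__ == '__main__':
    start_wall, start_cpu = time.time(), time.clock()
    Pool(8).map(heavy, [10 ** 6] * 64)
    # time.clock() stays near zero because the work happened in the children;
    # time.time() reports the real elapsed duration.
    print('wall %.1f s, parent CPU %.1f s'
          % (time.time() - start_wall, time.clock() - start_cpu))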

1fish2 commented 5 years ago

Key deltas (details below) in summarize_environment.py between my MBP and @prismofeverything's:

The entire computation ought to be deterministic, including dictionary hashing (unless the hash bucket sizes depend on available RAM) and floating point computation. The Sherlock diff suggests there might be more than one cause of variations.

Proposed next steps:

lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
openblas_lapack_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
blis_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE

os.uname(): ('Darwin', 'Jerrys-MacBook-Pro.local', '18.2.0', 'Darwin Kernel Version 18.2.0: Fri Oct  5 19:41:49 PDT 2018; root:xnu-4903.221.2~2/RELEASE_X86_64', 'x86_64')

scipy 1.0.1
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
openblas_info:
  NOT AVAILABLE
...

sys
sys.version: 2.7.15 (default, Jul 26 2018, 17:13:52)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]
1fish2 commented 5 years ago

I meant to say pretty please with sugar on top. :-)

From the Python 2.7.15rc1 release notes:

> Prevent unwanted behavior in _random.Random.seed() in case the argument has a bad abs() method. Patch by Oren Milman.

> Preserve generator state when _random.Random.setstate() raises an exception. Patch by Bryan Olson.

> Support glibc 2.24 on Linux: don't use the getentropy() function but read from /dev/urandom to get random bytes, for example in os.urandom(). On Linux, getentropy() is implemented with getrandom() in blocking mode, whereas os.urandom() should not block.

... and 2 more references to "random".

> Work around a gc.disable() race condition in the subprocess module that could leave garbage collection disabled when multiple threads are spawning subprocesses at once. Users are strongly encouraged to use the subprocess32 module from PyPI on Python 2.7 instead, it is much more reliable.

Issue #166: Use the subprocess32 pip package instead of subprocess. But I think we only use this Python library for little command line calls like "git rev-parse HEAD".

> Update zlib to 1.2.11.

prismofeverything commented 5 years ago

> I'll switch numpy to openblas. Ryan, please send me the instructions you used. Do you have info on how to switch back?
> Ryan, please update to Python 2.7.15 and retest: PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 2.7.15
> Ryan, please update Xcode tools and retest: xcode-select --install

Currently reinstalling my pyenv to update Python to 2.7.15; I must have missed that update on this computer. This does mean I will be back to the Accelerate version of numpy and can test that next. To compile numpy with openblas I followed this: https://stackoverflow.com/questions/11443302/compiling-numpy-with-openblas-integration

> If you're up for switching to Mojave, be sure that FUSE is up to date beforehand and reinstall Xcode tools afterwards.

I don't think I'm ready for Mojave yet... this is an old system and currently stable, so I'm going to leave that alone. Let's see how the tests work out here.

1fish2 commented 5 years ago

Cool.

Whoa, that's a lot of steps to build OpenBLAS and then compile numpy with it! Many ways to go wrong. It'd help if there were pip install options, but I'm not finding any. (The Intel Python Distribution is sure to have all the subtle linkages and dependencies right...)

Some of these treediffs show individual numbers differing in the low order digits. This is different between runs, not between machines, which is puzzling. The numpy array diffs don't show the actual differences, which could be in the elided values or in the rounded digits. compareFitter.py might need adjustable verbosity.

Occasionally, some Matrix derivatives like 'process/equilibrium/derivativesJacobianSymbolic' are present in some runs and absent in others such as my Sherlock runs s1 vs. s2 (group readable in $SCRATCH). Can sporadic non-zero values determine whether to include a Matrix or not?

prismofeverything commented 5 years ago

Yep, getting diffs now. Now the question is: was it going back to accelerate? Or upgrading to python 2.7.15? I will do those steps to compile numpy against openblas and rule out the last possibility.

1fish2 commented 5 years ago

Nice!

Good news:

Not great news:

1fish2 commented 5 years ago

So the output varies between runs, even with the serial Fitter.

I'm changing compareFitter.py to allow some floating point tolerance, but the NumPy functions which support that don't handle inf and NaN values, nor structured dtypes. There are lots of inf values in these arrays.
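
One workaround I'm considering (a sketch, not the compareFitter.py code) is to recurse over structured dtypes field by field, treat NaNs as equal, and require infinities to match in position and sign:

import numpy as np

def close_enough(a, b, rtol=1e-10, atol=0.0):
    a, b = np.asarray(a), np.asarray(b)
    if a.dtype.names:  # structured dtype: compare each named field separately
        return (a.dtype == b.dtype and
                all(close_enough(a[name], b[name], rtol, atol) for name in a.dtype.names))
    if a.dtype.kind in 'fc':  # float/complex: tolerant; NaN==NaN; matching infs pass
        return a.shape == b.shape and bool(
            np.all(np.isclose(a, b, rtol=rtol, atol=atol, equal_nan=True)))
    return np.array_equal(a, b)  # ints, bools, strings: exact match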

I'm seeing the problem frequently on macOS Mojave with Python 2.7.15 + the Accelerate framework, with Python 2.7.14 + the Accelerate framework, and with Python 2.7.15 + Intel's NumPy, SciPy, numba, numexpr, Scikit-learn, and TBB on MKL; and less often on Sherlock with Python 2.7.15 on BLAS.

Investigating this issue, there are some sources of non-determinism:

prismofeverything commented 5 years ago

Okay, just tried with openblas again and python 2.7.15 this time. No diffs. I'll run a couple more but it is starting to look pretty convincing that accelerate is at least one component of the problem.

Not sure what to say about Sherlock except that it has been historically pretty unreliable and I have observed some highly suspicious behavior (a file descriptor being corrupted from one login node but fine from another when accessing the same file, among other things; I have a whole spreadsheet of the lab's collective Sherlock problems). I wouldn't be surprised if the variability on the Sherlock system has a different source than in the Mac environment.

1fish2 commented 5 years ago

That's good to know.

Maybe Mojave or Core i9 is another component of the problem. MKL didn't fare much better than Accelerate. I was looking into sources of nondeterminism.

I should test openblas. Is there a sane way to install it? ... This NumPy Issue says openblas is in the numpy wheels distribution.

BTW, something was wrong with my earlier timing measurements. np.empty() is faster than np.zeros() by about 6x.

jmason42 commented 5 years ago

Couple thoughts, probably not helpful:

prismofeverything commented 5 years ago

> I should test openblas. Is there a sane way to install it?

I just followed the steps in the accepted answer in that SO post.... I had to do some acrobatics to get gfortran installed but otherwise it was straightforward enough. I've done it twice now if that's any testament : )

Also, just checked again on ubuntu (where numpy is built against openblas by default) and I get no diffs.

1fish2 commented 5 years ago

Measuring performance is tricky! Calling np.empty in a loop can discard the array on each iteration and then simply reallocate the same one or two memory nodes. That measurement is independent of the requested array size.

This test shows that numpy.empty() does return an uninitialized array and it can easily and quickly recycle a recent array:

np.arange(100); np.empty(100, int); np.empty(100, int)
np.arange(100, 200.0); np.empty(100, float); np.empty(100, float)

To prevent recycling (measured on MBP i9):

>>> from timeit import timeit

>>> timeit('l.append(numpy.empty(100000))', 'import numpy; l = []')
3.0151820182800293
>>> timeit('l.append(numpy.empty(10000))', 'import numpy; l = []')
1.1399481296539307
>>> timeit('l.append(numpy.empty(10000))', 'import numpy; l = []')
1.244724988937378

>>> timeit('l.append(numpy.zeros(100000))', 'import numpy; l = []')
3.100321054458618
>>> timeit('l.append(numpy.zeros(100000))', 'import numpy; l = []')
3.4001340866088867
>>> timeit('l.append(numpy.zeros(10000))', 'import numpy; l = []')
54.84381604194641  # WAT?
>>> timeit('l.append(numpy.zeros(1000))', 'import numpy; l = []')
5.2339441776275635  # HUH?

Initializing an ndarray from a list can take 500x as long. Setting number=10000 instead of the default 1M:

>>> timeit('l.append(numpy.array(x))', 'import numpy; l = []; x = 100000 * [1.1]', number=10000)
19.110800981521606

(Trying that on Sherlock can easily get a job killed for lack of memory.)

Sherlock seems much slower at zeroing out an array. (I think that happens inside C calloc.)

>>> timeit('l.append(numpy.zeros(10000))', 'import numpy; l = []', number=1000)
0.06631803512573242
>>> timeit('l.append(numpy.zeros(10000))', 'import numpy; l = []', number=1000)
0.06618595123291016

>>> timeit('l.append(numpy.empty(10000))', 'import numpy; l = []', number=1000)
0.00645899772644043
>>> timeit('l.append(numpy.empty(10000))', 'import numpy; l = []', number=1000)
0.006451845169067383

>>> timeit('l.append(numpy.zeros(10001))', 'import numpy; l = []', number=1000)
0.06612896919250488
>>> timeit('l.append(numpy.zeros(10001))', 'import numpy; l = []', number=1000)
0.06485104560852051

>>> timeit('l.append(numpy.empty(10002))', 'import numpy; l = []', number=1000)
0.006468057632446289
>>> timeit('l.append(numpy.empty(10002))', 'import numpy; l = []', number=1000)
0.006554126739501953

>>> timeit('l.append(numpy.zeros(10002))', 'import numpy; l = []', number=1000)
0.06684994697570801
>>> timeit('l.append(numpy.zeros(10002))', 'import numpy; l = []', number=1000)
0.06575298309326172
1fish2 commented 5 years ago

Yay, NumPy + SciPy on OpenBLAS is producing serial and parallel Fitter runs with results that match each other and most Sherlock runs!

Summary:

jmason42 commented 5 years ago

Very interesting. I wonder if library linking issues also explain the discrepancies I saw on Sherlock. Wouldn't shock me - did I tell you two about the time I found different versions of git installed on different nodes?

1fish2 commented 5 years ago

I'm wondering the same thing about the discrepancies on Sherlock.

runscripts/debug/summarize_environment.py prints numpy sections like this on both Sherlock and my local BLAS pyenv:

lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']   # or ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c

while the scipy sections look like this on Sherlock:

lapack_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77

vs. local:

lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c

(Different versions of git on different Sherlock nodes? Ugh. And they're old -- the stash commands are different.)

1fish2 commented 5 years ago

Instructions to fix the non-determinism on your local machine by installing openblas + numpy + scipy are now in the wiki.

In brief: Compile openblas from source [the 0.3.5 release will be easier], then create a ~/.numpy-site.cfg file, then pip install numpy==1.14.5 scipy==1.0.1 --no-binary numpy,scipy.
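
For example, assuming OpenBLAS is installed under /opt/OpenBLAS as in the outputs above, the ~/.numpy-site.cfg looks roughly like:

[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS/lib
include_dirs = /opt/OpenBLAS/include
runtime_library_dirs = /opt/OpenBLAS/lib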

1fish2 commented 5 years ago

Sherlock's pyenv wcEcoli3 now has OpenBLAS compiled from current source (0.3.4+) with NumPy and SciPy compiled for that.

Several serial and parallel runs produced the same output, except that occasionally the derivatives matrices are different -- but they did match an older run. (See the first part of issuecomment-443575466 above for an example.) So scratch my hypothesis about the derivatives matrices.

Is the fixtures/endo_km/km.cPickle part of the computation where this difference comes in?

jmason42 commented 5 years ago

I could be wrong (@ggsun and @heejochoi may know better) but I think those derivatives are generated when instantiating sim_data, and are not modified by the fitter at all. But I've always had an extremely difficult time following this logic. Anyway, if I'm right, then the issue can be tested without running the fitter. Furthermore I'm inclined to believe that the errors crop up in sympy since this is almost the only place where that library is called.

prismofeverything commented 5 years ago

> Yay, NumPy + SciPy on OpenBLAS is producing serial and parallel Fitter runs with results that match each other and most Sherlock runs!

Yeah, nice! So this is the second issue with the "accelerate" framework (the first being the segfaults with multiprocessing). Why is Apple trying to replace openblas? Its aversion to goodness and light? Openblas has been working for decades and is widely considered unimprovable, which Apple's demonstration supports so far. Maybe it's Fortran prejudice ; )

I know I will be compiling everything to use openblas from now on.

Thanks for all the investigation on this @1fish2, time to merge treediff yet?

1fish2 commented 5 years ago

Only in the last week or two did OpenBLAS fix the threading bugs that seemed to lead to non-determinism.

Does that fix the fork segfaults?

Does it fix the parallel slowdown? Intel's tbb can fix the over-subscription problem.

Yes, it's time to merge in the treediff branch.

We could probably use this fix to close the parallel fitter issue but open a new one about the derivatives.

I think Apple pushed Intel and Moto for vector instructions. My guess is Accelerate was designed to use them when nothing else did.

prismofeverything commented 5 years ago

> Does that fix the fork segfaults?

Before, switching to openblas shifted the segfault to later in the operation, so it probably cleared up one of the segfaults and revealed a different one. That said, in that environment I had only compiled numpy against openblas; scipy was still using the Accelerate framework, so perhaps that would fix the problem now that we know about the issues with Accelerate.

1fish2 commented 5 years ago

AFAIK the main problem is fixed by making NumPy and SciPy use OpenBLAS 0.3.4+, so I'll close this Issue. See also