tahorst opened this issue 4 years ago
I've found simData from runParca.py to differ between my local machine and Google Cloud machines, and between different Google Cloud machines (one running Debian, the other Ubuntu). That was true even before the py3 migration.
Hypotheses:
Serial vs parallel results were the same locally and on Sherlock, so it seems to be reproducible on a given machine, which is good, but not reproducible between machines. It is interesting that this showed up before py3 on Google Cloud, but it was at least consistent on my machine vs Sherlock before the switch. Certainly some good hypotheses, Jerry!
I checked again and it looks like we might have consistent parca outputs (maybe it was a library issue with the recent update to Theano), but there is still a sim diff between the two. It looks like we've run into another OpenBLAS thread issue. I pushed a branch that does a dot product (the arrays were pulled from a sim that was producing different results) and produces different results depending on the OPENBLAS_NUM_THREADS environment variable. I'm not sure if there's an easy way to turn this into a unit test, since I don't think we can change the number of threads after loading numpy, but this can at least show the problem.
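The branch's test.py is presumably something like this sketch (the real arrays were captured from a sim; the .npy filenames here are stand-ins):

```python
#!/usr/bin/env python
# Sketch of test-openblas/test.py: print one dot product so the caller
# can vary OPENBLAS_NUM_THREADS per process invocation.
import numpy as np

x = np.load('x.npy')  # stand-ins for the arrays captured from a sim
y = np.load('y.npy')
print(x.dot(y))
```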
Local machine:
$ cd test-openblas
$ for c in {1..12}; do OPENBLAS_NUM_THREADS=$c ./test.py; done
0.016683805584112754
0.016683805584112615
0.016683805584112702
0.016683805584112643
0.016683805584112643
0.016683805584112647
0.016683805584112532
0.016683805584112532
0.016683805584112647
0.016683805584112643
0.01668380558411267
0.016683805584112536
Sherlock with wcEcoli3 and 8 CPUs:
$ for c in {1..12}; do OPENBLAS_NUM_THREADS=$c ./test.py; done
0.016683805584112754
0.016683805584112615
0.016683805584112702
0.016683805584112643
0.016683805584112643
0.016683805584112647
0.016683805584112532
0.016683805584112532
0.016683805584112532
0.016683805584112532
0.016683805584112532
0.016683805584112532
The old pyenv (wcEcoli2) on Sherlock gave consistent results:
$ for c in {1..12}; do OPENBLAS_NUM_THREADS=$c ./test.py; done
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
Awesome! A compact, repeatable test case is usually the hardest step in debugging.
We could probably make a unit test that uses multiprocessing to exec a process with a different OPENBLAS_NUM_THREADS.
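A minimal sketch of that idea (hypothetical, not the committed test): since OPENBLAS_NUM_THREADS can't change after numpy loads, each thread count gets a fresh subprocess.

```python
# Hypothetical sketch: run the dot product in a subprocess per thread
# count, because OPENBLAS_NUM_THREADS must be set before numpy loads.
import os
import subprocess
import sys

SNIPPET = ("import numpy as np; np.random.seed(0); "
           "x = np.random.random(100000); print(repr(x.dot(x)))")

def dot_with_threads(n_threads):
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(n_threads))
    out = subprocess.check_output([sys.executable, '-c', SNIPPET], env=env)
    return float(out.decode())

def test_openblas_threads():
    reference = dot_with_threads(1)
    for n in range(2, (os.cpu_count() or 1) + 1):
        assert dot_with_threads(n) == reference, '%d threads differ' % n
```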
The environment variable NUM_THREADS supposedly affects compiling OpenBLAS by setting the max number of threads. Otherwise it tries to figure that out from the hardware it compiles on, which means we should be careful to not compile it on a Sherlock login node.
I turned test-openblas into a unit test.
Here's the output on my Mac, running with Numpy's embedded OpenBLAS:
$ pytest wholecell/tests/utils/test_openblas_threads.py
...
_____________________________________________________________________ test_openblas _____________________________________________________________________
def test_openblas():
"""Compare a dot product running with various numbers of OpenBLAS threads."""
...
> np.testing.assert_allclose(np_products, reference, rtol=1e-17)
E AssertionError:
E Not equal to tolerance rtol=1e-17, atol=0
E
E Mismatched elements: 11 / 12 (91.7%)
E Max absolute difference: 1.38777878e-16
E Max relative difference: 8.31811887e-15
E x: array([0.016684, 0.016684, 0.016684, 0.016684, 0.016684, 0.016684,
E 0.016684, 0.016684, 0.016684, 0.016684, 0.016684, 0.016684])
E y: array([0.016684, 0.016684, 0.016684, 0.016684, 0.016684, 0.016684,
E 0.016684, 0.016684, 0.016684, 0.016684, 0.016684, 0.016684])
wholecell/tests/utils/test_openblas_threads.py:46: AssertionError
----------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------
THREADS DOT PRODUCT DIFF
1 0.016683805584112754 0.0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.204170427930421e-17
4 0.016683805584112643 -1.1102230246251565e-16
5 0.016683805584112643 -1.1102230246251565e-16
6 0.016683805584112647 -1.0755285551056204e-16
7 0.016683805584112647 -1.0755285551056204e-16
8 0.016683805584112647 -1.0755285551056204e-16
9 0.016683805584112647 -1.0755285551056204e-16
10 0.016683805584112647 -1.0755285551056204e-16
11 0.016683805584112647 -1.0755285551056204e-16
12 0.016683805584112647 -1.0755285551056204e-16
Nice! I guess now we just need to find versions of openblas that produce the right results, and then we can add that unit test. We could also check the number of CPUs available and only run up to that number instead of running 12 each time.
I made a small update to the branch to include a test with OPENBLAS_NUM_THREADS unset, and to only run up to the number of available CPUs.
Cool.
I'm testing this in various environments, so far just on my Mac. (One snag: a recreated Python 2.7 environment fails with Symbol not found: ___addtf3. I don't know if that's somehow a consequence of recreating the environment recently or if it never worked. Various cleanup steps like deleting the pip cache haven't helped.)

@U8NWXD were your tests on GCloud in a Docker Container? What was OPENBLAS_NUM_THREADS set to?
Q: Parallelizing vector dot presumably divides the two vectors into one chunk per thread and adds their portions of the dot product. Floating point rounding will thus vary with the number of chunks. How much variation should be acceptable? Where to set the np.testing.assert_allclose(rtol=?, atol=?)?
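For intuition, here's a standalone demo (not project code) of why chunked summation shifts the last bits; exact magnitudes depend on the data and chunk boundaries:

```python
# Floating-point addition is not associative, so summing per-chunk
# partial dot products in different groupings gives slightly different
# answers -- the same effect as varying BLAS thread counts.
import numpy as np

rng = np.random.RandomState(0)
x = rng.random_sample(100000)
y = rng.random_sample(100000)

def chunked_dot(x, y, n_chunks):
    """Dot product as a sum of per-chunk partials, mimicking a threaded BLAS."""
    bounds = np.linspace(0, len(x), n_chunks + 1).astype(int)
    return sum(np.dot(x[a:b], y[a:b]) for a, b in zip(bounds[:-1], bounds[1:]))

for n in (1, 2, 4, 8):
    print(n, chunked_dot(x, y, n) - np.dot(x, y))
# Typically prints tiny nonzero differences for most n > 1.
```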
Some measurements:
[1] (wcEcoli3)
THREADS DOT PRODUCT DIFF
1 0.016683805584112754 0.0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.204170427930421e-17
4 0.016683805584112643 -1.1102230246251565e-16
5 0.016683805584112643 -1.1102230246251565e-16
6 0.016683805584112647 -1.0755285551056204e-16
7 0.016683805584112647 -1.0755285551056204e-16
8 0.016683805584112647 -1.0755285551056204e-16
9 0.016683805584112647 -1.0755285551056204e-16
10 0.016683805584112647 -1.0755285551056204e-16
11 0.016683805584112647 -1.0755285551056204e-16
12 0.016683805584112647 -1.0755285551056204e-16
[2] Quitting Docker Desktop didn't change that at all, even though it might've freed up some CPUs.
[3] (wcEcoli2)
1 0.016683805584112754 0.000000000000000000
2 0.016683805584112615 -0.000000000000000139
3 0.016683805584112702 -0.000000000000000052
4 0.016683805584112643 -0.000000000000000111
5 0.016683805584112643 -0.000000000000000111
6 0.016683805584112647 -0.000000000000000108
7 0.016683805584112647 -0.000000000000000108
8 0.016683805584112647 -0.000000000000000108
9 0.016683805584112647 -0.000000000000000108
10 0.016683805584112647 -0.000000000000000108
11 0.016683805584112647 -0.000000000000000108
12 0.016683805584112647 -0.000000000000000108
[4] (wcEcoli5) on Python 3.8.5; I don't remember whether it has numpy & scipy installed from binary vs. source linked to brew openblas. I deleted it when trying to rebuild wcEcoli3 but will recreate it.
1 0.016683805584112754 0.000000000000000000
2 0.016683805584112615 -0.000000000000000139
3 0.016683805584112702 -0.000000000000000052
4 0.016683805584112643 -0.000000000000000111
5 0.016683805584112643 -0.000000000000000111
6 0.016683805584112647 -0.000000000000000108
7 0.016683805584112647 -0.000000000000000108
8 0.016683805584112647 -0.000000000000000108
9 0.016683805584112647 -0.000000000000000108
10 0.016683805584112647 -0.000000000000000108
11 0.016683805584112647 -0.000000000000000108
12 0.016683805584112647 -0.000000000000000108
[5] In the wcm Docker Image, with NO_AVX2=1:
THREADS DOT PRODUCT DIFF
1 0.016683805584112754 0.0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.204170427930421e-17
4 0.016683805584112643 -1.1102230246251565e-16
5 0.016683805584112643 -1.1102230246251565e-16
6 0.016683805584112702 -5.204170427930421e-17
7 0.016683805584112702 -5.204170427930421e-17
8 0.016683805584112702 -5.204170427930421e-17
9 0.016683805584112702 -5.204170427930421e-17
10 0.016683805584112702 -5.204170427930421e-17
11 0.016683805584112702 -5.204170427930421e-17
12 0.016683805584112702 -5.204170427930421e-17
[6] wcEcoli3 rebuilt with numpy, scipy no-binary, linked to brew-installed openblas
No success yet installing a working numpy this way.
[7] wcEcoli3 rebuilt with numpy, scipy no-binary, linked to manually-compiled openblas with NO_AVX2=1
TODO
> Q: Parallelizing vector dot presumably divides the two vectors into one chunk per thread and adds their portions of the dot product. Floating point rounding will thus vary with the number of chunks. How much variation should be acceptable? Where to set the np.testing.assert_allclose(rtol=?, atol=?)?
Maybe this is a problem that can't be solved if we allow numpy parallelization, since we'll always have floating point rounding issues. I think the main issue arises from the stochastic nature of our sims. Any bit of floating point imprecision could lead to a slightly different random result, which produces one molecule instead of another, which can greatly amplify the difference (even a value off by only 1e-17 might produce an essential protein that is needed vs some other protein). For that reason, I think we'd need zero tolerance for the difference. wcEcoli2 on Sherlock produced the same results no matter how many threads were specified, but that might just be because that OpenBLAS library was installed to only use 1 thread, not because the other OpenBLAS versions tried are free of the issue. I've run with OPENBLAS_NUM_THREADS=1 for a long time, which is why I could always reproduce the issues from Jenkins runs on Sherlock with wcEcoli2.
@1fish2 I was not using Docker, and OPENBLAS_NUM_THREADS was not set. The number of threads available to openblas could explain the different behavior.
Agreed. We might have to limit to OPENBLAS_NUM_THREADS=1. Is it sufficient to make that bold in the pyenv setup instructions and tell everyone? Print a warning if the value is different in the parca and sim Firetasks?
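Such a warning could be as small as this sketch (check_openblas_threads is a hypothetical helper, not existing project code):

```python
# Hypothetical guard for the parca/sim Firetasks: warn when
# OPENBLAS_NUM_THREADS isn't pinned to 1.
import os
import warnings

def check_openblas_threads():
    threads = os.environ.get('OPENBLAS_NUM_THREADS')
    if threads != '1':
        warnings.warn(
            'OPENBLAS_NUM_THREADS=%r; set it to 1 for reproducible'
            ' results across machines and runs.' % threads)
```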
[6] numpy cannot load the brew-installed openblas on my Mac despite many tries to link them up. I don't know if that's specific to this Mac after messing with gcc@9 et al to reproduce the Python 2 runtime environment. It makes the Docker approach more attractive.
[7] wcEcoli5 on Python 3.8.5 rebuilt with numpy, scipy no-binary, linked to manually-compiled openblas with "NO_AVX2=1" (and a table tweak):
THREADS DOT PRODUCT DIFF FROM 1 THREAD
1 0.016683805584112754 0.0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.204170427930421e-17
4 0.016683805584112643 -1.1102230246251565e-16
5 0.016683805584112643 -1.1102230246251565e-16
6 0.016683805584112702 -5.204170427930421e-17
7 0.016683805584112702 -5.204170427930421e-17
8 0.016683805584112702 -5.204170427930421e-17
9 0.016683805584112702 -5.204170427930421e-17
10 0.016683805584112702 -5.204170427930421e-17
11 0.016683805584112702 -5.204170427930421e-17
12 0.016683805584112702 -5.204170427930421e-17
unset 0.016683805584112702 -5.204170427930421e-17
[8] ditto but "NO_AVX2=0"
THREADS DOT PRODUCT DIFF FROM 1 THREAD
1 0.016683805584112754 0.0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.204170427930421e-17
4 0.016683805584112643 -1.1102230246251565e-16
5 0.016683805584112643 -1.1102230246251565e-16
6 0.016683805584112647 -1.0755285551056204e-16
7 0.016683805584112647 -1.0755285551056204e-16
8 0.016683805584112647 -1.0755285551056204e-16
9 0.016683805584112647 -1.0755285551056204e-16
10 0.016683805584112647 -1.0755285551056204e-16
11 0.016683805584112647 -1.0755285551056204e-16
12 0.016683805584112647 -1.0755285551056204e-16
unset 0.016683805584112647 -1.0755285551056204e-16
"NO_AVX2=1" produced identical results with 1-5 threads and slightly closer results with 6+ threads. What does that imply? It's not the wildly wrong results that made us set "NO_AVX2=1" for Docker for Mac. I guess using the vectorization hardware entails breaking the chunks into smaller vectorization subchunks.
(FWIW, 3 runs each at "NO_AVX2=1" and "NO_AVX2=0" took essentially the same amount of elapsed time, but this duration is probably dominated by process exec/quit, piping, and printing.)
BTW the embedded openblas in numpy & scipy gets the same results in this test as the manually-compiled openblas with "NO_AVX2=1" and in the same amount of time.
[9] Results on Mac in pyenv wcEcoli2 (with tweaked floating point formatting):
THREADS DOT PRODUCT DIFF FROM 1 THREAD
1 0.016683805584112754 0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.2041704279304213e-17
4 0.016683805584112643 -1.1102230246251565e-16
5 0.016683805584112643 -1.1102230246251565e-16
6 0.016683805584112647 -1.0755285551056204e-16
7 0.016683805584112647 -1.0755285551056204e-16
8 0.016683805584112647 -1.0755285551056204e-16
9 0.016683805584112647 -1.0755285551056204e-16
10 0.016683805584112647 -1.0755285551056204e-16
11 0.016683805584112647 -1.0755285551056204e-16
12 0.016683805584112647 -1.0755285551056204e-16
unset 0.016683805584112647 -1.0755285551056204e-16
[10] on a 4-CPU Sherlock node in pyenv wcEcoli2 (ditto wcEcoli-paper):
THREADS DOT PRODUCT DIFF FROM 1 THREAD
1 0.016683805584112754 0
2 0.016683805584112754 0
3 0.016683805584112754 0
4 0.016683805584112754 0
unset 0.016683805584112754 0
[11] in pyenv wcEcoli3:
THREADS DOT PRODUCT DIFF FROM 1 THREAD
1 0.016683805584112754 0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.2041704279304213e-17
4 0.016683805584112643 -1.1102230246251565e-16
unset 0.016683805584112643 -1.1102230246251565e-16
Hypotheses:
Q. Is this OpenBLAS THREADS difference the only cause of parca differences? @tahorst did this test data come from a parca run?
More tests on Mac. We can rule out a change to OpenBLAS, Numpy, or Python (except maybe combinations):
[12] Python 3.8.3, numpy==1.19.0, openblas-0.3.5:
THREADS DOT PRODUCT DIFF FROM 1 THREAD
1 0.016683805584112754 0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.2041704279304213e-17
4 0.016683805584112643 -1.1102230246251565e-16
5 0.016683805584112643 -1.1102230246251565e-16
6 0.016683805584112702 -5.2041704279304213e-17
7 0.016683805584112702 -5.2041704279304213e-17
8 0.016683805584112702 -5.2041704279304213e-17
9 0.016683805584112702 -5.2041704279304213e-17
10 0.016683805584112702 -5.2041704279304213e-17
11 0.016683805584112702 -5.2041704279304213e-17
12 0.016683805584112702 -5.2041704279304213e-17
unset 0.016683805584112702 -5.2041704279304213e-17
[13] Python 3.8.3, numpy==1.14.6, openblas-0.3.5:
THREADS DOT PRODUCT DIFF FROM 1 THREAD
1 0.016683805584112754 0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.2041704279304213e-17
4 0.016683805584112643 -1.1102230246251565e-16
5 0.016683805584112643 -1.1102230246251565e-16
6 0.016683805584112702 -5.2041704279304213e-17
7 0.016683805584112702 -5.2041704279304213e-17
8 0.016683805584112702 -5.2041704279304213e-17
9 0.016683805584112702 -5.2041704279304213e-17
10 0.016683805584112702 -5.2041704279304213e-17
11 0.016683805584112702 -5.2041704279304213e-17
12 0.016683805584112702 -5.2041704279304213e-17
unset 0.016683805584112702 -5.2041704279304213e-17
[14] Python 2.7.16, numpy==1.14.6, openblas-0.3.5:
THREADS DOT PRODUCT DIFF FROM 1 THREAD
1 0.016683805584112754 0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.2041704279304213e-17
4 0.016683805584112643 -1.1102230246251565e-16
5 0.016683805584112643 -1.1102230246251565e-16
6 0.016683805584112702 -5.2041704279304213e-17
7 0.016683805584112702 -5.2041704279304213e-17
8 0.016683805584112702 -5.2041704279304213e-17
9 0.016683805584112702 -5.2041704279304213e-17
10 0.016683805584112702 -5.2041704279304213e-17
11 0.016683805584112702 -5.2041704279304213e-17
12 0.016683805584112702 -5.2041704279304213e-17
unset 0.016683805584112702 -5.2041704279304213e-17
@tahorst are you getting similar results on your local machine? Did it ever produce consistent results on macOS with varying OPENBLAS_NUM_THREADS?
Hypotheses left?
> Q. Is this OpenBLAS THREADS difference the only cause of parca differences? Did this test data come from a parca run?
I first noticed a difference between Sherlock and local runs when I couldn't reproduce the Jenkins failures locally. I checked the parca output and it showed differences. Then there were some updates (newer Theano version and other changes) and it looked like parca output was the same but sim output differed, so I don't think there are actually any parca differences any more (it could be worth confirming again with THREADS unset). This test data is actually from a sim calculating the mass update for a time step.
> Are you getting similar results on your local machine?
I get the same results for all local environments tested, and it looks like mine match yours up to 5 threads:
- Python 3.8.3, numpy==1.19.0, openblas default from numpy
- Python 3.8.3, numpy==1.19.0, openblas-0.3.5
- Python 2.7.16, numpy==1.14.6, openblas-0.3.5
THREADS DOT PRODUCT DIFF FROM 1 THREAD
1 0.016683805584112754 0
2 0.016683805584112615 -1.3877787807814457e-16
3 0.016683805584112702 -5.2041704279304213e-17
4 0.016683805584112643 -1.1102230246251565e-16
5 0.016683805584112643 -1.1102230246251565e-16
6 0.016683805584112647 -1.0755285551056204e-16
7 0.016683805584112532 -2.2204460492503131e-16
8 0.016683805584112532 -2.2204460492503131e-16
9 0.016683805584112647 -1.0755285551056204e-16
10 0.016683805584112643 -1.1102230246251565e-16
11 0.016683805584112671 -8.3266726846886741e-17
12 0.016683805584112536 -2.1857515797307769e-16
unset 0.016683805584112536 -2.1857515797307769e-16
I'm thinking at this point we just make sure we set OPENBLAS_NUM_THREADS=1 (or install openblas with an option for only one thread). It seems like the OpenBLAS library linked on Sherlock for wcEcoli2 (0.3.9) is the only one that is independent of the number of threads specified. Did you test OpenBLAS version 0.3.9 in any other setups?
I got the unit test to pass locally by installing a single-threaded version of OpenBLAS (the instructions from Sherlock with USE_THREAD=0 added) and linking numpy to that version with pip install numpy==1.19.0 --no-binary numpy and proper .numpy-site.cfg paths:
git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS/
git checkout v0.3.9
USE_THREAD=0 make FC=gfortran
make PREFIX=/opt/OpenBLAS-0.3.9 install
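For reference, the matching .numpy-site.cfg would look something like this (the paths follow the PREFIX above and are assumptions; the runtime_library_dirs line matters per the Sherlock findings later in this thread):

```ini
[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS-0.3.9/lib
include_dirs = /opt/OpenBLAS-0.3.9/include
runtime_library_dirs = /opt/OpenBLAS-0.3.9/lib
```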
I think we can either specify instructions to install OpenBLAS with a single thread or use OPENBLAS_NUM_THREADS=1 and that should fix this issue.
> Did you test OpenBLAS version 0.3.9 in any other setups?
Yes, and lately I've been using 0.3.10 on my Mac and in wcm-runtime/Dockerfile, including (IIRC) in the test runs above.
Initially we had to clone the OpenBLAS repo to get an unreleased bug fix. Now there are downloadable releases up to 0.3.10.
Travis found that building OpenBLAS with NO_AVX2=1 produces different results, so it's better to avoid that when possible, and building on Linux with NO_AVX2=0 inside a Dockerfile produces an Image that ran just fine on Mac! Let's test that with a Cloud Build and, if it all works, edit cloud/docker/runtime/Dockerfile and related files.
Testing a cloud-built wcm-code using the test-openblas branch plus these adjustments not checked in:
- NO_AVX2=0
- numpy==1.19.1 scipy==1.5.2 (I forgot what bug fixes led me to pick up these "dot-dot" updates)

(1) test_openblas_threads.py gets results consistent with pyenv wcEcoli2 on Sherlock:
THREADS DOT PRODUCT DIFF FROM 1 THREAD
1 0.016683805584112754 0
2 0.016683805584112754 0
3 0.016683805584112754 0
4 0.016683805584112754 0
5 0.016683805584112754 0
6 0.016683805584112754 0
unset 0.016683805584112754 0
That further validates Travis' idea that we can build Docker Images with NO_AVX2=0 on Linux and still run on Mac, and it disproves the OpenBLAS team's hypothesis about where the AVX2 problem is.
(2) That's surprising since the macOS host outside Docker was not getting consistent results (see above) but -- whoa! -- it does now! I don't know what change made that happen, but testing with a Python 3.8.5 virtualenv that has numpy==1.19.0 shows it's not the numpy update.
(3) Inside Docker-for-Mac without the --user $(id -u):$(id -g) option, $(id -u):$(id -g) is 0:0; in the host it's 501:20, and yet it creates files with the right host-side owner in a mapped directory! The --user option was a workaround we needed in Google Compute Engine, but maybe it's irrelevant in Docker Desktop.
@tahorst, does this happen in Docker-for-Windows?
(4) I tried keeping the openblas build directory, then running make tests in that directory inside the container. It doesn't get far:
touch libopenblas_haswellp-r0.3.10.a
make -j 6 -C test all
make[1]: Entering directory '/openblas/OpenBLAS-0.3.10/test'
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -o sblat1 sblat1.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../.. -lc
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -o dblat1 dblat1.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../.. -lc
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -o cblat1 cblat1.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../.. -lc
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -o zblat1 zblat1.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../.. -lc
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -o sblat2 sblat2.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../.. -lc
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -o dblat2 dblat2.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../.. -lc
/usr/bin/ld: ../libopenblas_haswellp-r0.3.10.a: file not recognized: file truncated
(the same ld error from three parallel link jobs, interleaved)
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
./libopenblas_haswellp-r0.3.10.a: file not recognized: file truncated
collect2: error: ld returned 1 exit status
...
(5) runParca gets nearly done, then prints:
Fitting promoter binding
Killed
and exits with code 137, i.e. 128 + 9: the process was SIGKILLed, here by the container running out of memory.
==> I had to adjust Docker Desktop Preferences/Resources/Memory from 2GB to 4GB. (Would 3GB suffice?) I'm adding a note to the README.
(6) compareParca.py reports that the parca output from this Docker container is the same as with numpy 1.19.0 but has lots of numeric differences from Mac outside Docker.
> (2) That's surprising since the macOS host outside Docker was not getting consistent results (see above) but -- whoa! -- it does now! I do not know what change made that happen but testing with a Python 3.8.5 virtualenv that has numpy==1.19.0 shows it's not the numpy update.
The results were inside a Docker container? Do you think there was only 1 CPU available when it was built in the cloud so when compiling OpenBLAS, the max number of threads was set to 1?
> @tahorst, does this happen in Docker-for-Windows?
Docker-for-Windows actually runs through WSL2, which is a Linux kernel, so unsurprisingly it gets the same behavior we see in other Linux environments and writes as root if the --user option isn't supplied.
> (4) I tried keeping the openblas build directory, then running make tests in that directory inside the container. It doesn't get far:
That's for the new image run on Mac? You still get good results when running the parca/sim though?
> ==> I had to adjust Docker Desktop Preferences/Resources/Memory from 2GB to 4GB. (Would 3GB suffice?) I'm adding a note to the README.
There's actually already a note about this in docs/README.md:
> NOTE: If you encounter memory issues while using Docker Desktop (the default allocated memory is 2GB) causing the simulation to get killed midway, click the Docker icon > Preferences > Advanced > adjust memory to 4GB.
> (2) That's surprising since the macOS host outside Docker was not getting consistent results (see above) but -- whoa! -- it does now! I do not know what change made that happen but testing with a Python 3.8.5 virtualenv that has numpy==1.19.0 shows it's not the numpy update.
> The results were inside a Docker container?
Results (1) were inside the container (where it thinks there are 6 CPUs -- default Docker allocation), while results (2) were outside the container (where it thinks there are 12 CPUs).
> Do you think there was only 1 CPU available when it was built in the cloud, so when compiling OpenBLAS, the max number of threads was set to 1?
Could be. Any idea how to tell by examining the /openblas/OpenBLAS-0.3.10 build directory or the libopenblas_haswell-r0.3.10.so file?
> (4) I tried keeping the openblas build directory, then running make tests in that directory inside the container. It doesn't get far:
> That's for the new image run on Mac?
Yes.
> You still get good results when running the parca/sim though?
I just started 2 sim gens. We'll see.
> There's actually already a note about this in docs/README.md:
Arg. I'll change it to a setup step so it's harder to miss.
> Could be. Any idea how to tell by examining the /openblas/OpenBLAS-0.3.10 build directory or the libopenblas_haswell-r0.3.10.so file?
Do you have the build log? It should say if it was single-threaded when compiled or how many CPUs it was configured for. Not sure how to get that info after compilation...
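One way to probe a built library directly -- an untested sketch using OpenBLAS's exported C functions, with the .so path from above:

```python
# OpenBLAS exports openblas_get_parallel(): 0 = single-threaded build
# (USE_THREAD=0), 1 = pthreads, 2 = OpenMP. openblas_get_num_threads()
# reports the configured thread count.
import ctypes

lib = ctypes.CDLL('/openblas/OpenBLAS-0.3.10/libopenblas_haswell-r0.3.10.so')
print('parallel model:', lib.openblas_get_parallel())
print('num threads:', lib.openblas_get_num_threads())
```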
The build log confirms that OpenBLAS was single-threaded: libopenblas_haswell-r0.3.10.a (Single-threading).
I'll test USE_THREAD=0 on Mac to see if that gets results consistent with the others.
In the previous tests inside and outside Docker, the parca and sim outputs differ, although I didn't run analysis plots yet. Here are the behavior metrics, if that helps judge whether the results are good:
docker-numpy-1.19.1:
metric mode expected_min expected_max value passes
0 dnaMass end_start_ratio 1.98150 1.981600 1.98511 False
1 doublingTime doubling_time 2910.00000 2920.000000 2950 False
2 fluxomeCorrelation correlation 0.63497 0.634980 0.658865 False
3 limiting_metabolites limiting_metabolites NaN NaN {} False
4 limiting_metabolites num_limiting_metabolites 2.00000 2.000000 0 False
5 mRnaMass end_start_ratio 2.43290 2.433000 2.18449 False
6 proteinMass end_start_ratio 2.11340 2.113500 2.12372 False
7 proteinMass max 342.72000 342.730000 344.604 False
8 proteinMass mean 241.33000 241.340000 242.366 False
9 proteinMass min 162.16000 162.170000 162.264 False
10 proteinMass stdev 51.85900 51.860000 52.4276 False
11 proteome_correlation schmidt_correlation 0.75000 0.750010 0.746113 False
12 proteome_correlation wisniewski_correlation 0.61569 0.615700 0.612515 False
13 rRnaMass end_start_ratio 2.25150 2.251600 2.22151 False
14 replication_elongation_rate mean 1932.40000 1932.500000 1932.46 True
15 ribosome mean_active_fraction 0.86770 0.867702 0.870178 False
16 ribosome mean_concentration 16186.00000 16187.000000 15907.1 False
17 rnaPolymerase mean_active_fraction 0.24678 0.246790 0.24749 False
18 rnaPolymerase mean_concentration 3294.60000 3294.700000 3194.87 False
19 smallMoleculeMass end_start_ratio 2.07820 2.078300 2.07858 False
20 tRnaMass end_start_ratio 2.28240 2.282500 2.25283 False
21 transcriptionElongationRate mean 59155.00000 59156.000000 57590.1 False
22 translationElongationRate mean 15.79300 15.794000 16.035 False
mac-numpy-1.19.1:
metric mode expected_min expected_max value passes
0 dnaMass end_start_ratio 1.98150 1.981600 1.97915 False
1 doublingTime doubling_time 2910.00000 2920.000000 2944 False
2 fluxomeCorrelation correlation 0.63497 0.634980 0.658657 False
3 limiting_metabolites limiting_metabolites NaN NaN {VAL[c], GLT[c]} True
4 limiting_metabolites num_limiting_metabolites 2.00000 2.000000 2 True
5 mRnaMass end_start_ratio 2.43290 2.433000 2.17393 False
6 proteinMass end_start_ratio 2.11340 2.113500 2.12721 False
7 proteinMass max 342.72000 342.730000 345.171 False
8 proteinMass mean 241.33000 241.340000 242.583 False
9 proteinMass min 162.16000 162.170000 162.264 False
10 proteinMass stdev 51.85900 51.860000 52.5483 False
11 proteome_correlation schmidt_correlation 0.75000 0.750010 0.750456 False
12 proteome_correlation wisniewski_correlation 0.61569 0.615700 0.615472 False
13 rRnaMass end_start_ratio 2.25150 2.251600 2.21675 False
14 replication_elongation_rate mean 1932.40000 1932.500000 1932.46 True
15 ribosome mean_active_fraction 0.86770 0.867702 0.872894 False
16 ribosome mean_concentration 16186.00000 16187.000000 15693.1 False
17 rnaPolymerase mean_active_fraction 0.24678 0.246790 0.246992 False
18 rnaPolymerase mean_concentration 3294.60000 3294.700000 3195.17 False
19 smallMoleculeMass end_start_ratio 2.07820 2.078300 2.07505 False
20 tRnaMass end_start_ratio 2.28240 2.282500 2.25239 False
21 transcriptionElongationRate mean 59155.00000 59156.000000 57537.5 False
22 translationElongationRate mean 15.79300 15.794000 16.3066 False
- USE_THREAD=0 on Mac produced lots of parca output differences compared to runs without that flag, also different than the docker runs.
- Repeated USE_THREAD=0 runs in a row on Mac produced matching parca outputs.

@tahorst do you have a test case for different output with NO_AVX2=1?
What kind of output do you want for the test case?
Maybe more than one test case: a NO_AVX2 test -- along with the OPENBLAS_NUM_THREADS test and maybe more -- to tell if we've fixed specific problems that caused inconsistent model output. Cumulatively, these should help us find more inconsistencies and eventually predict whether the model will produce consistent output across repeat runs and wherever it runs (OS, CPU, gcc, laptop, desktop, Sherlock, GCE, inside/outside Docker). NO_AVX2 changes the results. Does that vary by where we run it?

I'm not sure exactly where the differences are coming from, so I don't know if we can create a simple test case for the NO_AVX2 difference, but I can look into it.
We could brainstorm debugging techniques. E.g. run w/ and w/o NO_AVX2 in different Docker containers and compare a sequence of data snapshots.
You're thinking of looking at different points while the parca is running? My guess from the diffs that show up when comparing the final sim_data objects from the different Docker containers w/ and w/o NO_AVX2 is that the difference might come when we fit gene expression, so it could be the Theano functions or the solver there, but I would have to look into it to confirm. Comparing sim_data before and after that step in different containers could be a good way to isolate a simpler test case.
Good idea.
Top level questions:
Findings on Sherlock (pyenv wcEcoli3-staging):
- The ~/.numpy-site.cfg file needs a runtime_library_dirs = ... line to install numpy in a way that reliably loads the named openblas at runtime.
- The numpy source install also needs LDFLAGS="-shared $LDFLAGS", otherwise the compiler will fail with "crt1.o in function _start undefined reference to main".
- (With OPENBLAS_NUM_THREADS=1) it runs test_library_performance faster than a 5 year old compute node with 1 allocated CPU.

Findings in Docker:
- Build with NO_AVX2=0 in Docker-for-Linux and NO_AVX2=1 in Docker-for-Mac, regardless of where the Image will run, because NO_AVX2=0 computes more consistent results and runs fine in Docker-for-Mac (contrary to the openblas team's hypothesis).
- Add runtime_library_dirs for good measure and as a better reference for manual pyenv setup?

Findings on Mac & Linux?
Also:
- Set OPENBLAS_NUM_THREADS=1 for more consistent results (see test_openblas_threads.py) and faster performance. It's possible to avoid that need by compiling openblas with USE_THREAD=0.
- The different ways to install OpenBLAS and NumPy × different runtime environments produce at least 7 different equivalence classes of Parca outputs. There does not seem to be an installation approach (even if the instructions are environment-specific) that gets cross-platform consistent Parca output.
See https://github.com/xianyi/OpenBLAS/issues/2244#issuecomment-696510557
It could be useful to investigate where these Parca computations diverge, but I'm not inclined to spend the time on that.
The bug fix #980 for portability issue #979 raises this question: Is numpy's random number generator returning different results between @ggsun's local machine and Sherlock?
What could cause that?
There are uses of the default random instance in prototypes/, tests, and sim variants (those should be OK), analysis scripts (problematic when analyses run in parallel but won't affect the sims), and polymerize (that's concerning).
When moving to Python 3, we also moved from numpy 1.14.6 to 1.19.2. numpy.random did change in 1.17.0 (see NEP 19), mainly by adding a new generator PCG64 "which has better statistical properties than the legacy MT19937 used in RandomState". (It's faster, too.) NEP 19 also says RandomState overpromised on compatibility.
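For illustration (generic numpy usage, not wcEcoli code), the old and new APIs draw different streams from the same seed:

```python
# The legacy RandomState API keeps its MT19937 stream stable for a
# given seed, while the new Generator/PCG64 API (numpy >= 1.17) is a
# different stream and makes no cross-version stream promise.
import numpy as np

legacy = np.random.RandomState(seed=42)   # legacy MT19937
modern = np.random.default_rng(seed=42)   # new PCG64 Generator

print(legacy.random_sample(3))
print(modern.random(3))   # a different sequence from the same seed
```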
runSim --intermediates-directory and comparePickles are useful tools to help localize this problem.
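Hypothetical usage (the exact script paths are assumptions; only the flag and tool names come from this thread):

```
$ python runscripts/manual/runSim.py --intermediates-directory out/a/intermediates
$ python runscripts/debug/comparePickles.py out/a/intermediates out/b/intermediates
```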
Also see #1097
The CACM article Keeping Science on Keel When Software Moves describes techniques and tools that a climate modeling project uses to debug reproducibility problems in their large models.
The article lacks key details, but the FLiT tool's technique as described might be relevant.
Also, all the examples in the article came down to compiler optimizations that used Fused Multiply-Add (FMA) floating point instructions.
It's nice to see there's a tool to test for the floating point issues, and it's interesting that other people have run into these problems. Do you think there are compiler flags for BLAS (I'm assuming that's where this is coming from, but it could be other dependencies) that make it slightly less efficient but reproducible? I also wonder if we can truncate/round floating point results in our code where these differences come up, so our precision is lower but we get above the machine-level noise.
It'd be good to brainstorm on this problem.
More observations:
- An OpenBLAS fix "corrected the Haswell DROT kernel to require AVX2/FMA3 rather than AVX512".

> I also wonder if we can truncate/round floating point results in our code where these differences come up so our precision is less but we get above the machine level noise.
Interesting idea! It might require localizing the cause to particular pieces of code. Then it might be able to erase some hardware variability and some order-of-evaluation variability.
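A sketch of the truncation idea (round_sig_bits is a hypothetical helper; the bit count would need tuning against the observed noise):

```python
# Round float64 mantissas to fewer significant bits so last-bit
# differences from thread count or FMA use usually cancel out
# (values straddling a rounding boundary can still differ).
import numpy as np

def round_sig_bits(x, bits=40):
    mantissa, exponent = np.frexp(x)
    mantissa = np.round(mantissa * 2.0**bits) / 2.0**bits
    return np.ldexp(mantissa, exponent)

a = 0.016683805584112754   # 1-thread result from the tables above
b = 0.016683805584112615   # 2-thread result; differs by ~1.4e-16
print(round_sig_bits(a) == round_sig_bits(b))  # expected True
```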
This option sounds promising for performance in Docker but riskier for reproducibility:

> OpenBLAS can be built for multiple targets with runtime detection of the target cpu by specifying DYNAMIC_ARCH=1 in Makefile.rule, on the gmake command line or as -DDYNAMIC_ARCH=TRUE in cmake.
Another variable is the C++ compiler: clang on macOS vs. gcc on Ubuntu, Docker, and CentOS (Sherlock). We could set up clang (how?) for use by pyenv, pip, Aesara, Cython, & Numba.
What triggered this realization is Santiago and Niels hit a problem on Ubuntu where Parca calls Aesara, which fails to compile & link some code. A workaround is to install Python using PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 3.8.7.
I noticed that Sherlock output with python3 does not match my local output. This starts with small floating point differences in nearly all expression-related items in sim_data. Does anyone else see a difference in parca output on their local environment vs Sherlock? I was using python runscripts/debug/compareParca.py out/local out/sherlock to compare local output to Sherlock output that I copied locally. Just wondering if it's an installation difference or a machine difference.