CovertLab / wcEcoli

Whole Cell Model of E. coli

Different parca output on Sherlock vs local #931

Open tahorst opened 4 years ago

tahorst commented 4 years ago

I noticed that Sherlock output with python3 does not match my local output. This starts with small floating point differences in nearly all expression-related items in sim_data. Does anyone else see a difference in parca output on their local environment vs Sherlock? I was using python runscripts/debug/compareParca.py out/local out/sherlock to compare my local output against Sherlock output that I copied locally. Just wondering if it's an installation difference or a machine difference.

U8NWXD commented 4 years ago

I've found simData from runParca.py to differ between my local machine and Google Cloud machines, and between different Google Cloud machines (one running Debian, the other Ubuntu). That was true even before the py3 migration.

1fish2 commented 4 years ago

Hypotheses:

  1. Different libraries installed, or compiled differently, e.g. Sherlock has an older version of OpenBLAS.
  2. Hardware and OS dependencies.
  3. Something like pseudo-random number seeding in our code that gets different results on different machines or even different runs.

tahorst commented 4 years ago

Serial vs parallel results were the same locally and on Sherlock, so it seems to be reproducible on a given machine, which is good, but not reproducible between machines. It is interesting that this showed up before py3 on Google Cloud, but it was at least consistent between my machine and Sherlock before the switch. Certainly some good hypotheses, Jerry!

tahorst commented 4 years ago

I checked again and it looks like we might have consistent parca outputs now (maybe it was a library issue with the recent update to Theano), but there is still a sim diff between the two. It looks like we've run into another OpenBLAS thread issue. I pushed a branch that does a dot product (the arrays were pulled from a sim that was producing different results) and produces different results depending on the OPENBLAS_NUM_THREADS environment variable. I'm not sure if there's an easy way to turn this into a unit test, since I don't think we can change the number of threads after loading numpy, but this can at least show the problem.
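
For reference, the core of such a script could be as small as this (a sketch, not the actual branch contents; the .npy file names are placeholders for the arrays pulled from the sim):

#!/usr/bin/env python
"""Sketch of a test.py that exposes the thread-count dependence: load two
arrays captured from a sim and print their dot product."""
import numpy as np

a = np.load('array_a.npy')  # placeholder file names
b = np.load('array_b.npy')
print(np.dot(a, b))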

Local machine:

$ cd test-openblas
$ for c in {1..12}; do OPENBLAS_NUM_THREADS=$c ./test.py; done
0.016683805584112754
0.016683805584112615
0.016683805584112702
0.016683805584112643
0.016683805584112643
0.016683805584112647
0.016683805584112532
0.016683805584112532
0.016683805584112647
0.016683805584112643
0.01668380558411267
0.016683805584112536

Sherlock with wcEcoli3 and 8 CPUs:

$ for c in {1..12}; do OPENBLAS_NUM_THREADS=$c ./test.py; done
0.016683805584112754
0.016683805584112615
0.016683805584112702
0.016683805584112643
0.016683805584112643
0.016683805584112647
0.016683805584112532
0.016683805584112532
0.016683805584112532
0.016683805584112532
0.016683805584112532
0.016683805584112532

The old pyenv (wcEcoli2) on Sherlock gave consistent results:

$ for c in {1..12}; do OPENBLAS_NUM_THREADS=$c ./test.py; done
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754
0.016683805584112754

1fish2 commented 4 years ago

Awesome! A compact, repeatable test case is usually the hardest step in debugging.

We could probably make a unit test that uses multiprocessing to exec a process with a different OPENBLAS_NUM_THREADS.

The environment variable NUM_THREADS supposedly affects compiling OpenBLAS by setting the max number of threads. Otherwise it tries to figure that out from the hardware it compiles on, which means we should be careful not to compile it on a Sherlock login node.
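
A rough sketch of that idea, using subprocess so each child process imports numpy fresh with its own OPENBLAS_NUM_THREADS (the test.py helper path here is hypothetical):

import os
import subprocess
import sys
import unittest

class TestOpenblasThreads(unittest.TestCase):
    def test_dot_product_independent_of_thread_count(self):
        """Run the dot product in a child process per thread count, since
        OPENBLAS_NUM_THREADS must be set before numpy loads OpenBLAS."""
        results = []
        for threads in range(1, (os.cpu_count() or 1) + 1):
            env = dict(os.environ, OPENBLAS_NUM_THREADS=str(threads))
            completed = subprocess.run(
                [sys.executable, 'test.py'],  # hypothetical helper script
                env=env, capture_output=True, text=True, check=True)
            results.append(float(completed.stdout))
        self.assertEqual(len(set(results)), 1, results)

if __name__ == '__main__':
    unittest.main()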

1fish2 commented 4 years ago

I turned test-openblas into a unit test.

Here's the output on my Mac, running with Numpy's embedded OpenBLAS:

$ pytest wholecell/tests/utils/test_openblas_threads.py
...

_____________________________________________________________________ test_openblas _____________________________________________________________________

    def test_openblas():
        """Compare a dot product running with various numbers of OpenBLAS threads."""
        ...
>       np.testing.assert_allclose(np_products, reference, rtol=1e-17)
E    AssertionError:
E    Not equal to tolerance rtol=1e-17, atol=0
E
E    Mismatched elements: 11 / 12 (91.7%)
E    Max absolute difference: 1.38777878e-16
E    Max relative difference: 8.31811887e-15
E     x: array([0.016684, 0.016684, 0.016684, 0.016684, 0.016684, 0.016684,
E           0.016684, 0.016684, 0.016684, 0.016684, 0.016684, 0.016684])
E     y: array([0.016684, 0.016684, 0.016684, 0.016684, 0.016684, 0.016684,
E           0.016684, 0.016684, 0.016684, 0.016684, 0.016684, 0.016684])

wholecell/tests/utils/test_openblas_threads.py:46: AssertionError
----------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------
THREADS               DOT PRODUCT                      DIFF
      1      0.016683805584112754                       0.0
      2      0.016683805584112615   -1.3877787807814457e-16
      3      0.016683805584112702    -5.204170427930421e-17
      4      0.016683805584112643   -1.1102230246251565e-16
      5      0.016683805584112643   -1.1102230246251565e-16
      6      0.016683805584112647   -1.0755285551056204e-16
      7      0.016683805584112647   -1.0755285551056204e-16
      8      0.016683805584112647   -1.0755285551056204e-16
      9      0.016683805584112647   -1.0755285551056204e-16
     10      0.016683805584112647   -1.0755285551056204e-16
     11      0.016683805584112647   -1.0755285551056204e-16
     12      0.016683805584112647   -1.0755285551056204e-16

tahorst commented 4 years ago

Nice! I guess now we just need to check which versions of openblas produce the right results and then we can add that unit test in. We could also check the number of CPUs available and only run up to that number instead of running 12 each time.

tahorst commented 4 years ago

I made a small update to the branch to include a test with OPENBLAS_NUM_THREADS unset and to only run up to the number of available CPUs.

1fish2 commented 4 years ago

Cool.

I'm testing this in various environments, so far just on my Mac.

@U8NWXD were your tests on GCloud in a Docker Container? What was OPENBLAS_NUM_THREADS set to?

Q: Parallelizing vector dot presumably divides the two vectors into one chunk per thread and adds their portions of the dot product. Floating point rounding will thus vary with the number of chunks. How much variation should be acceptable? Where to set the np.testing.assert_allclose(rtol=?, atol=?)?

Some measurements:

[1] (wcEcoli3)
THREADS               DOT PRODUCT                      DIFF
      1      0.016683805584112754                       0.0
      2      0.016683805584112615   -1.3877787807814457e-16
      3      0.016683805584112702    -5.204170427930421e-17
      4      0.016683805584112643   -1.1102230246251565e-16
      5      0.016683805584112643   -1.1102230246251565e-16
      6      0.016683805584112647   -1.0755285551056204e-16
      7      0.016683805584112647   -1.0755285551056204e-16
      8      0.016683805584112647   -1.0755285551056204e-16
      9      0.016683805584112647   -1.0755285551056204e-16
     10      0.016683805584112647   -1.0755285551056204e-16
     11      0.016683805584112647   -1.0755285551056204e-16
     12      0.016683805584112647   -1.0755285551056204e-16

[2] Quitting Docker Desktop didn't change that at all, even though it might've freed up some CPUs.

[3] (wcEcoli2)
      1      0.016683805584112754      0.000000000000000000
      2      0.016683805584112615     -0.000000000000000139
      3      0.016683805584112702     -0.000000000000000052
      4      0.016683805584112643     -0.000000000000000111
      5      0.016683805584112643     -0.000000000000000111
      6      0.016683805584112647     -0.000000000000000108
      7      0.016683805584112647     -0.000000000000000108
      8      0.016683805584112647     -0.000000000000000108
      9      0.016683805584112647     -0.000000000000000108
     10      0.016683805584112647     -0.000000000000000108
     11      0.016683805584112647     -0.000000000000000108
     12      0.016683805584112647     -0.000000000000000108

[4] (wcEcoli5) on Python 3.8.5; I don't remember whether it has numpy & scipy installed from binary vs. source linked to brew openblas. I deleted it when trying to rebuild wcEcoli3 but will recreate it.
      1      0.016683805584112754      0.000000000000000000
      2      0.016683805584112615     -0.000000000000000139
      3      0.016683805584112702     -0.000000000000000052
      4      0.016683805584112643     -0.000000000000000111
      5      0.016683805584112643     -0.000000000000000111
      6      0.016683805584112647     -0.000000000000000108
      7      0.016683805584112647     -0.000000000000000108
      8      0.016683805584112647     -0.000000000000000108
      9      0.016683805584112647     -0.000000000000000108
     10      0.016683805584112647     -0.000000000000000108
     11      0.016683805584112647     -0.000000000000000108
     12      0.016683805584112647     -0.000000000000000108

[5] In the wcm Docker Image, with NO_AVX2=1

THREADS               DOT PRODUCT                      DIFF
      1      0.016683805584112754                       0.0
      2      0.016683805584112615   -1.3877787807814457e-16
      3      0.016683805584112702    -5.204170427930421e-17
      4      0.016683805584112643   -1.1102230246251565e-16
      5      0.016683805584112643   -1.1102230246251565e-16
      6      0.016683805584112702    -5.204170427930421e-17
      7      0.016683805584112702    -5.204170427930421e-17
      8      0.016683805584112702    -5.204170427930421e-17
      9      0.016683805584112702    -5.204170427930421e-17
     10      0.016683805584112702    -5.204170427930421e-17
     11      0.016683805584112702    -5.204170427930421e-17
     12      0.016683805584112702    -5.204170427930421e-17

[6] wcEcoli3 rebuilt with numpy, scipy no-binary, linked to brew-installed openblas

No success yet installing a working numpy this way.

[7] wcEcoli3 rebuilt with numpy, scipy no-binary, linked to manually-compiled openblas with NO_AVX2=1

TODO

tahorst commented 4 years ago

Q: Parallelizing vector dot presumably divides the two vectors into one chunk per thread and adds their portions of the dot product. Floating point rounding will thus vary with the number of chunks. How much variation should be acceptable? Where to set the np.testing.assert_allclose(rtol=?, atol=?)?

Maybe this is just a problem that can't be solved if we allow numpy parallelization, since we'll always have issues with floating point rounding. I think the main issue arises from the stochastic nature of our sims. Any bit of floating point imprecision could lead to a slightly different random result that produces one molecule instead of another, which can greatly amplify the difference (even a value that's only off by 1e-17 could produce an essential protein instead of some other protein). For that reason I think the test would need zero tolerance for the difference. wcEcoli2 on Sherlock produced the same results no matter how many threads were specified, but that might just be because that OpenBLAS library was installed to only use 1 thread, not because the other OpenBLAS versions tried are free of the issue. I've run with OPENBLAS_NUM_THREADS=1 for a long time, which is why I could always reproduce the issues from Jenkins runs on Sherlock with wcEcoli2.

U8NWXD commented 4 years ago

@1fish2 I was not using Docker, and OPENBLAS_NUM_THREADS was not set. The number of threads available to openblas could have explained the different behavior.

1fish2 commented 4 years ago

Agreed. We might have to limit to OPENBLAS_NUM_THREADS=1. Is it sufficient to make that bold in the pyenv setup instructions and tell everyone? Print a warning if the value differs in the parca and sim Firetasks (something like the sketch below)?
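
Something like this could do the check (a sketch; the function name and where it would be called from are assumptions, not an existing Firetask hook):

import os
import warnings

def check_openblas_thread_count(expected='1'):
    """Warn if OPENBLAS_NUM_THREADS isn't pinned, since results then depend
    on the machine's CPU count."""
    value = os.environ.get('OPENBLAS_NUM_THREADS')
    if value != expected:
        warnings.warn(
            'OPENBLAS_NUM_THREADS={!r}; set it to {!r} for reproducible '
            'parca and sim results'.format(value, expected))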

[6] numpy cannot load the brew-installed openblas on my Mac despite many tries to link them up. I don't know if that's specific to this Mac after messing with gcc@9 et al to reproduce the Python 2 runtime environment. It makes the Docker approach more attractive.

[7] wcEcoli5 on Python 3.8.5 rebuilt with numpy, scipy no-binary, linked to manually-compiled openblas with "NO_AVX2=1" (and a table tweak)
THREADS               DOT PRODUCT        DIFF FROM 1 THREAD
      1      0.016683805584112754                       0.0
      2      0.016683805584112615   -1.3877787807814457e-16
      3      0.016683805584112702    -5.204170427930421e-17
      4      0.016683805584112643   -1.1102230246251565e-16
      5      0.016683805584112643   -1.1102230246251565e-16
      6      0.016683805584112702    -5.204170427930421e-17
      7      0.016683805584112702    -5.204170427930421e-17
      8      0.016683805584112702    -5.204170427930421e-17
      9      0.016683805584112702    -5.204170427930421e-17
     10      0.016683805584112702    -5.204170427930421e-17
     11      0.016683805584112702    -5.204170427930421e-17
     12      0.016683805584112702    -5.204170427930421e-17
             0.016683805584112702    -5.204170427930421e-17

[8] ditto but "NO_AVX2=0"
THREADS               DOT PRODUCT        DIFF FROM 1 THREAD
      1      0.016683805584112754                       0.0
      2      0.016683805584112615   -1.3877787807814457e-16
      3      0.016683805584112702    -5.204170427930421e-17
      4      0.016683805584112643   -1.1102230246251565e-16
      5      0.016683805584112643   -1.1102230246251565e-16
      6      0.016683805584112647   -1.0755285551056204e-16
      7      0.016683805584112647   -1.0755285551056204e-16
      8      0.016683805584112647   -1.0755285551056204e-16
      9      0.016683805584112647   -1.0755285551056204e-16
     10      0.016683805584112647   -1.0755285551056204e-16
     11      0.016683805584112647   -1.0755285551056204e-16
     12      0.016683805584112647   -1.0755285551056204e-16
             0.016683805584112647   -1.0755285551056204e-16

"NO_AVX2=1" produced identical results with 1-5 threads and slightly closer results with 6+ threads. What does that imply? It's not the wildly wrong results that made us set "NO_AVX2=1" for Docker for Mac. I guess using the vectorization hardware entails breaking the chunks into smaller vectorization subchunks.

(FWIW, 3 runs each at "NO_AVX2=1" and "NO_AVX2=0" took essentially the same amount of elapsed time, but this duration is probably dominated by process exec/quit, piping, and printing.)

1fish2 commented 4 years ago

BTW the embedded openblas in numpy & scipy gets the same results in this test as the manually-compiled openblas with "NO_AVX2=1" and in the same amount of time.

1fish2 commented 4 years ago

[9] Results on Mac in pyenv wcEcoli2 (with tweaked floating point formatting):
THREADS                DOT PRODUCT         DIFF FROM 1 THREAD
      1       0.016683805584112754                          0
      2       0.016683805584112615    -1.3877787807814457e-16
      3       0.016683805584112702    -5.2041704279304213e-17
      4       0.016683805584112643    -1.1102230246251565e-16
      5       0.016683805584112643    -1.1102230246251565e-16
      6       0.016683805584112647    -1.0755285551056204e-16
      7       0.016683805584112647    -1.0755285551056204e-16
      8       0.016683805584112647    -1.0755285551056204e-16
      9       0.016683805584112647    -1.0755285551056204e-16
     10       0.016683805584112647    -1.0755285551056204e-16
     11       0.016683805584112647    -1.0755285551056204e-16
     12       0.016683805584112647    -1.0755285551056204e-16
              0.016683805584112647    -1.0755285551056204e-16
[10] on a 4-CPU Sherlock node in pyenv wcEcoli2 (ditto wcEcoli-paper):
THREADS                DOT PRODUCT         DIFF FROM 1 THREAD
      1       0.016683805584112754                          0
      2       0.016683805584112754                          0
      3       0.016683805584112754                          0
      4       0.016683805584112754                          0
              0.016683805584112754                          0
[11] in pyenv wcEcoli3:
THREADS                DOT PRODUCT         DIFF FROM 1 THREAD
      1       0.016683805584112754                          0
      2       0.016683805584112615    -1.3877787807814457e-16
      3       0.016683805584112702    -5.2041704279304213e-17
      4       0.016683805584112643    -1.1102230246251565e-16
              0.016683805584112643    -1.1102230246251565e-16

Hypotheses:

Q. Is this OpenBLAS THREADs difference the only cause of parca differences?

@tahorst did this test data come from a parca run?

1fish2 commented 4 years ago

More tests on Mac. We can rule out a change to OpenBLAS, Numpy, or Python (except maybe combinations):

[12] Python 3.8.3, numpy==1.19.0, openblas-0.3.5:
THREADS                DOT PRODUCT         DIFF FROM 1 THREAD
      1       0.016683805584112754                          0
      2       0.016683805584112615    -1.3877787807814457e-16
      3       0.016683805584112702    -5.2041704279304213e-17
      4       0.016683805584112643    -1.1102230246251565e-16
      5       0.016683805584112643    -1.1102230246251565e-16
      6       0.016683805584112702    -5.2041704279304213e-17
      7       0.016683805584112702    -5.2041704279304213e-17
      8       0.016683805584112702    -5.2041704279304213e-17
      9       0.016683805584112702    -5.2041704279304213e-17
     10       0.016683805584112702    -5.2041704279304213e-17
     11       0.016683805584112702    -5.2041704279304213e-17
     12       0.016683805584112702    -5.2041704279304213e-17
              0.016683805584112702    -5.2041704279304213e-17
[13] Python 3.8.3, numpy==1.14.6, openblas-0.3.5:
THREADS                DOT PRODUCT         DIFF FROM 1 THREAD
      1       0.016683805584112754                          0
      2       0.016683805584112615    -1.3877787807814457e-16
      3       0.016683805584112702    -5.2041704279304213e-17
      4       0.016683805584112643    -1.1102230246251565e-16
      5       0.016683805584112643    -1.1102230246251565e-16
      6       0.016683805584112702    -5.2041704279304213e-17
      7       0.016683805584112702    -5.2041704279304213e-17
      8       0.016683805584112702    -5.2041704279304213e-17
      9       0.016683805584112702    -5.2041704279304213e-17
     10       0.016683805584112702    -5.2041704279304213e-17
     11       0.016683805584112702    -5.2041704279304213e-17
     12       0.016683805584112702    -5.2041704279304213e-17
              0.016683805584112702    -5.2041704279304213e-17
[14] Python 2.7.16, numpy==1.14.6, openblas-0.3.5:
THREADS                DOT PRODUCT         DIFF FROM 1 THREAD
      1       0.016683805584112754                          0
      2       0.016683805584112615    -1.3877787807814457e-16
      3       0.016683805584112702    -5.2041704279304213e-17
      4       0.016683805584112643    -1.1102230246251565e-16
      5       0.016683805584112643    -1.1102230246251565e-16
      6       0.016683805584112702    -5.2041704279304213e-17
      7       0.016683805584112702    -5.2041704279304213e-17
      8       0.016683805584112702    -5.2041704279304213e-17
      9       0.016683805584112702    -5.2041704279304213e-17
     10       0.016683805584112702    -5.2041704279304213e-17
     11       0.016683805584112702    -5.2041704279304213e-17
     12       0.016683805584112702    -5.2041704279304213e-17
              0.016683805584112702    -5.2041704279304213e-17

@tahorst are you getting similar results on your local machine?

Did it ever produce consistent results on macOS with varying OPENBLAS_NUM_THREADS?

Hypotheses left?

tahorst commented 4 years ago

Q. Is this OpenBLAS THREADs difference the only cause of parca differences? did this test data come from a parca run?

I first noticed a difference between Sherlock and local runs when I couldn't reproduce the Jenkins failures locally. I checked the parca output and it showed differences. Then there were some updates (newer Theano version and other changes) and it looked like the parca output was the same but the sim output differed, so I don't think there are actually any parca differences any more (it could be worth confirming again with THREADS unset). This test data is actually from a sim calculating the mass update for a time step.

are you getting similar results on your local machine?

I get the same results for all local environments tested, and it looks like they match yours up to 5 threads:

Python 3.8.3, numpy==1.19.0, openblas default from numpy
Python 3.8.3, numpy==1.19.0, openblas-0.3.5
Python 2.7.16, numpy==1.14.6, openblas-0.3.5

THREADS                DOT PRODUCT         DIFF FROM 1 THREAD                                                                                  
      1       0.016683805584112754                          0                                                                                  
      2       0.016683805584112615    -1.3877787807814457e-16                               
      3       0.016683805584112702    -5.2041704279304213e-17                                      
      4       0.016683805584112643    -1.1102230246251565e-16                                                                                  
      5       0.016683805584112643    -1.1102230246251565e-16
      6       0.016683805584112647    -1.0755285551056204e-16                        
      7       0.016683805584112532    -2.2204460492503131e-16                         
      8       0.016683805584112532    -2.2204460492503131e-16                                
      9       0.016683805584112647    -1.0755285551056204e-16                              
     10       0.016683805584112643    -1.1102230246251565e-16                       
     11       0.016683805584112671    -8.3266726846886741e-17
     12       0.016683805584112536    -2.1857515797307769e-16                
              0.016683805584112536    -2.1857515797307769e-16

I'm thinking at this point we just make sure we set OPENBLAS_NUM_THREADS=1 (or install openblas with an option for only one thread). It seems like the OpenBLAS library linked on Sherlock for wcEcoli2 (0.3.9) is the only one that is independent of the number of threads specified. Did you test OpenBLAS version 0.3.9 in any other setups?

tahorst commented 4 years ago

I got the unit test to pass locally by installing a single-threaded version of OpenBLAS (the instructions from Sherlock with USE_THREAD=0 added) and linking numpy to that version with pip install numpy==1.19.0 --no-binary numpy and the proper .numpy-site.cfg paths:

git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS/
git checkout v0.3.9
USE_THREAD=0 make FC=gfortran
make PREFIX=/opt/OpenBLAS-0.3.9 install
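
For reference, the matching ~/.numpy-site.cfg would look roughly like this (a sketch; the paths just follow the PREFIX above):

[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS-0.3.9/lib
include_dirs = /opt/OpenBLAS-0.3.9/include
runtime_library_dirs = /opt/OpenBLAS-0.3.9/lib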

I think we can either specify instructions to install OpenBLAS with a single thread or use OPENBLAS_NUM_THREADS=1 and that should fix this issue.

1fish2 commented 4 years ago

Did you test OpenBLAS version 0.3.9 in any other setups?

Yes, and lately I've been using 0.3.10 on my Mac and in wcm-runtime/Dockerfile, including (IIRC) in the test runs above.

Initially we had to clone the OpenBLAS repo to get an unreleased bug fix. Now there are downloadable releases up to 0.3.10.

Travis found that building OpenBLAS with NO_AVX2=1 produces different results, so it's better to avoid that when possible, and building on Linux with NO_AVX2=0 inside a Dockerfile produces an Image that ran just fine on Mac! Let's test that with a Cloud Build and if it all works, edit cloud/docker/runtime/Dockerfile and related files.

1fish2 commented 4 years ago

Testing a cloud-built wcm-code using the test-openblas branch plus some adjustments that aren't checked in:

(1) test_openblas_threads.py gets results consistent with pyenv wcEcoli2 on Sherlock:

THREADS                DOT PRODUCT         DIFF FROM 1 THREAD
      1       0.016683805584112754                          0
      2       0.016683805584112754                          0
      3       0.016683805584112754                          0
      4       0.016683805584112754                          0
      5       0.016683805584112754                          0
      6       0.016683805584112754                          0
              0.016683805584112754                          0

That further validates Travis' idea that we can build Docker Images with NO_AVX2=0 on Linux and still run on Mac, and it disproves the OpenBLAS team's hypothesis about where the AVX2 problem is.

(2) That's surprising since the macOS host outside Docker was not getting consistent results (see above) but -- whoa! -- it does now! I don't know what change made that happen, but testing with a Python 3.8.5 virtualenv that has numpy==1.19.0 shows it's not the numpy update.

(3) Inside Docker-for-Mac without the --user $(id -u):$(id -g) option, $(id -u):$(id -g) is 0:0; in the host it's 501:20, and yet it creates files with the right host-side owner in a mapped directory! The --user option was a workaround we needed in Google Compute Engine, but maybe it's irrelevant in Docker Desktop.

@tahorst, does this happen in Docker-for-Windows?

(4) I tried keeping the openblas build directory, then running make tests in that directory inside the container. It doesn't get far:

touch libopenblas_haswellp-r0.3.10.a
make -j 6 -C test all
make[1]: Entering directory '/openblas/OpenBLAS-0.3.10/test'
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64   -o sblat1 sblat1.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../..  -lc
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64   -o dblat1 dblat1.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../..  -lc
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64   -o cblat1 cblat1.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../..  -lc
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64   -o zblat1 zblat1.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../..  -lc
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64   -o sblat2 sblat2.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../..  -lc
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64   -o dblat2 dblat2.o ../libopenblas_haswellp-r0.3.10.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -L/usr/lib/gcc/x86_64-linux-gnu/8 -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/8/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/8/../../..  -lc
/usr/bin/ld: ../libopenblas_haswellp-r0.3.10.a: file not recognized: file truncated
/usr/bin/ld: ../libopenblas_haswellp-r0.3.10.a: file not recognized: file truncated
/usr/bin/ld: ../libopenblas_haswellp-r0.3.10.a: file not recognized: file truncated
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
./libopenblas_haswellp-r0.3.10.a: file not recognized: file truncated
collect2: error: ld returned 1 exit status
...

(5) runParca gets nearly done then prints:

Fitting promoter binding
Killed

and exits with code 137.

==> I had to adjust Docker Desktop Preferences/Resources/Memory from 2GB to 4GB. (Would 3GB suffice?) I'm adding a note to the README.

(6) compareParca.py reports that the parca output from this Docker container is the same as with numpy 1.19.0 but has lots of numeric differences from Mac outside Docker.

tahorst commented 4 years ago

(2) That's surprising since the macOS host outside Docker was not getting consistent results (see above) but -- whoa! -- it does now! I don't know what change made that happen, but testing with a Python 3.8.5 virtualenv that has numpy==1.19.0 shows it's not the numpy update.

The results were inside a Docker container? Do you think there was only 1 CPU available when it was built in the cloud, so when compiling OpenBLAS the max number of threads was set to 1?

@tahorst, does this happen in Docker-for-Windows?

Docker-for-Windows runs through WSL2, which is a real Linux kernel, so unsurprisingly it gets the same behavior we see in other Linux environments and writes files as root if the --user option isn't supplied.

(4) I tried keeping the openblas build directory, then running make tests in that directory inside the container. It doesn't get far:

That's for the new image run on Mac? You still get good results when running the parca/sim though?

==> I had to adjust Docker Desktop Preferences/Resources/Memory from 2GB to 4GB. (Would 3GB suffice?) I'm adding a note to the README.

There's actually already a note about this in docs/README.md:

NOTE: If you encounter memory issues while using Docker Desktop (the default allocated memory is 2GB) causing the simulation to get killed midway, click the Docker icon > Preferences > Advanced > adjust memory to 4GB.

1fish2 commented 4 years ago

(2) That's surprising since the macOS host outside Docker was not getting consistent results (see above) but -- whoa! -- it does now! I don't know what change made that happen, but testing with a Python 3.8.5 virtualenv that has numpy==1.19.0 shows it's not the numpy update.

The results were inside a Docker container?

Results (1) were inside the container (where it thinks there are 6 CPUs -- default Docker allocation), while results (2) were outside the container (where it thinks there are 12 CPUs).

Do you think there was only 1 CPU available when it was built in the cloud so when compiling OpenBLAS, the max number of threads was set to 1?

Could be. Any idea how to tell by examining the /openblas/OpenBLAS-0.3.10 build directory or the libopenblas_haswell-r0.3.10.so file?

(4) I tried keeping the openblas build directory, then running make tests in that directory inside the container. It doesn't get far:

That's for the new image run on Mac?

Yes.

You still get good results when running the parca/sim though?

I just started 2 sim gens. We'll see.

There's actually already a note about this in docs/README.md:

Arg. I'll change it to a setup step so it's harder to miss.

tahorst commented 4 years ago

Could be. Any idea how to tell by examining the /openblas/OpenBLAS-0.3.10 build directory or the libopenblas_haswell-r0.3.10.so file?

Do you have the build log? It should say if it was single-threaded when it was compiled or how many CPUs it was configured for. Not sure how to get that info after compilation...
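
One possible way to interrogate the compiled library directly (a sketch; it assumes the .so exports the usual openblas_get_config / openblas_get_parallel / openblas_get_num_threads entry points, which standard builds do, and the library path is a placeholder):

import ctypes

lib = ctypes.CDLL('/path/to/libopenblas_haswell-r0.3.10.so')  # path inside the container
lib.openblas_get_config.restype = ctypes.c_char_p
print(lib.openblas_get_config().decode())  # build config string, e.g. target and MAX_THREADS
print(lib.openblas_get_parallel())         # 0 = single-threaded, 1 = pthreads, 2 = OpenMP
print(lib.openblas_get_num_threads())      # thread count it would use right now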

1fish2 commented 4 years ago

The build log confirms that OpenBLAS was single threaded: libopenblas_haswell-r0.3.10.a (Single-threading).

I'll test USE_THREAD=0 on Mac to see if that gets results consistent with the others.

In the previous tests inside and outside Docker, the parca and sim outputs differ, although I didn't run analysis plots yet. Here are the behavior metrics, if that helps judge whether the results are good:

docker-numpy-1.19.1:

                         metric                      mode  expected_min  expected_max     value  passes
0                       dnaMass           end_start_ratio       1.98150      1.981600   1.98511   False
1                  doublingTime             doubling_time    2910.00000   2920.000000      2950   False
2            fluxomeCorrelation               correlation       0.63497      0.634980  0.658865   False
3          limiting_metabolites      limiting_metabolites           NaN           NaN        {}   False
4          limiting_metabolites  num_limiting_metabolites       2.00000      2.000000         0   False
5                      mRnaMass           end_start_ratio       2.43290      2.433000   2.18449   False
6                   proteinMass           end_start_ratio       2.11340      2.113500   2.12372   False
7                   proteinMass                       max     342.72000    342.730000   344.604   False
8                   proteinMass                      mean     241.33000    241.340000   242.366   False
9                   proteinMass                       min     162.16000    162.170000   162.264   False
10                  proteinMass                     stdev      51.85900     51.860000   52.4276   False
11         proteome_correlation       schmidt_correlation       0.75000      0.750010  0.746113   False
12         proteome_correlation    wisniewski_correlation       0.61569      0.615700  0.612515   False
13                     rRnaMass           end_start_ratio       2.25150      2.251600   2.22151   False
14  replication_elongation_rate                      mean    1932.40000   1932.500000   1932.46    True
15                     ribosome      mean_active_fraction       0.86770      0.867702  0.870178   False
16                     ribosome        mean_concentration   16186.00000  16187.000000   15907.1   False
17                rnaPolymerase      mean_active_fraction       0.24678      0.246790   0.24749   False
18                rnaPolymerase        mean_concentration    3294.60000   3294.700000   3194.87   False
19            smallMoleculeMass           end_start_ratio       2.07820      2.078300   2.07858   False
20                     tRnaMass           end_start_ratio       2.28240      2.282500   2.25283   False
21  transcriptionElongationRate                      mean   59155.00000  59156.000000   57590.1   False
22    translationElongationRate                      mean      15.79300     15.794000    16.035   False

mac-numpy-1.19.1:

                         metric                      mode  expected_min  expected_max             value  passes
0                       dnaMass           end_start_ratio       1.98150      1.981600           1.97915   False
1                  doublingTime             doubling_time    2910.00000   2920.000000              2944   False
2            fluxomeCorrelation               correlation       0.63497      0.634980          0.658657   False
3          limiting_metabolites      limiting_metabolites           NaN           NaN  {VAL[c], GLT[c]}    True
4          limiting_metabolites  num_limiting_metabolites       2.00000      2.000000                 2    True
5                      mRnaMass           end_start_ratio       2.43290      2.433000           2.17393   False
6                   proteinMass           end_start_ratio       2.11340      2.113500           2.12721   False
7                   proteinMass                       max     342.72000    342.730000           345.171   False
8                   proteinMass                      mean     241.33000    241.340000           242.583   False
9                   proteinMass                       min     162.16000    162.170000           162.264   False
10                  proteinMass                     stdev      51.85900     51.860000           52.5483   False
11         proteome_correlation       schmidt_correlation       0.75000      0.750010          0.750456   False
12         proteome_correlation    wisniewski_correlation       0.61569      0.615700          0.615472   False
13                     rRnaMass           end_start_ratio       2.25150      2.251600           2.21675   False
14  replication_elongation_rate                      mean    1932.40000   1932.500000           1932.46    True
15                     ribosome      mean_active_fraction       0.86770      0.867702          0.872894   False
16                     ribosome        mean_concentration   16186.00000  16187.000000           15693.1   False
17                rnaPolymerase      mean_active_fraction       0.24678      0.246790          0.246992   False
18                rnaPolymerase        mean_concentration    3294.60000   3294.700000           3195.17   False
19            smallMoleculeMass           end_start_ratio       2.07820      2.078300           2.07505   False
20                     tRnaMass           end_start_ratio       2.28240      2.282500           2.25239   False
21  transcriptionElongationRate                      mean   59155.00000  59156.000000           57537.5   False
22    translationElongationRate                      mean      15.79300     15.794000           16.3066   False

1fish2 commented 4 years ago

@tahorst do you have a test case for different output with NO_AVX2=1?

tahorst commented 4 years ago

What kind of output do you want for the test case?

1fish2 commented 4 years ago

Maybe more than one test case.

tahorst commented 4 years ago

I'm not sure exactly where the differences are coming from, so I don't know if we can create a simple test case for the NO_AVX2 difference, but I can look into it.

1fish2 commented 4 years ago

We could brainstorm debugging techniques. E.g. run w/ and w/o NO_AVX2 in different Docker containers and compare a sequence of data snapshots.

tahorst commented 4 years ago

You're thinking of looking at different points while the parca is running? My guess from the diffs that show up when comparing the final sim_data objects from the different Docker containers w/ and w/o NO_AVX2 is that the difference might arise when we fit gene expression, so it could be the Theano functions or the solver there, but I would have to look into it to confirm. Comparing sim_data before and after that step in different containers could be a good way to isolate a simpler test case.

1fish2 commented 4 years ago

Good idea.

Top level questions:

Findings on Sherlock (pyenv wcEcoli3-staging):

  1. The ~/.numpy-site.cfg file needs a runtime_library_dirs = ... line to install numpy in a way that it reliably loads the named openblas at runtime.
    • Sherlock now has an environment module for openblas 0.3.10 but the module has to be loaded for numpy to load it at runtime, which clashes with the old openblas module. Our compiled openblas 0.3.9 doesn't have that limitation.
  2. Installing scipy linked to that openblas needs LDFLAGS="-shared $LDFLAGS", otherwise the compiler will fail with "crt1.o in function _start undefined reference to main".
  3. A login node now has 32 CPU cores and (with OPENBLAS_NUM_THREADS=1) it runs test_library_performance faster than a 5-year-old compute node with 1 allocated CPU.

Findings in Docker:

  1. Always build openblas and build numpy and scipy linked to it.
    • Document how we reached that conclusion...
    • Build openblas with NO_AVX2=0 in Docker-for-Linux, NO_AVX2=1 in Docker-for-Mac, regardless of where it'll run, because NO_AVX2=0 computes more consistent results and runs fine in Docker-for-Mac (contrary to the openblas team's hypothesis).
    • Set runtime_library_dirs for good measure and as a better reference for manual pyenv setup?

Findings on Mac & Linux?

Also:

1fish2 commented 4 years ago

The different ways to install OpenBLAS and NumPy × different runtime environments produce at least 7 different equivalence classes of Parca outputs. There does not seem to be an installation approach (even if the instructions are environment-specific) to get cross-platform consistent Parca output.

openblas variations

See https://github.com/xianyi/OpenBLAS/issues/2244#issuecomment-696510557

It could be useful to investigate where these Parca computations diverge, but I'm not inclined to spend the time on that.

1fish2 commented 3 years ago

The bug fix #980 for portability issue #979 raises this question: Is numpy's random number generator returning different results between @ggsun's local machine and Sherlock?

What could cause that?

There are uses of the default random instance in prototypes/, tests, and sim variants (those should be OK), analysis scripts (problematic when analyses run in parallel but won't affect the sims), and polymerize (that's concerning).

When moving to Python 3, we also moved from numpy 1.14.6 to 1.19.2. numpy.random did change in 1.17.0 (see NEP 19), mainly by adding a new generator PCG64 "which has better statistical properties than the legacy MT19937 used in RandomState". (It's faster, too.)

NEP 19 also says RandomState overpromised on compatibility.
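
For context, a quick sketch of the two interfaces (explicitly seeded instances of either kind are reproducible; it's the shared module-level state that's fragile):

import numpy as np

# Legacy interface: a seeded RandomState still yields a bit-identical MT19937 stream.
rs = np.random.RandomState(42)
print(rs.random_sample(3))

# New-style Generator (PCG64), added in numpy 1.17 per NEP 19.
rng = np.random.default_rng(42)
print(rng.random(3))

# The module-level default instance is the risky one: any other code that
# touches np.random in the same process shifts this stream.
np.random.seed(42)
print(np.random.rand(3))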

1fish2 commented 3 years ago

runSim --intermediates-directory and comparePickles are useful tools to help localize this problem.

Also see #1097

1fish2 commented 3 years ago

The CACM article Keeping Science on Keel When Software Moves describes techniques and tools that a climate modeling project uses to debug reproducibility problems in their large models.

The article lacks key details, but the FLiT tool's technique as described might be relevant.

Also, all the examples in the article came down to compile optimizations that used Fused Multiply-Add (FMA) floating point instructions.

tahorst commented 3 years ago

It's nice to see there's a tool to test for the floating point issues and it's interesting that other people have run into these problems. Do you think there are compiler flags for BLAS (I'm assuming that's where this is coming from but could be other dependencies) that make it slightly less efficient but reproducible? I also wonder if we can truncate/round floating point results in our code where these differences come up so our precision is less but we get above the machine level noise.

1fish2 commented 3 years ago

It'd be good to brainstorm on this problem.

More observations:

I also wonder if we can truncate/round floating point results in our code where these differences come up so our precision is less but we get above the machine level noise.

Interesting idea! It might require localizing the cause to particular pieces of code, but then it could erase some hardware variability and some order-of-evaluation variability.
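
A minimal sketch of that truncation idea, assuming we round to a fixed number of significant digits to absorb the ~1e-16 noise (the digit count is a placeholder, and where to apply it would still need working out):

import numpy as np

def round_sig(x, sig=12):
    """Round each element to `sig` significant digits so sub-epsilon
    differences between platforms collapse to the same value."""
    x = np.asarray(x, dtype=float)
    with np.errstate(divide='ignore'):
        exponent = np.floor(np.log10(np.abs(x)))
    exponent = np.where(np.isfinite(exponent), exponent, 0.0)  # zeros stay zero
    scale = 10.0 ** (sig - 1 - exponent)
    return np.round(x * scale) / scale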

This option sounds promising for performance in Docker but riskier for reproducibility:

OpenBLAS can be built for multiple targets with runtime detection of the target cpu by specifying DYNAMIC_ARCH=1 in Makefile.rule, on the gmake command line or as -DDYNAMIC_ARCH=TRUE in cmake.

1fish2 commented 3 years ago

Another variable is the C++ compiler: clang on macOS vs. gcc on Ubuntu, Docker, and CentOS (Sherlock). We could set up clang (how?) for use by pyenv, pip, Aesara, Cython, & Numba.

What triggered this realization is that Santiago and Niels hit a problem on Ubuntu where the Parca calls Aesara, which fails to compile & link some code. A workaround is to install Python using PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 3.8.7.