erdc / proteus

A computational methods and simulation toolkit
http://proteustoolkit.org
MIT License

Vectorization of RANS2P2D #1232

Open · JohanMabille opened 4 years ago

JohanMabille commented 4 years ago

Mandatory Checklist

Please ensure that the following criteria are met:

As a general rule of thumb, try to follow PEP8 guidelines.

Description

codecov[bot] commented 4 years ago

Codecov Report

Merging #1232 (b04f9c3) into main (7f4f32b) will increase coverage by 5.18%. The diff coverage is n/a.

⚠️ Current head b04f9c3 differs from pull request most recent head 6dbd9e2. Consider uploading reports for the commit 6dbd9e2 to get more accurate results.

@@            Coverage Diff             @@
##             main    #1232      +/-   ##
==========================================
+ Coverage   47.56%   52.74%   +5.18%     
==========================================
  Files          90      531     +441     
  Lines       71776   109533   +37757     
==========================================
+ Hits        34140    57777   +23637     
- Misses      37636    51756   +14120     
Impacted Files                                  Coverage Δ
proteus/NumericalSolution.py                    70.73% <0.00%> (-7.41%) ↓
proteus/mprans/RDLS.py                          66.98% <0.00%> (-7.40%) ↓
proteus/Archiver.py                             31.64% <0.00%> (-4.55%) ↓
proteus/TwoPhaseFlow/TwoPhaseFlowProblem.py     92.96% <0.00%> (-2.84%) ↓
proteus/Gauges.py                               93.58% <0.00%> (-1.19%) ↓
proteus/mprans/BodyDynamics.py                  85.73% <0.00%> (-0.74%) ↓
proteus/iproteus.py                             24.53% <0.00%> (-0.63%) ↓
proteus/default_so.py                           90.90% <0.00%> (-0.40%) ↓
proteus/LinearSolvers.py                        57.83% <0.00%> (-0.30%) ↓
proteus/Profiling.py                            47.15% <0.00%> (-0.28%) ↓
... and 462 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update bf2bf66...6dbd9e2.

JohanMabille commented 4 years ago

@cekees @zhang-alvin @tridelat This is ready for review. I haven't replaced the data()[...] calls in the arguments passed to the CompKernel methods, since I will add a new class that accepts xtensor objects (but I will do that in a dedicated PR).

Can you confirm that this does not hurt performance?
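For context, the change under review swaps raw-pointer indexing for shape-aware xtensor containers. Below is a minimal sketch of that pattern, not the actual RANS2P2D code: the names (`u`, `eN`, `nDOF`, `old_style`, `new_style`) are illustrative, and `xt::xtensor` stands in for the `xt::pyarray` bindings (from xtensor-python) that the PR actually passes into the kernels.

```cpp
// Hedged sketch of the indexing change; illustrative names only.
#include <xtensor/xtensor.hpp>
#include <cstddef>
#include <iostream>

// Before: raw pointer plus manual offset arithmetic.
double old_style(const double* u, std::size_t nDOF, std::size_t eN, std::size_t j)
{
    return u[eN * nDOF + j];
}

// After: a shape-aware container; the offset arithmetic is implicit.
double new_style(const xt::xtensor<double, 2>& u, std::size_t eN, std::size_t j)
{
    return u(eN, j);
}

int main()
{
    xt::xtensor<double, 2> u = {{1.0, 2.0}, {3.0, 4.0}};
    // CompKernel methods still take raw pointers, hence the remaining
    // data()[...]-style calls this PR leaves in place:
    std::cout << old_style(u.data(), u.shape(1), 1, 0) << "\n"; // prints 3
    std::cout << new_style(u, 1, 0) << "\n";                    // prints 3
}
```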

zhang-alvin commented 4 years ago

I ran a cutfem-based 2D case using this branch and the master branch. There was no major difference in the running times.

cekees commented 4 years ago

> @cekees @zhang-alvin @tridelat This is ready for review. I haven't replaced the data()[...] calls in the arguments passed to the CompKernel methods, since I will add a new class that accepts xtensor objects (but I will do that in a dedicated PR).
>
> Can you confirm that this does not hurt performance?

Nice work! @jhcollins you might have some parallel jobs set up where you could do a timing comparison as well. My allocations on HPC are not ready yet, but I'll test some compute-intensive jobs on macOS and Linux.

@JohanMabille did you make this conversion by hand or did you write a python script? If via script, it would be nice if you could add that to the scripts directory for future use.

JohanMabille commented 4 years ago

I did this one by hand because I wanted to see if I could add other simplifications (like replacing initialization loops). I can work on a Python script for the other files.
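As an illustration of the kind of "initialization loop" replacement mentioned here, a hedged sketch follows; the names (`elementResidual`, `nDOF`, `init_old`, `init_new`) are hypothetical, not the actual RANS2P2D code.

```cpp
// Hedged sketch of the initialization-loop simplification.
#include <xtensor/xtensor.hpp>
#include <cstddef>

// Before: zeroing storage with an explicit loop over a raw pointer.
void init_old(double* elementResidual, std::size_t nDOF)
{
    for (std::size_t i = 0; i < nDOF; ++i)
        elementResidual[i] = 0.0;
}

// After: a single shape-aware fill on the container.
void init_new(xt::xtensor<double, 1>& elementResidual)
{
    elementResidual.fill(0.0);
}

int main()
{
    xt::xtensor<double, 1> r = {1.0, 2.0, 3.0};
    init_new(r);                  // r is now {0, 0, 0}
    init_old(r.data(), r.size()); // same effect via the old pattern
}
```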

jhcollins commented 4 years ago

@cekees do you want the parallel timing comparison using a mesh-conforming setup or cutfem like Alvin?

cekees commented 4 years ago

> @cekees do you want the parallel timing comparison using a mesh-conforming setup or cutfem like Alvin?

Sorry, just saw this. I think we need to verify the performance on something we've run a lot, and load up the cores with mesh nodes. Maybe a dambreak or one of your wave flume simulations, run at 2 or 3 core counts so you get roughly 1000, 2000, and 4000 vertices per core. In 2D you can likely go much higher, more like 20,000 vertices per core.

If you run with --profiling you should get a list of the top 20 functions. Typically the residual and jacobian for RANS2P will make it onto the list. The PETSc solve and preconditioner setup should be the top costs, in the 80-90% range, and below that we should see the calculateResidual and calculateJacobian functions. If you have a go-to FSI simulation, like a floating caisson with ALE, that would be handy because it tests more of the functionality.

cekees commented 4 years ago

My timings are looking great, @JohanMabille. I'll merge this tomorrow once a few large jobs have run on HPC platforms from Cray and SGI and I've confirmed that the results are identical and the timings equivalent. So far I see some cases where the new implementation appears faster, but that may just be load fluctuation of some kind (though these tests are done on dedicated nodes).

cekees commented 4 years ago

@JohanMabille and @jhcollins I verified that the numerical results are essentially identical on a 2D dambreak (two-phase) and 2D FSI (two-phase with mesh deformation/ALE). There are some differences on the order of 1e-22, which I suspect have to do with the compiler taking different paths at the aggressive -O3 optimization level. For both a "standard load" of 2000 vertices per core and a heavier load of about 10,000 vertices per core, the new indexing is actually slightly faster. @jhcollins let me know if you are able to identify the issue where you found non-trivial differences in the computed solutions. I tested on a Cray XC40 with gnu 7.3.0 compilers.