CASL / Futility

VERA Fortran Utilities

Native GMRES #231

Closed · mattkab2 closed 4 years ago

mattkab2 commented 4 years ago

Improves the native GMRES capability of Futility.
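
For reviewers coming to this PR cold, below is a minimal sketch of a single GMRES(m) cycle (modified Gram-Schmidt Arnoldi plus Givens rotations), i.e., the class of algorithm this capability provides. It is an illustrative standalone routine with assumed names, and the dense `matmul` calls stand in for the solver's actual matrix-vector product; it is not the Futility implementation.

```fortran
! Hypothetical sketch, not the Futility source: one GMRES(m) cycle with
! modified Gram-Schmidt Arnoldi and Givens rotations.  The dense matmul
! calls stand in for the solver's actual matrix-vector product.
subroutine gmres_sketch(n, m, A, b, x, tol)
  implicit none
  integer, intent(in)    :: n, m
  real(8), intent(in)    :: A(n,n), b(n), tol
  real(8), intent(inout) :: x(n)
  real(8) :: V(n,m+1), H(m+1,m), cs(m), sn(m), g(m+1), y(m), r(n)
  real(8) :: beta, tmp
  integer :: i, j, k

  r = b - matmul(A, x)                    ! initial residual
  beta = sqrt(dot_product(r, r))
  if (beta <= tol) return
  V(:,1) = r/beta
  H = 0.0d0
  g = 0.0d0
  g(1) = beta

  do j = 1, m
    ! Arnoldi step with modified Gram-Schmidt orthogonalization
    r = matmul(A, V(:,j))
    do i = 1, j
      H(i,j) = dot_product(V(:,i), r)
      r = r - H(i,j)*V(:,i)
    end do
    H(j+1,j) = sqrt(dot_product(r, r))
    if (H(j+1,j) > 0.0d0) V(:,j+1) = r/H(j+1,j)

    ! Apply previous Givens rotations to the new Hessenberg column
    do i = 1, j-1
      tmp      =  cs(i)*H(i,j) + sn(i)*H(i+1,j)
      H(i+1,j) = -sn(i)*H(i,j) + cs(i)*H(i+1,j)
      H(i,j)   = tmp
    end do
    ! New rotation eliminates H(j+1,j); g(j+1) tracks the residual norm
    tmp      = sqrt(H(j,j)**2 + H(j+1,j)**2)
    cs(j)    = H(j,j)/tmp
    sn(j)    = H(j+1,j)/tmp
    H(j,j)   = tmp
    H(j+1,j) = 0.0d0
    g(j+1)   = -sn(j)*g(j)
    g(j)     =  cs(j)*g(j)
    if (abs(g(j+1)) <= tol) exit
  end do
  k = min(j, m)

  ! Back-substitution on the upper-triangular H, then update the solution
  do i = k, 1, -1
    y(i) = (g(i) - dot_product(H(i,i+1:k), y(i+1:k))) / H(i,i)
  end do
  x = x + matmul(V(:,1:k), y(1:k))
end subroutine gmres_sketch
```

The dense kernels in a routine like this are exactly where the linked BLAS library and vector instructions discussed later in the thread show up in the timing.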

aarograh commented 4 years ago

@mattkab2 Any plans to address review comments on this? The pull request is getting a bit stale.

mattkab2 commented 4 years ago

@aarograh There have been, but things keep piling up. Since I'm working on a writeup for this, it's a little higher priority than usual and I've been bringing it up to recent master. Many of these comments have been addressed in past commits, so once it is no longer stale I will ping the commenters to get their feedback on the changes.

mattkab2 commented 4 years ago

@djabaay @nfherrin @aarograh @HendersonSC If you guys could go over your comments, I believe all of them have now been addressed. Let me know if you have further questions.

mattkab2 commented 4 years ago

With the latest commits, the native CMFD now runs 15% faster than PETSc for my test problem (5a-2D), with a larger speedup expected in 3D problems. There is also a memory reduction (100-200 MB).

mattkab2 commented 4 years ago

@HendersonSC @djabaay Could you guys take a look at your comments and see if they've been addressed?

stimpsonsg commented 4 years ago

@mattkab2 did you do a scaling study on 5a-2D? ...if not, it would be worthwhile.

How many cores did you see a 15% speedup on (16?)?

Be sure to keep the ticket updated with any results you have: https://vminfo.casl.gov/trac/casl_phi_kanban/ticket/5855

mattkab2 commented 4 years ago

@stimpsonsg I did a scaling study but only for 1 node; I had some difficulties with the new SLURM system for multi-node jobs. The 15% speedup remained constant from 8-24 processors.

Thanks for reminding me, I'll make sure to update it.

aarograh commented 4 years ago

@mattkab2 In your scaling study, were you using the -march=native flag to compile? If so, did you perform the study without that flag as well?

stimpsonsg commented 4 years ago

Also, was that with the same number of outer iterations? You mentioned that the preconditioning isn't there yet, so the number of iterations could vary a bit in theory. RBSOR, for example, runs faster per iteration but is slower to converge, yet it still showed a speedup even with the additional iterations... just worth noting, to really understand whether the gains come from parallel efficiency or something else about the solver.

mattkab2 commented 4 years ago

@aarograh The -march=native flag was used on the build. The same build was used to generate both the native solver results and the PETSc results. Without the flag, the native solver is not competitive, because it was written to take advantage of fused multiply-add and vector instructions.
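
For a concrete illustration of that point (a sketch, not code from this PR): a dense update loop like the one below is where -O3 -march=native lets the compiler emit fused multiply-add and wider vector instructions, while a generic build can only target the baseline instruction set.

```fortran
! Illustrative kernel only: an AXPY-style update of the kind that appears
! inside the GMRES orthogonalization.  With -O3 -march=native the compiler
! can vectorize this loop and fuse the multiply and add into FMA
! instructions; a generic build cannot assume those instructions exist.
subroutine axpy_update(n, alpha, x, y)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: alpha, x(n)
  real(8), intent(inout) :: y(n)
  integer :: i
  do i = 1, n
    y(i) = y(i) + alpha*x(i)   ! one fused multiply-add per element
  end do
end subroutine axpy_update
```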

@stimpsonsg The number of outers is the same. Preconditioning does not change the number of outer iterations, because the only thing changing in this code is the method of solving the CMFD matrix; the matrix itself is not altered in any way. This is part of the motivation for this work: it can serve as a drop-in replacement for any variant of CMFD, even multilevel schemes.

stimpsonsg commented 4 years ago

Okay, it's good that the outers are consistent. Without preconditioning, if you hit the max inner iteration limit, you could run into a situation where you need more outers.

Do you have actual numbers without -march=native?

mattkab2 commented 4 years ago

@stimpsonsg I only did it for 8 processors and discarded the results, but I can regenerate them fairly easily. I'll include it when I update the ticket.

aarograh commented 4 years ago

@mattkab2 So my concern with using that flag is this: the comparisons to PETSc are not fair if PETSc was not also compiled using that flag. Since we typically link against PETSc as a TPL, I'm guessing that's the case. Did you build your own version of PETSc using -march=native? Do you know if they use similar compiler flags to improve optimizations? I'm guessing they don't, since clusters can frequently have small variations in hardware.

mattkab2 commented 4 years ago

@aarograh You are right, I suppose this is not a fair comparison then. I recall building a separate version of PETSc for the much older tests on Flux (old cluster), but I forgot that step for the new tests on Lighthouse (new cluster).

Thanks for pointing that out. I'll have to go build it and see what results we get.

mattkab2 commented 4 years ago

@aarograh I do want to be clear, though, that there is reason to believe the native solver will still be faster. It makes extensive use of BLAS/LAPACK, which, like PETSc, would also not have received the -march=native flag. I'm building these libraries now and should have results by early tomorrow.
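
To make the BLAS/LAPACK dependence concrete (a hypothetical sketch, not the Futility source): the pieces of a GMRES cycle map onto level-1 and level-2 BLAS calls such as ddot, daxpy, dnrm2, and dgemv, so the solver inherits whatever machine-specific optimizations the linked BLAS library was built with.

```fortran
! Hypothetical sketch, not the Futility source: GMRES building blocks
! expressed through BLAS.  The routines come from whatever BLAS/LAPACK
! library the build links, so timing depends on how that library was built
! (reference BLAS vs. an OpenBLAS compiled with -march=native).
subroutine gmres_blas_pieces(n, k, V, y, x, w, h)
  implicit none
  integer, intent(in)    :: n, k
  real(8), intent(in)    :: V(n,k+1), y(k)
  real(8), intent(inout) :: x(n), w(n)
  real(8), intent(out)   :: h(k+1)
  real(8), external :: ddot, dnrm2
  integer :: i
  ! Modified Gram-Schmidt orthogonalization: level-1 BLAS (ddot/daxpy)
  do i = 1, k
    h(i) = ddot(n, V(1,i), 1, w, 1)
    call daxpy(n, -h(i), V(1,i), 1, w, 1)
  end do
  h(k+1) = dnrm2(n, w, 1)
  ! Solution update x = x + V(:,1:k)*y: level-2 BLAS (dgemv)
  call dgemv('N', n, k, 1.0d0, V, n, y, 1, 1.0d0, x, 1)
end subroutine gmres_blas_pieces
```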

mattkab2 commented 4 years ago

@aarograh @HendersonSC Getting the build process to link to the -march=native compilations of PETSc/LAPACK rather than the cluster defaults was trickier than expected, but the overall results are very similar. The speedup is not quite as large, but still ranges from 8-12%. Memory savings are not affected and are still about 150 MB per core in the 8-processor case.

The decline in speedup, however, is not due to PETSc getting faster but to the native solver becoming slower. I do not know the reason for this, but I am looking into it.

mattkab2 commented 4 years ago

Below is a table of the results for the 5a-2D test case on the Lighthouse cluster at Michigan. Speedup is expected to be better for 3D cases and on architectures Skylake or newer (Lighthouse is Haswell).

[Image: speedupFactor (table of speedup results for 5a-2D)]

The "unoptimized" here means no machine-specific optimizations (march=native). The top line is the standard MPACT build we have on Lighthouse, which shows an improvement even without these enabled. The two lines that follow are MPACT builds compiled with PETSc and linked to one of two linear algebra packages (OpenBLAS or LAPACK). In these cases, all packages were compiled with machine-specific optimizations. OpenBLAS is a good candidate for pushing results even further.

aarograh commented 4 years ago

@mattkab2 Thanks for posting the results. Make sure they find their way to the CASL ticket too.

@HendersonSC @nfherrin If you guys could resolve anything that has been addressed, please do so. If it still needs to be addressed, comment to let Matt know. I'd like to get this pull request finished up before it gets stale again.

aarograh commented 4 years ago

@mattkab2 I think everything here seems reasonable. Tests are passing in MPACT with your most recent Futility head? If so I'll merge.

mattkab2 commented 4 years ago

@aarograh Yes, all heavies in MPACT that I can run are passing with these changes. Same with unused variables/serial debug.