dougshidong / PHiLiP

Parallel High-Order Library for PDEs through hp-adaptive Discontinuous Galerkin methods

High order grid memory should mainly be from storing the volume nodes and not from creating FEValues, MappingQGeneric and Quadrature objects. #221

Open · AlexanderCicchino opened this issue 1 year ago

AlexanderCicchino commented 1 year ago

The last major memory issue for curvilinear overintegration comes from creating the MappingQGeneric and FEValues objects for get_position_vector in high_order_mesh. This could be fixed by having an option to extract the volume_nodes vector from the mesh->grids->warp function.

dougshidong commented 1 year ago

Can you clarify what the issue is? HighOrderGrid memory should scale with the grid degree and not with the quadrature; the quadrature order will usually be larger than the grid degree. Furthermore, HighOrderGrid doesn't care about quadratures at all; for all it knows, it could be used with a non-finite-element solver.

AlexanderCicchino commented 1 year ago

Yes, I agree with the above. For clarification, the issue is in HighOrderGrid's get_position_vector. On line 250 it creates a quadrature based off the grid nodes, then creates MappingQGeneric and FEValues objects. As the polynomial order increases, these objects end up using about as much memory as the global solution. I changed the title of the issue because I agree it didn't sound right: it's not necessarily DG's quadrature, but rather that it creates these dim-dimensional quadrature-based objects.
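For context, here is a minimal sketch of the kind of code being discussed: a Quadrature built from the grid element's unit support points, a MappingQGeneric of the grid degree, and an FEValues used to read off physical node positions cell by cell. The names, setup, and degree below are illustrative assumptions, not PHiLiP's actual get_position_vector.

```cpp
#include <deal.II/base/point.h>
#include <deal.II/base/quadrature.h>
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/fe/fe_system.h>
#include <deal.II/fe/fe_values.h>
#include <deal.II/fe/mapping_q_generic.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/grid/tria.h>

#include <vector>

int main()
{
  constexpr unsigned int dim = 3;
  const unsigned int grid_degree = 4; // illustrative; the discussion uses p = 20

  dealii::Triangulation<dim> triangulation;
  dealii::GridGenerator::hyper_cube(triangulation);

  // dim-component finite element representing the grid coordinates.
  const dealii::FESystem<dim> fe_system(dealii::FE_Q<dim>(grid_degree), dim);
  dealii::DoFHandler<dim> dof_handler(triangulation);
  dof_handler.distribute_dofs(fe_system);

  // The objects in question: their footprint is fixed by the grid degree,
  // not by the number of cells in the triangulation.
  const dealii::Quadrature<dim> quadrature(
      fe_system.base_element(0).get_unit_support_points());
  const dealii::MappingQGeneric<dim> mapping(grid_degree);
  dealii::FEValues<dim> fe_values(mapping, fe_system, quadrature,
                                  dealii::update_quadrature_points);

  for (const auto &cell : dof_handler.active_cell_iterators())
    {
      fe_values.reinit(cell);
      // Physical positions of the grid nodes on this cell; a routine like
      // get_position_vector would copy these into the volume_nodes vector.
      const std::vector<dealii::Point<dim>> &node_positions =
          fe_values.get_quadrature_points();
      (void)node_positions;
    }
}
```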

dougshidong commented 1 year ago

How did you assess that those were consuming a lot of memory?

From what I see, line 250 is a single Quadrature object. It doesn't even have weights in it. So if you were to use a grid with a trillion cells, with a HighOrderGrid of p=99, this would still just create a single deal.II Quadrature containing 100^3 quadrature points: 1,000,000 points with 3 coordinates, each of size 8 bytes. That would be a whopping ~24 megabytes, which is the least of your worries at that scale. Let me know if your understanding is different.
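Written out, the back-of-envelope arithmetic above (a sketch that assumes each point stores dim doubles and ignores weights, not deal.II's exact storage layout):

```cpp
#include <cstddef>
#include <iostream>

int main()
{
  const std::size_t dim         = 3;
  const std::size_t n_points_1d = 100;                                     // p + 1 at p = 99
  const std::size_t n_points    = n_points_1d * n_points_1d * n_points_1d; // 1,000,000
  const std::size_t bytes       = n_points * dim * sizeof(double);         // 24,000,000
  std::cout << bytes / 1.0e6 << " MB\n";                                   // ~24 MB
}
```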

Similarly for MappingQGeneric and FEValues: they only store information about the reference cell and compute things on one physical cell at a time. Therefore, their memory doesn't actually scale with the global solution. Do let me know if that's erroneous as well; it's also not straightforward to do a deep dive into deal.II code.
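One way to sanity-check this, assuming the memory_consumption() members that deal.II provides on Quadrature, FiniteElement, and FEValues are a fair proxy: the reported sizes depend only on the polynomial degree and stay the same no matter how many cells the mesh has. A minimal sketch:

```cpp
#include <deal.II/base/quadrature.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/fe/fe_values.h>
#include <deal.II/fe/mapping_q_generic.h>

#include <iostream>

int main()
{
  constexpr unsigned int dim = 3;
  const unsigned int degree  = 8; // illustrative

  const dealii::FE_Q<dim>            fe(degree);
  const dealii::Quadrature<dim>      quadrature(fe.get_unit_support_points());
  const dealii::MappingQGeneric<dim> mapping(degree);
  dealii::FEValues<dim> fe_values(mapping, fe, quadrature,
                                  dealii::update_quadrature_points);

  // None of these numbers change if the triangulation this is later used on
  // goes from 10 cells to 10 billion cells.
  std::cout << "Quadrature : " << quadrature.memory_consumption() << " bytes\n"
            << "FE_Q       : " << fe.memory_consumption()         << " bytes\n"
            << "FEValues   : " << fe_values.memory_consumption()  << " bytes\n";
}
```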

Also, memory use scaling proportionally to the solution is usually not your main issue. If you have 1000x the ndofs, you will likely need about 1000x the number of processors, resulting in about 1000x the memory. However, if it works on a small case, it will usually work on a bigger case, because you should be scaling both your problem size and your resources proportionally. Of course, if you have multiple useless copies of something, then we should try to avoid them.

For reference, this is an old question I had asked in the deal.II groups: https://groups.google.com/g/dealii/c/S78RCluEMJI/m/FZSMphxsGQAJ

AlexanderCicchino commented 1 year ago

I used valgrind. Theoretically I would agree with you, but for the 3D objects it seems that deal.II uses a lot of memory for them; please see the massif file attached. I agree with you in terms of flops: they compute all the desired quantities when called within the cell, but creating the 3D objects requires a significant amount of memory compared to the solution. For example, the MappingQGeneric in HighOrderMesh uses 1005.2 MiB and the Quadrature in HighOrderMesh uses 981.1 MiB for this example. The other large memory uses come from the 3D Quadrature and 3D FE_DGQArbitraryNodes objects in DGBase. massif.out.zip

dougshidong commented 1 year ago

I'll guess that this was run with a polynomial degree of ~28?

About 1 GiB of memory is used by MappingQGeneric here. This is because deal.II seems to precompute a bunch of things. Note that this is independent of the number of cells since it's all reference-cell data; the cost will not increase even if you go from 10 cells to 10 billion cells. I believe deal.II has made some improvements to this class in the past few years. I didn't check what compute_support_point_weights_on_hex does, or why it scales with p^5.

The other ones are associated with FESystem: about 4 of them at 245.3 MiB a pop. Again, FESystem will not scale with the number of cells. It's worth investigating why we need 4 of them; we need at least 2, one for the solution and one for the grid. There's probably a useless copy somewhere, but then again 245 MiB at p=28 is not bad as long as we don't have something like 20 of them.

Overall, I don't see any big red flags. At p=28, this kind of overhead memory is to be expected; at that point, we're just doing p-FEM. Unless we start replacing deal.II code with code optimized for p-FEM, I'm not sure what the solution is here.

Do we expect to run p=28? If we do, can we just run it on a system with a moderately decent amount of RAM?

AlexanderCicchino commented 1 year ago

It was run at p=20. The high polynomial order was needed for the scaling tests of our solver because it hits the asymptotic range after about p=15. As for the 4 FESystems: there's the FECollection used in DG weak, an FECollection of Lagrange elements that isn't used anywhere, and the other ones used in DG strong are all 1D FE collections (which are very small in memory). Using the operators class, we don't need any 3D FESystems, 3D Quadratures, etc. It's not a pressing issue, but these fixes will help us demonstrate the strong scaling of the code along with its memory usage. I think the fix will be to gradually switch to using 1D finite elements with operators where applicable.
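As a rough sketch of the memory argument for 1D operators (illustrative only; PHiLiP's operators class is not shown, and FE_DGQ with Gauss quadrature is assumed): the FEValues shape-value table holds (p+1)^2 entries in 1D versus (p+1)^6 in 3D, so the 1D objects stay tiny as p grows.

```cpp
#include <deal.II/base/quadrature_lib.h>
#include <deal.II/fe/fe_dgq.h>
#include <deal.II/fe/fe_values.h>

#include <iostream>

int main()
{
  const unsigned int degree = 10; // illustrative; at p = 20 the 3D table alone
                                  // is (21^3)^2 * 8 bytes, roughly 0.7 GB

  const dealii::FE_DGQ<1> fe_1d(degree);
  const dealii::QGauss<1> quad_1d(degree + 1);
  dealii::FEValues<1> fe_values_1d(fe_1d, quad_1d, dealii::update_values);

  const dealii::FE_DGQ<3> fe_3d(degree);
  const dealii::QGauss<3> quad_3d(degree + 1);
  dealii::FEValues<3> fe_values_3d(fe_3d, quad_3d, dealii::update_values);

  // (p+1)^2 versus (p+1)^6 stored shape values.
  std::cout << "1D FEValues: " << fe_values_1d.memory_consumption() << " bytes\n"
            << "3D FEValues: " << fe_values_3d.memory_consumption() << " bytes\n";
}
```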