firedrakeproject / firedrake

Firedrake is an automated system for the portable solution of partial differential equations using the finite element method (FEM)
https://firedrakeproject.org

BUG: `-log_view` reports mismatch between objects created and destroyed when running with multiple procs #3629

Open lindsayad opened 2 weeks ago

lindsayad commented 2 weeks ago

Describe the bug

Object Type          Creations   Destructions. Reports information only for process 0.

--- Event Stage 0: Main Stage

           Container     8              5
           Index Set   170            149
   IS L to G Mapping    18             15
             Section    70             61
   Star Forest Graph    75             63
              Vector    21              7
              Matrix    14             11
      Preconditioner     1              1
       Krylov Solver     1              1
     DMKSP interface     1              0
                SNES     1              1
              DMSNES     1              0
    GraphPartitioner     4              3
    Distributed Mesh    14              6
            DM Label    47             34
     Discrete System    22             14
           Weak Form    22             14

Steps to Reproduce

Here is the example I'm running:

from firedrake import *                                                                                                   
from petsc4py import PETSc                                                                                                
import gc                                                                                                                 

def run():                                                                                                                
    size = 3                                                                                                              

    # Create mesh                                                                                                         
    mesh = UnitSquareMesh(2 ** size, 2 ** size, quadrilateral=quadrilateral)                                              
    x = SpatialCoordinate(mesh)                                                                                           

    # Define function spaces and mixed (product) space                                                                    
    if quadrilateral:                                                                                                     
        BDM = FunctionSpace(mesh, "RTCF", 1)                                                                              
    else:                                                                                                                 
        BDM = FunctionSpace(mesh, "BDM", 1)                                                                               
    DG = FunctionSpace(mesh, "DG", 0)                                                                                     
    W = BDM * DG                                                                                                          

    # Define trial and test functions                                                                                     
    sigma, u = TrialFunctions(W)                                                                                          
    tau, v = TestFunctions(W)                                                                                             

    # Define source function                                                                                              
    f = Function(DG).interpolate(-2*(x[0]-1)*x[0] - 2*(x[1]-1)*x[1])                                                      

    # Define variational form                                                                                             
    a = (inner(sigma, tau) + inner(u, div(tau)) + inner(div(sigma), v))*dx                                                
    L = - inner(f, v)*dx                                                                                                  

    # Compute solution                                                                                                    
    w = Function(W)                                                                                                       
    solve(a == L, w, solver_parameters=parameters)                                                                        

    PETSc.garbage_cleanup(PETSc.COMM_SELF)                                                                                
    PETSc.garbage_cleanup(mesh._comm)                                                                                     
    gc.collect()                                                                                                          

if __name__ == "__main__":                                                                                                
    run()                                                                                                                 
    gc.collect()                         

Command:

mpiexec -np 2 python3 mixed-poisson.py -log_view

Expected behavior

I expect the number of destructions to match the number of creations. They do match when run in serial.

Environment:

wence- commented 2 weeks ago

Try a PETSc garbage cleanup after the gc.collect() as well, I think.

lindsayad commented 2 weeks ago

I tried adding all the PETSc garbage cleanups:

    PETSc.garbage_cleanup(PETSc.COMM_SELF)                                                                                
    PETSc.garbage_cleanup(mesh._comm)                                                                                     
    gc.collect()                                                                                                          
    PETSc.garbage_cleanup(PETSc.COMM_SELF)                                                                                
    PETSc.garbage_cleanup(mesh._comm)                                                                                     

if __name__ == "__main__":                                                                                                
    run()                                                                                                                 
    PETSc.garbage_cleanup(PETSc.COMM_SELF)                                                                                
    gc.collect()                                                                                                          
    PETSc.garbage_cleanup(PETSc.COMM_SELF)

and the result is the same.

dham commented 2 weeks ago

I suspect the problem is still garbage collection. Python just doesn't provide any guarantees about when reference cycles will be cleared, and if that only happens while the interpreter is being torn down, then in parallel you will still have leaked objects.
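
To illustrate the point about reference cycles, here is a minimal sketch in plain Python (no Firedrake or PETSc involved). It assumes CPython's reference-counting behaviour, and `Node` and `make_cycle` are hypothetical names made up for the illustration. Two objects that refer to each other survive after every outside name is gone, and are only freed once the cycle collector runs.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for an object that takes part in a reference cycle."""

    def __init__(self):
        self.partner = None


def make_cycle():
    # Build two objects that refer to each other, then drop all outside names.
    a, b = Node(), Node()
    a.partner, b.partner = b, a
    # A weak reference lets us observe the object without keeping it alive.
    return weakref.ref(a)


gc.disable()  # switch off automatic cycle collection so the demo is deterministic
probe = make_cycle()

# Reference counting alone cannot free the pair: each still holds the other.
alive_before = probe() is not None

gc.collect()  # an explicit collection finds and frees the unreachable cycle
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after)
```

With automatic collection left on, the cycle would still be freed eventually, but at a time the program does not control, which is the situation described above.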

connorjward commented 2 weeks ago

PETSc.garbage_cleanup(PETSc.COMM_SELF) will do nothing. I contributed some code upstream that means that we only defer the destruction of objects whose communicator has size greater than 1.

To me it seems feasible that we could be caching PETSc objects in some of our global caches that only get cleared up at interpreter shutdown.

jrmaddison commented 2 weeks ago

An extra cleanup is needed after the final garbage collection.

    ...
    return mesh._comm                                                                                                         

if __name__ == "__main__":                                                                                                
    comm = run()                                                                                                                 
    gc.collect()                         
    PETSc.garbage_cleanup(comm)

wence- commented 2 weeks ago

PETSc.garbage_cleanup(PETSc.COMM_SELF) will do nothing. I contributed some code upstream that means that we only defer the destruction of objects whose communicator has size greater than 1.

To me it seems feasible that we could be caching PETSc objects in some of our global caches that only get cleared up at interpreter shutdown.

I think I did a round of pulling all of those out, so there are only "Object-cached" things that live for the lifetime of the process.

But we absolutely have reference cycles in the Firedrake objects, so to clean things up one does need gc.collect() followed by PETSc.garbage_cleanup on the relevant communicator.

As @jrmaddison notes, it is insufficient to call gc.collect() at the end of the run function, because the references to Firedrake objects are still live. Without explicitly deleting (via del) the names, they don't go out of scope until the function exits. So one must send the communicator out of the run function, and then do as James suggests.
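
The scoping argument can be checked without Firedrake. The following is a rough sketch in plain Python, assuming CPython semantics; `Resource` and the three `run_*` functions are hypothetical names standing in for a Firedrake object with a reference cycle and for variants of the `run` function in this thread. It compares collecting inside the function while the local name is live, collecting after the function has returned, and collecting inside the function after `del`.

```python
import gc
import weakref


class Resource:
    """Hypothetical stand-in for a Firedrake object that is part of a cycle."""

    def __init__(self):
        self.me = self  # self-reference: only the cycle collector can free this


def run_collect_inside():
    obj = Resource()
    probe = weakref.ref(obj)
    gc.collect()  # 'obj' is still a live local name, so nothing is freed here
    return probe() is not None


def run_collect_outside():
    obj = Resource()
    return weakref.ref(obj)  # 'obj' goes out of scope when the function exits


def run_del_then_collect():
    obj = Resource()
    probe = weakref.ref(obj)
    del obj       # drop the name explicitly ...
    gc.collect()  # ... so the collector can now free the cycle
    return probe() is None


alive_inside = run_collect_inside()

probe_out = run_collect_outside()
gc.collect()  # runs after the function has returned
freed_outside = probe_out() is None

freed_after_del = run_del_then_collect()

print(alive_inside, freed_outside, freed_after_del)
```

In the real script the second pattern corresponds to returning `mesh._comm` from `run()` and calling `gc.collect()` followed by `PETSc.garbage_cleanup(comm)` at module level.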

lindsayad commented 2 weeks ago

Changing to

    solve(a == L, w, solver_parameters=parameters)                                                                        

    return mesh._comm                                                                                                     

if __name__ == "__main__":                                                                                                
    comm = run()                                                                                                          
    gc.collect()                                                                                                          
    PETSc.garbage_cleanup(comm) 

does indeed resolve the issue, thanks! I think it would be nice to incorporate this into the documentation examples. Not many users may run with -log_view, but this also removes warnings like yaksa: X leaked handle pool objects.