TRIQS / triqs

a Toolbox for Research on Interacting Quantum Systems
https://triqs.github.io
GNU General Public License v3.0

memory problem in Python interface with Mesh object #952

Open · the-hampel opened this issue 1 month ago

the-hampel commented 1 month ago

A simple Python script that uses nested objects with copies of triqs mesh objects reveals a memory issue in the Python layer (probably not a memory leak) of TRIQS 3.3.x / unstable compared to 3.2.x!

Details

Consider the following script (it uses only triqs, psutil, and the standard library):

from copy import deepcopy
import os
import psutil
from triqs.gf.meshes import MeshImFreq

def process_memory():
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    return mem_info.rss

class SumkDFT():
    def __init__(self, mesh):
        self.mesh = mesh

class Solver():
    def __init__(self, sum_k):
        self.sum_k = deepcopy(sum_k)

def cycle():
    mesh = MeshImFreq(beta=40.0, statistic='Fermion', n_iw=10025)
    # mesh = np.linspace(0.0, 1.0, 10025)  # plain array for comparison (would need "import numpy as np")
    sum_k = SumkDFT(mesh=mesh)
    solver = Solver(sum_k)
    return

# loop and call cycle(); every time a Solver object is created, the memory increases!
print('mem in MB\n')
for j in range(200):
    for i in range(1000):
        cycle()
    print(f'{process_memory()/1024**2:.2f}')

Running this with triqs 3.2.x and 3.3.x gives vastly different memory footprints:

[attached plot: memory footprint (RSS in MB) over the loop iterations for triqs 3.2.x vs. 3.3.x]
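As a side check (not part of the original report), one can verify that the growth happens outside Python-managed allocations by comparing RSS against tracemalloc, reusing cycle() and process_memory() from the script above:

import tracemalloc

tracemalloc.start()
for i in range(1000):
    cycle()
# If the Python-tracked number stays small while RSS keeps growing, the
# memory is held by native (C++ / hdf5) code, not by Python objects.
current, peak = tracemalloc.get_traced_memory()
print(f'python-tracked: {current/1024**2:.2f} MB, rss: {process_memory()/1024**2:.2f} MB')
tracemalloc.stop()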

A few more observations: [collapsed list in the original issue]

[collapsed details: compiler info]

It would be great if someone else could verify this. The problem is pretty bad for larger objects holding many mesh objects. As the names of the mock classes suggest, the problem originally occurred in triqs/solid_dmft as a severe memory problem, making nodes run out of memory when dealing with larger objects.

the-hampel commented 1 month ago

I just noticed that this actually seems to be a problem with deepcopy, i.e. even this much simpler script shows the same strange memory behavior:

from copy import deepcopy
import os
import psutil
from triqs.gf.meshes import MeshImFreq

def process_memory():
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    return mem_info.rss

def cycle():
    mesh = MeshImFreq(beta=40.0, statistic='Fermion', n_iw=10025)
    mesh_2 = deepcopy(mesh)
    return

# loop and call cycle(); every deepcopy of the mesh increases the memory!
print('mem in MB\n')
for j in range(200):
    for i in range(1000):
        cycle()
    print(f'{process_memory()/1024**2:.2f}')

I guess I should in general avoid using deepcopy (the mesh object has its own copy function), but I still have to identify where this happens in my original code.
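For illustration, the workaround would look like this (a minimal sketch; it assumes the mesh's own copy method, mentioned above, returns an independent copy):

from triqs.gf.meshes import MeshImFreq

mesh = MeshImFreq(beta=40.0, statistic='Fermion', n_iw=10025)
mesh_2 = mesh.copy()  # the mesh's own copy function; avoids the h5 round trip that deepcopy triggers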

the-hampel commented 1 month ago

The issue has been identified. The problem lies in the creation of attributes as variable-length strings in the h5 library, which is used for the (de-)serialization of objects in triqs, e.g. when calling deepcopy, mpi.bcast, etc. There seems to be a not-yet-reported memory leak in hdf5 versions 1.12.3 and 1.14.x. The issue can be worked around by using hdf5 version 1.10.11 or older. However, the (de-)serialization of tuple objects via h5 is horribly slow; this was introduced when switching from boost serialization to h5 in triqs 3.2.x. @Wentzell added a fix on the test branch https://github.com/TRIQS/triqs/tree/DEV_SERIALIZATION that reverts some of these changes for simple tuples without requiring boost, giving tremendous speed improvements (a factor of 10 or more) over the current version.
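To make the mechanism concrete, here is a sketch of the pattern described above, with pickle standing in for the h5-based serialization (the real triqs implementation goes through h5, not pickle; the class name is made up):

from copy import deepcopy
import pickle

class Wrapped:
    """Stand-in for a wrapped C++ object whose deepcopy is implemented
    as a serialize/deserialize round trip."""
    def __init__(self, payload):
        self.payload = payload

    def __deepcopy__(self, memo):
        # Every deepcopy funnels through the serializer; if the serializer
        # leaks (e.g. in HDF5's variable-length string attributes), the
        # leak accumulates once per copy.
        return pickle.loads(pickle.dumps(self))

obj_2 = deepcopy(Wrapped(payload=[1, 2, 3]))  # triggers the round trip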

We are currently preparing an issue for the hdf5 library, but for now it is safer to avoid the newer hdf5 versions!
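To check which HDF5 version a given Python environment links against, one option (assuming h5py is installed; note that the HDF5 build used by triqs's h5 library may differ) is:

import h5py

# 1.12.3 and 1.14.x show the leak; 1.10.11 or older do not.
print(h5py.version.hdf5_version)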