OpenMined / TenSEAL

A library for doing homomorphic encryption operations on tensors
Apache License 2.0

Memory leak for matmul operation #317

Open dv-ai opened 3 years ago

dv-ai commented 3 years ago

Hi,

In my use case, I run several matmul operations and I observe that the memory used by my program grows very fast. I tried to investigate why the memory grows like this and saw that the matmul operation consumes more memory than I expected.

I expected the available memory after the matmul operation to decrease by the size of the tensor computed by the matmul, but that is not the case.

The following source code shows the issue.

Is there a memory leak somewhere? Did I do something wrong?

NB: with bigger tensors, the memory difference is far greater; sometimes it can be more than 2x the expected memory.

import tenseal as ts
import numpy as np
import psutil
import gc

# Build context
bits_scale = 26
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[31, bits_scale, bits_scale, bits_scale, 31])

context.auto_rescale = True
context.auto_relin = True
context.generate_galois_keys()
context.global_scale = pow(2, bits_scale)

# Define two numpy tensors
x = np.random.random((100, 20)).astype(np.float32)
w = np.random.random((20, 16)).astype(np.float32)

# Measure the memory used to encrypt the numpy array into a CKKS tensor
available_memory = psutil.virtual_memory()[1]
x_cryp = ts.ckks_tensor(context, x)
available_memory2 = psutil.virtual_memory()[1]
memory_used = available_memory - available_memory2

theoretical_used = 8192.0 * (31 * 2 + 3 * bits_scale) * np.prod(np.asarray(x.shape))
print("memory_used=" + str(memory_used / 1E9) + " | memory_used_theoretical=" + str(theoretical_used / 1E9))
# memory_used=2.110029824 | memory_used_theoretical=2.29376
# OK it's close

# Measure the memory used by the matmul operation
z = x_cryp.mm(w)
gc.collect()
available_memory3 = psutil.virtual_memory()[1]
memory_used = available_memory2 - available_memory3
theoretical_used = 8192.0 * (31 * 2 + 3 * bits_scale) * np.prod(np.asarray(z.shape))
print("memory_used=" + str(memory_used / 1E9) + " | memory_used_theoretical=" + str(theoretical_used / 1E9))
# memory_used=2.171191296 | memory_used_theoretical=1.835008
# KO, it's not close
dv-ai commented 3 years ago

When I deactivate auto rescale (context.auto_rescale = False), the memory is: memory_used=1.525137408 | memory_used_theoretical=1.835008

It seems that the memory issue comes from the rescale operation. Do you have any idea why the rescale has such high memory usage?

youben11 commented 3 years ago

Hi! There is an inplace variant of the matmul operation (and of all the others); you can call mm_ instead. I'm guessing that the non-inplace operation keeps the old ciphertext in memory even when it is no longer referenced on the Python side; I'm not sure whether the CKKSTensor itself isn't freed, or whether freeing it doesn't free the underlying ciphertext. We can confirm that if the inplace operation doesn't show the same issue.
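For reference, a minimal sketch of the two variants, assuming the context, x_cryp and w from the snippet above; mm returns a new encrypted tensor, while mm_ updates the receiver in place and returns it (as in the loops later in this thread):

# Non-inplace: allocates a new CKKSTensor holding the result, x_cryp is unchanged.
z = x_cryp.mm(w)

# Inplace: x_cryp itself now holds the matmul result (mm_ returns x_cryp).
x_cryp.mm_(w)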

dv-ai commented 3 years ago

Thank you for your answer. I have the same issue with mm_: the memory used is not equal to the memory of the new tensor. With mm_, even with auto_rescale=False, there is an issue with the memory.

To reproduce the issue with mm_:

import tenseal as ts
import numpy as np
import psutil
import gc

# Build context
bits_scale = 26
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[31, bits_scale, bits_scale, bits_scale, 31])

context.auto_rescale = True
context.auto_relin = True
context.generate_galois_keys()
context.global_scale = pow(2, bits_scale)

# Define two numpy tensors
x = np.random.random((100, 20)).astype(np.float32)
w = np.random.random((20, 16)).astype(np.float32)

# Measure the memory used to encrypt the numpy array into a CKKS tensor
available_memory = psutil.virtual_memory()[1]
x_cryp = ts.ckks_tensor(context, x)
available_memory2 = psutil.virtual_memory()[1]
memory_used = available_memory - available_memory2

theoretical_used = 8192.0 * (31 * 2 + 3 * bits_scale) * np.prod(np.asarray(x.shape))
print("memory_used=" + str(memory_used / 1E9) + " | memory_used_theoretical=" + str(theoretical_used / 1E9))
# memory_used=2.110029824 | memory_used_theoretical=2.29376
# OK it's close

# Measure the total memory used after the in-place matmul operation
z = x_cryp.mm_(w)
gc.collect()
available_memory3 = psutil.virtual_memory()[1]
memory_used = available_memory - available_memory3
theoretical_used = 8192.0 * (31 * 2 + 3 * bits_scale) * np.prod(np.asarray(z.shape))
print("memory_used=" + str(memory_used / 1E9) + " | memory_used_theoretical=" + str(theoretical_used / 1E9))
# memory_used=3.060092928 | memory_used_theoretical=1.835008
# KO, it's not close
# if auto_scale= False -> memory_used=2.77518336 | memory_used_theoretical=1.835008
carlosaguilarmelchor commented 3 years ago

Same issue here. I added a loop at the end of the code so that the multiplication is done in place multiple times:

import tenseal as ts
import numpy as np
import psutil
import gc

# Build context
bits_scale = 26
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[31, bits_scale, bits_scale, bits_scale, 31])

context.auto_rescale = False
context.auto_relin = True
context.generate_galois_keys()
context.global_scale = pow(2, bits_scale)

# Define two numpy tensors
x = np.random.random((100, 20)).astype(np.float32)
w = np.random.random((20, 20)).astype(np.float32)

# Measure the memory used to encrypt the numpy array into a CKKS tensor
available_memory = psutil.virtual_memory()[1]
x_cryp = ts.ckks_tensor(context, x)
available_memory2 = psutil.virtual_memory()[1]
memory_used = available_memory - available_memory2

theoretical_used = 8192.0 * (31 * 2 + 3 * bits_scale) * np.prod(np.asarray(x.shape))
print("memory_used=" + str(memory_used / 1E9) + " | memory_used_theoretical=" + str(theoretical_used / 1E9))
# memory_used=2.110029824 | memory_used_theoretical=2.29376
# OK it's close

# Measure the total memory used after each in-place matmul operation
for i in range(3):
    x_cryp = x_cryp.mm_(w)
    gc.collect()
    available_memory3 = psutil.virtual_memory()[1]
    memory_used = available_memory - available_memory3
    theoretical_used = 8192.0 * (31 * 2 + 3 * bits_scale) * np.prod(np.asarray(x_cryp.shape))
    print("memory_used=" + str(memory_used / 1E9) + " | memory_used_theoretical=" + str(theoretical_used / 1E9))
# outputs for auto_rescale = True and auto_rescale = False are shown below

This is the output obtained with auto_rescale = True

➜  tmp python3 example.py
memory_used=2.070208512 | memory_used_theoretical=2.29376
memory_used=3.649069056 | memory_used_theoretical=2.29376
memory_used=4.695683072 | memory_used_theoretical=2.29376
memory_used=5.259608064 | memory_used_theoretical=2.29376

And with auto_rescale = False

➜  tmp python3 example.py
memory_used=2.139492352 | memory_used_theoretical=2.29376
memory_used=3.159371776 | memory_used_theoretical=2.29376
memory_used=3.132669952 | memory_used_theoretical=2.29376
memory_used=3.108048896 | memory_used_theoretical=2.29376

Apparently this is because SEAL creates ciphertexts inside allocated memory pools that are not released when the ciphertexts are dereferenced (so that the pool can be reused when new ciphertexts are created), but SEAL only reuses these pools for new ciphertexts of the same size.

https://github.com/microsoft/SEAL/issues/241

When auto-rescale is on, the new ciphertexts are smaller, so the old pool is kept but unused and a new pool is created for them (note that the new pool is smaller at each iteration, since the rescaled ciphertexts keep shrinking). TenSEAL should probably not use memory pools the way SEAL does when rescaling is enabled. Apparently there is a way to do this in C++:

https://github.com/awslabs/homomorphic-implementors-toolkit/issues/122

but I do not know if this is correct or stable.
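As a rough sanity check of the pool-reuse explanation, a minimal sketch (assuming the same context parameters as in the examples above) that repeatedly encrypts and discards a tensor of constant shape: if SEAL reuses same-size pools, memory_used should stay roughly flat after the first iteration, in contrast to the rescaling loop above where each iteration produces smaller ciphertexts.

import gc

import numpy as np
import psutil
import tenseal as ts

# Same context parameters as in the examples above
bits_scale = 26
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[31, bits_scale, bits_scale, bits_scale, 31])
context.global_scale = pow(2, bits_scale)

x = np.random.random((100, 20)).astype(np.float32)

baseline = psutil.virtual_memory()[1]
for i in range(3):
    # Encrypt and immediately drop a tensor of constant shape; if the
    # pool explanation is right, the pool allocated on the first pass
    # should be reused and memory_used should stop growing.
    x_cryp = ts.ckks_tensor(context, x)
    del x_cryp
    gc.collect()
    memory_used = baseline - psutil.virtual_memory()[1]
    print("iteration=" + str(i) + " | memory_used=" + str(memory_used / 1E9))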

Best,

Carlos