abria / TeraStitcher

A tool for fast automatic 3D-stitching of teravoxel-sized microscopy images
http://abria.github.io/TeraStitcher/

Managing teraconverter memory usage #67

Open chrisroat opened 4 years ago

chrisroat commented 4 years ago

I am using terastitcher 1.11.10. When I run paraconverter to do the final conversion on a dataset, I see a large spike in memory usage. I am running a container on Google Cloud Kubernetes, and the spike causes the container to hit its memory limit and be killed. I keep raising the limit as far as I reasonably can (100s of GB).

I was originally using paraconverter with mpirun on HDF5 (IMS) files. To isolate the problem, I dropped out of parallel mode and used teraconverter directly. I also pulled the relevant data out of the HDF5 file into TIFF files.

One channel of data is 192 tiles of 2048x2048x46 uint16 voxels, which is ~70GB of raw data. On disk, the compressed TIFFs total 25GB. I measured the CPU/memory usage on my personal machine when running teraconverter and see fairly sustained use of 120GB during computation, which drops to 60GB when writing. Google Cloud's monitoring sees a spike at ~180GB, right at the transition from computation to writing. This amount of memory is attainable with a highmem 32-core GCP machine (208GB), though I will probably use a 64-core machine (416GB) to have extra headroom. But this seems wasteful, and is forcing me to parallelize by splitting my channels and sharing a single alignment solution across them.
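
For reference, a quick back-of-the-envelope check (my own arithmetic, just to make the numbers concrete) of the raw channel size against the peaks I observed:

```python
# One channel: 192 tiles of 2048 x 2048 x 46 uint16 voxels.
tiles = 192
voxels_per_tile = 2048 * 2048 * 46
bytes_per_voxel = 2  # uint16

raw_gb = tiles * voxels_per_tile * bytes_per_voxel / 1e9
print(f"raw channel size:     {raw_gb:.0f} GB")     # ~74 GB

# Peaks observed in the runs described above.
print(f"sustained / raw:      {120 / raw_gb:.1f}x")  # ~1.6x during computation
print(f"reported spike / raw: {180 / raw_gb:.1f}x")  # ~2.4x at the compute-to-write transition
```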

My guess is that paraconverter is dividing up and processing workloads in a way that adds overhead. Is there any way to control the number of workload chunks, say so that it equals the number of output tiles? The workload chunking seems to be determined automatically, but I'd like the chunks to be even smaller than what it attempts.

Would you be willing to put paraconverter in the github repo, so I can try such a change?

chrisroat commented 4 years ago

I looked over the algorithm in the copy of paraconverter I have (please put it on GitHub?), and found that by adjusting the number of threads I specify, I can force lots of smaller chunks. The key is to use a moderately large prime number of threads, I think? [I was using 4 before in my testing.]
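
One way this could happen (purely my speculation, not paraconverter's actual code): if the partitioner splits the output grid using a 2-D factorization of the worker count, a prime count only factors as 1 x P, which rules out the blockier splits. A minimal, hypothetical sketch:

```python
import math

# Hypothetical illustration only -- this is NOT paraconverter's real partitioning logic.
# Split a grid of tiles among `workers` using every 2-D factorization of the worker
# count, and report the per-worker chunk shape each factorization would give.

def factor_pairs(p):
    """All (a, b) with a * b == p; a prime p yields only (1, p) and (p, 1)."""
    return [(a, p // a) for a in range(1, p + 1) if p % a == 0]

def chunk_shapes(rows, cols, workers):
    """Per-worker chunk shape (in tiles) for each factorization of the worker count."""
    return [(math.ceil(rows / a), math.ceil(cols / b)) for a, b in factor_pairs(workers)]

# 192 tiles laid out as, say, 12 x 16 (layout assumed for illustration).
print(chunk_shapes(12, 16, 4))   # [(12, 4), (6, 8), (3, 16)] -- blocky chunks possible
print(chunk_shapes(12, 16, 13))  # [(12, 2), (1, 16)] -- a prime count forces thin strips
```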

iannellog commented 4 years ago

parastitcher and paraconverter are already on GitHub. The download links are in the Download section of this page:

https://github.com/abria/TeraStitcher/wiki/Multi-CPU-parallelization-using-MPI-and-Python-scripts

The problem you observe is due to a constant that is likely too big and forces loading more memory than is strictly needed. It was introduced to reduce the number of 'open' and 'close' operations on files, which were a bottleneck in other contexts.

Please try substituting the executables with the ones at this link (they are for Linux, which is your platform, right?):

https://unicampus365-my.sharepoint.com/:u:/g/personal/g_iannello_unicampus_it/ETSyd7Awc2xNmKIxtTNKzyQBB1yxUWoAiyZjisf75OlrRg?e=WkGdVb

and tell me whether memory usage drops substantially. If it works, I will update all distributions.

chrisroat commented 4 years ago

I was recently testing these new binaries, but I still see the memory spiking and causing my containers to crash. This time I am stitching TIFF stacks rather than IMS files, if that makes any difference.

iannellog commented 4 years ago

One question: are you using the option --resolutions? If yes, which resolutions are you generating? (By the way: using TIFF should not make any difference.) Another question: does the dataset have the same dimensions (i.e. 192 tiles of 2048x2048x46 uint16 voxels, ~70GB of data per channel)? How many channels are you stitching at a time now?

Let me try to explain why there is a memory spike when merging is performed. In order to merge tiles, TeraConverter allocates a buffer that stores whole slices of the final image (assuming a 10% overlap among tiles, in your case this means about 0.81 x 2048 x 2048 x 192 = 650,000,000 pixels per slice and per channel). Each pixel is stored as a single-precision floating-point number, which means about 2.61 GB per slice and per channel. This is the minimum memory needed. In the executables of the standard distribution, to reduce the number of times input files are opened and closed, I set a parameter that loads 64 slices at a time. I understand it is likely too large a value, but I forgot to change it. In your case 46 slices are loaded, since your dataset has only 46 slices. This explains the huge memory spike (120 GB) you observed.
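
Plugging in these numbers (a restatement of the calculation above, assuming the 10% overlap and float32 merge buffer as described):

```python
# Reproduce the buffer estimate described above.
overlap_factor = 0.81            # ~10% overlap among tiles in x and y
tiles = 192
pixels_per_slice = overlap_factor * 2048 * 2048 * tiles
bytes_per_pixel = 4              # merge buffer stores single-precision floats

slice_gb = pixels_per_slice * bytes_per_pixel / 1e9
print(f"pixels per slice:  {pixels_per_slice:.2e}")   # ~6.5e8
print(f"buffer per slice:  {slice_gb:.2f} GB")        # ~2.61 GB per channel

slices_loaded = 46               # parameter default is 64, capped at the dataset depth
print(f"peak merge buffer: {slices_loaded * slice_gb:.0f} GB")  # ~120 GB
```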

The version I prepared for you should allocate a buffer for just one slice, unless you are asking for lower resolutions, in which case the number of slices loaded at a time is 2^n, where n=0 is the highest resolution. I cannot understand why you again observe a large spike, unless you are asking for lower resolutions. However, as soon as I have time I will check that the version I prepared for you actually allocates a buffer for one slice when n=0.
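
Using the per-slice figure from the calculation above, the 2^n rule implies roughly this scaling per requested resolution level (a rough reading of the rule, not a measured figure):

```python
# Slices held at a time for resolution level n is 2**n (n = 0 is the highest resolution),
# so coarser levels need proportionally larger buffers.
slice_gb = 2.61  # per-slice merge buffer from the calculation above

for n in range(5):  # levels 0..4, if five resolutions were requested
    slices = 2 ** n
    print(f"level {n}: {slices:2d} slice(s) -> ~{slices * slice_gb:.1f} GB")
```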

I hope this can help.

chrisroat commented 4 years ago

I just noticed I had an old version of my container with the older terastitcher binary. I will try again with the newer one - no need for you to do anything. Sorry about that.

chrisroat commented 3 years ago

Were the updates you put into the custom binaries ever checked into GitHub? If not, can I request that they be added on a branch, if not the main one?

I am building from source to pick up a patch for a bad pointer, and I am hitting memory limits again (if you recall, containers on cloud computing do not swap -- they are killed when they hit memory limits).