
SENSEI ∙ Scalable in situ analysis and visualization
https://sensei-insitu.org

Growing Memory Consumption with Endpoint + ADIOS2 #107

Open jwindgassen opened 1 year ago

jwindgassen commented 1 year ago

I am currently trying to run SENSEI with a simulation where I use ADIOS2 to collect the data on a separate node and run the visualizations there. I noticed, however, that the memory usage on the receiving node was increasing with every timestep. The case I simulated was not gigantic: the files written by PosthocIO were about 10 GB per timestep, which is also roughly the amount by which memory consumption grew with every timestep. After a few dozen steps the Endpoint crashed because no more memory could be allocated.

As far as I can tell, this only happens on the receiving node. When I tried it before with visualization running on the simulating nodes, I saw no concerning leaks whatsoever. In the case outlined above, the increasing memory was also only visible on the receiving node; the 8 simulating nodes were pretty much constant.

I created a small test example with the oscillator miniapp:

example.slurm:

#!/bin/bash -x
#SBATCH --job-name=example
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=6
...

# Loading modules
...
module load sensei/4.1.0-adios2-catalyst-5.10.1

export PROFILER_ENABLE=1
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Starting Simulation:
rm -rf info.sst
srun -N4 -n32 --cpu-bind=verbose oscillator -b 65536 -s "9999999 9999999 9999999" -p 100 --t-end 3 -j 6 -f transport.xml --sync periodic-3772.osc &

# Starting SENSEI Endpoint
srun -N1 -n1 --cpu-bind=verbose SENSEIEndPoint -t transport.xml -a analysis.xml &> "mpi-$SLURM_JOB_ID.endpoint" &

wait
rm -rf info.sst

transport.xml:

<sensei>
    <transport type="adios2" enabled="1" engine="SST" filename="info" frequency="1">
        <engine_parameters>
            verbose = 5
            RendezvousReaderCount = 1
            RegistrationMethod = File
            OpenTimeoutSecs = 300
            <!--DataTransport = RDMA-->
        </engine_parameters>

        <mesh name="mesh">
            <cell_arrays>data</cell_arrays>
        </mesh>
    </transport>
</sensei>

analysis.xml:

<sensei>
    <analysis type="PosthocIO" enabled="1" frequency="1" output_dir="./posthocIO" file_name="output" mode="paraview">
        <mesh name="mesh">
            <cell_arrays>data</cell_arrays>
        </mesh>
    </analysis>

    <analysis type="catalyst" enabled="1" frequency="1" pipeline="slice" array="data" association="cell" image-filename="./datasets/slice-%ts.png" image-width="1920" image-height="1080" />
</sensei>

I made the oscillator inputs pretty large so that any memory increase would actually be visible.

Running this setup, our job reporting shows the following memory consumption for the Endpoint node: [memory-usage graph of the Endpoint node from the job reporting]. In this case the memory increased by around 168 GB over the complete 30 minutes, and the Endpoint received (according to the log) 344 timesteps, so the leaked memory should be about 500 MB per timestep. (The steps you see in the graph are just the sampling rate of the job reporting, about once every 1-2 minutes, not the actual consumption per step!)
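For reference, the per-timestep figure is simply the total growth divided by the number of timesteps the Endpoint received, roughly:

python3 -c "print(168e9 / 344 / 1e6)"   # ≈ 488 MB leaked per received timestep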

SENSEI Version: 4.1.0
ADIOS2 Version: 2.7.1

I already tried going in with a debugger, but I could not find any immediately obvious cause for this. I will try to play around with the ADIOS2 parameters a bit and report any further discoveries.

Thanks in advance, ~Jonathan

jwindgassen commented 1 year ago

So it seems the only relevant ADIOS2 option is QueueLimit, though not for a reason related to the memory leak: once you set QueueLimit = 1, the memory no longer increases. But the Endpoint also does not receive any data at all; the log is just dead.
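For reference, this is roughly how I set the option, added to the engine_parameters block of the transport.xml above (QueueLimit and QueueFullPolicy are engine parameters documented for the ADIOS2 SST engine; the values shown are just what I experimented with):

<transport type="adios2" enabled="1" engine="SST" filename="info" frequency="1">
    <engine_parameters>
        verbose = 5
        RendezvousReaderCount = 1
        RegistrationMethod = File
        OpenTimeoutSecs = 300
        <!-- limit how many timesteps the writer may queue (0 = unlimited, the default) -->
        QueueLimit = 1
        <!-- what happens when the queue is full: Block the writer (default) or Discard steps -->
        QueueFullPolicy = Block
    </engine_parameters>
    ...
</transport>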

I took a closer look at the case where I set QueueLimit = 0 and noticed that here the Endpoint also does not receive any actual data, and no output (Catalyst slice or PosthocIO) is generated. The output of the Endpoint looks like this:

STATUS: [0][.../SENSEI-4.1.0/endpoints/SENSEIEndPoint.cpp:94][v4.1.0]
STATUS: Processing time step 0 time 0
Reader 0 (0x12d02c0): Received a Timestep metadata message for timestep 1, signaling condition
Reader 0 (0x12d02c0): Received a Timestep metadata message for timestep 2, signaling condition
Reader 0 (0x12d02c0): Received a Timestep metadata message for timestep 3, signaling condition
Reader 0 (0x12d02c0): Received a Timestep metadata message for timestep 4, signaling condition
Reader 0 (0x12d02c0): Received a Timestep metadata message for timestep 5, signaling condition
...

The rest of the messages remain identical until the time limit is reached. So I would assume the Endpoint only ever receives the metadata and not the simulated data.

This would explain why no memory leak can be seen with QueueLimit = 1, as no metadata will be sent in that case.

Unfortunately, I can't see a reason why no actual data is sent. It might just be a misconfiguration on my side.

burlen commented 1 year ago

Hi @jwindgassen, can you reproduce this with just the PosthocIO back end? Having two back ends makes it harder to say what's going on. PosthocIO is the simpler of the two.

burlen commented 1 year ago

I've extensively profiled the adios code, and at that point there were no leaks. Since that time we've moved to SVTK, and this required an additional conversion for both Catalyst and VTK I/O. This conversion from SVTK to VTK may be one potential source of a leak. I'll see if I can eliminate that.

burlen commented 1 year ago

oscillator + VTKPosthocIO runs cleanly in valgrind. I think the leak is not in the VTKPosthocIO class.

burlen commented 1 year ago

I found a couple of bugs (not memory leaks) that will impact you; they are fixed in #109.

burlen commented 1 year ago

I profiled oscillator -> adios2 sst -> endpoint -> vtk posthoc i/o with valgrind on both producer and consumer ends and found only one leak; it is in adios2 (I'm using 2.8.3, the latest official release). It is a minor leak and would not account for what is reported above.

Of course, what you're reporting may not be a leak; it may be an accumulation of memory that is in fact properly released at program end. In that case we'll have to use a heap profiler to track the source down. I'd strongly suspect ADIOS in that case, because there's nothing I know of in SENSEI that would accumulate memory.

That said, we still need to check the Catalyst slice for leaks, or see if you can reproduce the issue without the Catalyst slice. Ruling out Catalyst (making the run without it) is probably easier than using the heap profiler.
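For reference, a rough sketch of the kind of checks I mean, reusing the Endpoint command line from your job script (the valgrind and massif invocations are standard; the srun options are only illustrative and may need adapting to your system):

# leak check on the consumer (reader) side
srun -N1 -n1 valgrind --leak-check=full --show-leak-kinds=definite \
    SENSEIEndPoint -t transport.xml -a analysis.xml &> endpoint-valgrind.log

# heap profile, useful if memory accumulates but is released correctly at exit
srun -N1 -n1 valgrind --tool=massif \
    SENSEIEndPoint -t transport.xml -a analysis.xml
ms_print massif.out.<pid>   # inspect the profile afterwards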

jwindgassen commented 1 year ago

Hi @jwindgassen, can you reproduce this with just the PosthocIO back end? Having two back ends makes it harder to say what's going on. PosthocIO is the simpler of the two.

Yes, this problem arises even if I turn the Catalyst analysis off. Also, I could not see any significant difference in the rate at which the memory increases.
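Concretely, I only flipped the enabled attribute on the Catalyst analysis in the analysis.xml from above (I assume enabled="0" is the intended way to switch an analysis off, the same as for the transport); the rest is unchanged:

<sensei>
    <analysis type="PosthocIO" enabled="1" frequency="1" output_dir="./posthocIO" file_name="output" mode="paraview">
        <mesh name="mesh">
            <cell_arrays>data</cell_arrays>
        </mesh>
    </analysis>

    <!-- Catalyst slice disabled for this test -->
    <analysis type="catalyst" enabled="0" frequency="1" pipeline="slice" array="data" association="cell" image-filename="./datasets/slice-%ts.png" image-width="1920" image-height="1080" />
</sensei>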


I profiled oscillator -> adios2 sst -> endpoint -> vtk posthoc i/o with valgrind on both producer and consumer ends and found only one leak; it is in adios2 (I'm using 2.8.3, the latest official release). It is a minor leak and would not account for what is reported above.

Of course, what you're reporting may not be a leak; it may be an accumulation of memory that is in fact properly released at program end. In that case we'll have to use a heap profiler to track the source down. I'd strongly suspect ADIOS in that case, because there's nothing I know of in SENSEI that would accumulate memory.

That said, we still need to check the Catalyst slice for leaks, or see if you can reproduce the issue without the Catalyst slice. Ruling out Catalyst (making the run without it) is probably easier than using the heap profiler.

This was also what we thought might be the more likely problem here, especially considering that SENSEI mostly uses shared_ptrs, etc. But we do not have enough knowledge of ADIOS2 to comment from that side.