JuliaHealth / KomaMRI.jl

Koma is a Pulseq-compatible framework to efficiently simulate Magnetic Resonance Imaging (MRI) acquisitions. The main focus of this package is to simulate general scenarios that could arise in pulse sequence development.
https://JuliaHealth.github.io/KomaMRI.jl/dev/
MIT License

Applying @time to simulate reveals discrepancies on resources and timing reports #392

Closed: gsahonero closed this issue 1 month ago

gsahonero commented 3 months ago

Hi,

I simulated some sequences using simulate and found that the timed data differed from what @time provides. Check this screenshot for reference:

[screenshot: simulate's printed timing vs. the @time output]

There, the last line refers to the data that @time produces, and the previous one refers to what simulate provides. To reproduce this, the following code can be used:

using KomaMRI

x_range = [0]
y_range = [0]
z_range = [0]
T1_range = [1000e-3]
T2_range = [35e-3]
T2s_range = [20e-3]
rho_range = [1]
obj = Phantom{Float64}(name="Test phantom", x = x_range, y = y_range, z = z_range, T1 = T1_range, T2 = T2_range, T2s = T2s_range, ρ=rho_range);

seq = read_seq("...\\ge.seq")

plot_seq(seq)
sys = Scanner()
@time raw_x = simulate(obj, seq, sys)

I think the differences are due to other functions unrelated to run_sim_time_iter!, which is the one wrapped with @time inside simulate, but it would be useful to trace them and check what could be optimized even further (~2 GB is much larger than 300 MB!). It would also be useful to clarify that the timing information reported by simulate covers only the simulation itself. Thinking of end-users, they might get confused - like me - by the difference between the reported and the perceived time.

Best, Guillermo

cncastillo commented 3 months ago

As you said, there are some preprocessing steps before the simulation that are not accounted for in the reported time. As run_sim_time_iter! is the actual simulation, I think it is fair to time that function; if the preprocessing is taking 2 seconds, that is a problem and needs to be optimized.

Can you send me the sequence you are using?

gsahonero commented 3 months ago

I used the Pulseq gradient-echo sequence in the examples folder for the screenshot I shared. However, I have two other sequences with different ADC durations that show increased times and resource usage too. I'm attaching them here:

sequences.zip

The zip contains a JLD2 file with a dictionary including two sequences.

gsahonero commented 2 months ago

Hello there,

By checking simulate, this line seems to take a lot of time by itself:

if sim_params["gpu"] GC.gc(true); CUDA.reclaim() end

Since I'm using a dedicated GPU, that line gets executed every time. CPU-only users never run it, which would explain why they don't see these significant timing discrepancies.

In any case, I wonder whether that line can be removed or skipped given that according to https://cuda.juliagpu.org/stable/usage/memory/#Garbage-collection, "There is no need for manual memory management, just make sure your objects are not reachable (i.e., there are no instances or references)."

cncastillo commented 2 months ago

We could just remove that line; maybe it doesn't help. If you run simulate in a for loop and there are no memory problems without that line, I would say we can just remove it. Related to #125.
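A minimal sketch of that check could look like this (reusing obj, seq, and sys from the reproduction script above, with the GC line removed from simulate; assumes a CUDA-capable GPU):

using KomaMRI, CUDA

# Run simulate repeatedly and watch GPU memory; if usage stays stable
# without the GC.gc(true)/CUDA.reclaim() line, it can likely be dropped.
for i in 1:20
    raw = simulate(obj, seq, sys)
    CUDA.memory_status()   # prints used/free GPU memory after each run
end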

cncastillo commented 2 months ago

@gsahonero can you report if this did not generate any problems?

If you run simulate in a for loop and there are no memory problems without that line, I would say we can just remove it. Related to https://github.com/JuliaHealth/KomaMRI.jl/issues/125.

gsahonero commented 1 month ago

This past month, I have been using the modified version of simulate (without the garbage-collection call). I didn't find any problems. The discrepancies between the reports are still there, but the gap is no longer that large. I think the documentation should state that the reported @time information belongs to the simulation only and that other processing happens outside of it, i.e., add a clarification.

cncastillo commented 1 month ago

Where would you put it in the docs? In the docstring for simulate?

gsahonero commented 1 month ago

Yes! Actually, after checking https://juliahealth.org/KomaMRI.jl/dev/reference/3-koma-core/#KomaMRICore.simulate, I can see that there is no mention of the @time prints. Explaining what the timing printed by simulate refers to could be useful.

cncastillo commented 1 month ago

Sounds like a good first PR :smile: (You can do it directly on GitHub by pressing the :pencil2: icon)

https://github.com/JuliaHealth/KomaMRI.jl/blob/master/KomaMRICore/src/simulation/SimulatorCore.jl

gsahonero commented 1 month ago

The PR is there: #431 :)

Tooine commented 1 month ago

Hi,

I'm using KomaMRI to simulate MR Fingerprinting sequences. For such long sequences with multiple pulses, I observe a much longer preprocessing time than the simulation time, either on CPU or GPU.

For instance, on the simulation below, I get a simulation time of 9 min 12 s while the total run takes 49 min, giving a preprocessing time of about 40 min! I used the following MRF sequence with 70 pulses: 2D_MRF_bSSFP_TR21_TE5_10_15_inv.zip. My phantom has about 1,000,000 points, but that does not seem to explain the long preprocessing time (which is much shorter if I simulate with smaller sequences). It seems that the longest task is the preprocessing of the sequence (either seqd = seqd |> f32 or seqd = seqd |> gpu?).

[screenshot: timing output for the 70-pulse MRF simulation]

I don't know how this could be optimized within a single run of the simulate function. Yet, if I want to simulate several phantoms with the same sequence, with the current implementation I have to pay this preprocessing time for each phantom. Maybe there is a way to modify the implementation to separate the preprocessing from the simulation, in which case the sequence preprocessing could be done only once? (Same for the opposite case where one deals with a very large phantom and several sequences: the phantom could be preprocessed only once.)

Anyway, it is a pleasure to use KomaMRI! :-) Thanks to the team for developing this powerful tool!

Best, Antoine

PS1: My goal is to simulate much longer sequences (about 300 pulses), but it seems infeasible today as the preprocessing time grows more than linearly with the number of pulses: for 40 pulses I get approximately 15 min of preprocessing.

PS2: apart from that, when computing on CPU (with smaller sequences), I have trouble with parallel processing: increasing the number of threads N_threads leads to an increase in simulation time... I don't think it is linked, but maybe I'm facing technical issues that might also affect the precompilation of the simulate function...

cncastillo commented 1 month ago

Hi @Tooine, first of all, thanks for using Koma :smile:!

That definitely sounds like a problem. Thanks for attaching the sequence so we can investigate.

Are you using the newest version of all the Koma-related packages? (v0.8.X) We fixed a huge slowdown when sequences had too many ADC points in KomaMRICore v0.8.3. If you cannot update to v0.8.X, that probably means you need to update Julia to >=1.9.

The CPU problem could be that you did not open Julia with multiple threads (julia --threads=auto); you can check with Threads.nthreads().
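For example (a quick check; the flag has to be passed when Julia is launched, it cannot be changed from a running session):

# Launch Julia from a terminal with, e.g., julia --threads=auto
# then verify inside the session:
Threads.nthreads()   # should be > 1 for the multi-threaded CPU path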

Let me know if that helps, if not, I will fix it :muscle:. @rkierulf Maybe this is a good sequence to test the performance improvements coming to v0.9.

cncastillo commented 1 month ago

I forgot to answer this:

Maybe there is a way to modify the implementation to separate the preprocessing from the simulation, in which case the sequence preprocessing could be done only once

You can already do that in a few ways (without waiting for me to fix the issue):

  1. Add the phantoms together obj_new = obj1 + obj2. The problem is that the signals of each phantom will be combined into one. You can fix this by using an approach similar to this PR https://github.com/JuliaHealth/KomaMRI.jl/pull/348 from @depedraza, where you define a new simulation method that does that. I haven't had time to include this in Koma yet, and sadly there is no documentation on how to extend the simulation methods. Basically, you would need to define:
struct MyMethod <: SimulationMethod end # You can add parameters to this

KomaMRICore.sim_output_dim(obj::Phantom{T}, seq::Sequence, sys::Scanner, sim_method::MyMethod) where {T<:Real}

KomaMRICore.run_spin_precession!(p::Phantom{T}, seq::DiscreteSequence{T}, sig::AbstractArray{Complex{T}}, M::Mag{T}, sim_method::MyMethod) where {T<:Real}

# Optional
KomaMRICore.run_spin_excitation!(p::Phantom{T}, seq::DiscreteSequence{T}, sig::AbstractArray{Complex{T}}, M::Mag{T}, sim_method::MyMethod) where {T<:Real}

and then

sim_params["sim_method"] = MyMethod()

Check KomaMRICore/src/simulation/Bloch/BlochSimulationMethod.jl (defines all functions) and KomaMRICore/src/simulation/Bloch/BlochDictSimulationMethod.jl (defines just a subset) as examples. You don't need to define all of the functions, only the ones you want to replace; Bloch() is the fallback in case a function is not defined.

Important: you don't need to "touch" Koma's source code to extend it! You can do it completely externally.

  2. Use the BlochDict simulation method and add the magnetizations externally to get the signal: https://juliahealth.org/KomaMRI.jl/dev/explanation/3-simulation/#BlochDict-Simulation-Method

  3. Use run_sim_time_iter! directly instead of simulate.

The first one would be the recommended method, as it gives you more flexibility. If you found the PR useful and can help (1) clean it up so we can merge it or (2) document the simulation-method extensibility (it is a very powerful but little-known feature) as a tutorial (just create examples/3.tutorials/lit-06-SimulationMethods.jl with a simple example), that would help greatly :smile:.

gsahonero commented 1 month ago

I agree with @cncastillo! You need to use the latest version to reduce the simulation time. The speedup between v0.7.x and v0.8.x is significant.

Complementing:

  2. Use the BlochDict simulation method and add the magnetizations externally to get the signal: https://juliahealth.org/KomaMRI.jl/dev/explanation/3-simulation/#BlochDict-Simulation-Method

In this case, simulate requires you to provide N spins to get N output signals (using return_type="mat" in sim_params). Each signal is an independent simulated signal (as if you were simulating a point-like phantom). Moreover, I don't think you would want to handle 1M output signals! So, you should think about your phantom again and work out how to use those N output signals for MRF.
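A rough sketch of that setup (obj, seq, and sys being your own phantom, sequence, and scanner; parameter names as in the simulation documentation linked above):

using KomaMRI

# Ask simulate for per-spin signals instead of one combined signal.
sim_params = Dict{String,Any}(
    "sim_method"  => BlochDict(),   # keep each spin's contribution separate
    "return_type" => "mat",         # return a sample array instead of an ISMRMRD object
)
sig = simulate(obj, seq, sys; sim_params)   # one signal per spin of the phantom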

Finally, please check whether the GPU garbage collection is affecting performance when you use a GPU. For this, look for the line if sim_params["gpu"] GC.gc(true); CUDA.reclaim() end in SimulatorCore.jl. If it exists and is still being run, comment that line out, replace the simulate function with the modified version, and check whether performance improves.

Tooine commented 1 month ago

Hi,

thanks for those quick and detailed answers!

First of all, the CPU problem is solved. I'm using Julia in a Jupyter notebook with VS Code, and indeed it required a Julia environment with a specified number of threads. I'm not a Julia expert, so I missed this - thanks!

Are you using the newest version of all the Koma-related packages? (v0.8.X) We fixed a huge slowdown when sequences had too many ADC points in KomaMRICore v0.8.3. If you cannot update to v0.8.X, that probably means you need to update Julia to >=1.9.

Yes, I'm using KomaMRI v0.8.2 with Julia v1.10.2. Looking at my Julia packages, I see 1) KomaMRICore v0.8.3 as a sub-package of KomaMRI v0.8.2, and 2) KomaMRICore v0.8.2 as a standalone package.

In my code, I only imported KomaMRI, but I don't know which version of KomaMRICore is called when using simulate. I thought it was option 1, but to be sure I updated option 2 to v0.8.3 and then relaunched a simulation.

So, in a nutshell (not sure I was very clear), I'm now certain I'm using KomaMRICore v0.8.3, and this doesn't improve performance.

[screenshot: timing output after updating to KomaMRICore v0.8.3]

Maybe there is a way to modify the implementation to separate the preprocessing from the simulation, in which case the sequence preprocessing could be done only once

You can already do that in a few ways (without waiting for me to fix the issue):

Thanks for highlighting those possibilities. Currently, I'm more in the setting where I have one phantom and several sequences to test, but this might be interesting for me in the future (in which case I can try to submit a PR, but I don't know if I have the necessary Julia skills for this).

Finally, please check whether the GPU garbage collection is affecting performance when you use a GPU. For this, look for the line if sim_params["gpu"] GC.gc(true); CUDA.reclaim() end in SimulatorCore.jl. If it exists and is still being run, comment that line out, replace the simulate function with the modified version, and check whether performance improves.

OK, I'm going to check this now. I can see this line in the simulate function, but it comes after the preprocessing (and even after the simulation, in fact), so I don't know if there is a chance of improvement?

gsahonero commented 1 month ago

Hi,

Just a quick answer: that GPU garbage-collection line may take a lot of time and resources. I understand the line will be removed soon (@cncastillo, I can send a PR just for this if needed). If you want to look deeper at the time and resources and profile what is taking too long in your simulation, add @time to some lines of simulate inside SimulatorCore.jl and run your simulation (replacing the default version of the function). That will shed some light.

In any case, simulate is always slower the first time it is used. Subsequent calls should be faster.
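A small sketch of that warm-up pattern (with obj, seq, and sys from your own setup):

simulate(obj, seq, sys)          # first call pays the compilation cost
@time simulate(obj, seq, sys)    # later calls reflect the steady-state runtime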

cncastillo commented 1 month ago

Yeah I agree that the GC line is not worth checking.

I will profile your sequence over the weekend. (By the way, for this I recommend using @profview in VS Code instead of sprinkling @times around. I really suggest reading https://modernjuliaworkflows.github.io/ - it is full of tips and tricks!)

My intuition is that the problem comes from seqd = discretize(seq); there is some low-hanging fruit I could tackle to speed it up.
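For example, a minimal profiling sketch along those lines (obj, seq, and sys being your own setup; @profview is provided by the Julia VS Code extension):

# Flame-graph the whole call from the VS Code Julia REPL:
@profview simulate(obj, seq, sys)

# Or time the suspected preprocessing step on its own
# (qualified here with KomaMRIBase; in older versions it lived in KomaMRICore):
@time seqd = KomaMRIBase.discretize(seq)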

JanWP commented 1 month ago

Hi all, I'm a colleague of @Tooine

I just looked into this and it seems the problem is in the discretization of the sequence, and more specifically in get_samples.

I think the issue may be the one encountered in this thread on the Julia Discourse.

Using the sequence attached by @Tooine, I see the following:

@btime t_rf = reduce(vcat, T0[i] .+ times(seq.RF[1,i], :A) for i in range)
  368.622 s (458569 allocations: 1029.27 GiB)

Whereas by modifying the line slightly according to one of the proposed alternatives in that thread, I get:

@btime t_rf2 = reduce(vcat, collect(T0[i] .+ times(seq.RF[1,i], :A) for i in range))
  89.593 ms (377652 allocations: 325.26 MiB)

Disclaimer: I'm an absolute Julia Newbie, so even though I checked the results were identical for our case, there may be something I'm missing.
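For reference, the underlying pitfall can be reproduced in isolation with a toy example (independent of KomaMRI; requires the BenchmarkTools package): reduce(vcat, ...) over a lazy generator falls back to repeated pairwise concatenation, while collecting into a vector first hits the optimized method for a vector of arrays.

using BenchmarkTools

vecs() = (ones(100) for _ in 1:2_000)   # lazy generator of small vectors
@btime reduce(vcat, vecs());            # slow path: repeated reallocation (quadratic copying)
@btime reduce(vcat, collect(vecs()));   # fast path: a single concatenation of the collected vector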

cncastillo commented 1 month ago

Oh god! That could make sense. I believe a similar problem was detected by @gabuzi and fixed in https://github.com/JuliaHealth/KomaMRI.jl/pull/220, but in that case, the problem was a little different.

Could you check if

 @btime t_rf3 = reduce(vcat, [T0[i] .+ times(seq.RF[1,i], :A) for i in range])

also fixes the problem?

We can push a fix for that very easily!

JanWP commented 1 month ago

IIRC I tried exactly that line too, and the result was off. My interpretation was that the shape of [...] and collect(...) seemed to be different, but I didn't look closer. I'll try to find some time to check again later today.

Tooine commented 1 month ago

Hey,

thank you all for your consideration of the issue, and thank you @JanWP for probably identifying its cause!

I didn't have much time to test the proposed solutions today, and I will be off for a few weeks, so don't worry if the problem is not solved immediately.

cncastillo commented 1 month ago

I believe that it doesn't hurt to push the reduce(vcat, collect(...)) fix immediately.

We run automated benchmarking and will notice if it generates any performance/accuracy regressions. I will push a branch with the fix using KomaMRICore v0.9.0-DEV (which has the benchmark setup), and if everything looks good, backport it to KomaMRICore v0.8.4.

cncastillo commented 1 month ago

collect(...) and [...] are lowered to the same:

julia> Meta.@lower [i for i in 1:10]
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = 1:10
│   %2 = Base.Generator(Base.identity, %1)
│   %3 = Base.collect(%2)
└──      return %3
))))

julia> Meta.@lower collect(i for i in 1:10)
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = 1:10
│   %2 = Base.Generator(Base.identity, %1)
│   %3 = collect(%2)
└──      return %3
))))

Also

julia> @btime reduce(vcat, collect(ones(100) for i in 1:10_000));
  1.622 ms (10006 allocations: 16.33 MiB)

julia> @btime reduce(vcat, [ones(100) for i in 1:10_000]);
  1.622 ms (10006 allocations: 16.33 MiB)

I will now use [...], which looks a little cleaner.

cncastillo commented 1 month ago

Ok! The benchmarks are in! We can see a clear speed increase and reduced memory allocation size.

https://juliahealth.org/KomaMRI.jl/benchmarks/?option=memory

I just realized the changes are in KomaMRIBase, so I will tag a new non-breaking release for that sub-package (KomaMRIBase v0.8.5).

JanWP commented 1 month ago

I'm a bit late to the party, but I can confirm that this fixes our issue. I put a @time on the call to discretize, and with the MRF sequence by @Tooine I'm using for testing, the difference is night and day:

Before commit e7bfc1f7edc46fb73a8c6c03798692b31d41f6e8, I get: 1827.977852 seconds (6.93 M allocations: 3.056 TiB, 18.10% gc time)

After the commit: 1.783379 seconds (6.22 M allocations: 1.614 GiB, 3.86% gc time)

Quite the improvement! Thanks for fixing this so quickly!