Closed SebastianM-C closed 5 years ago
I wrote another benchmark comparing the integrator interface with the `solve` one and, assuming I didn't make any mistakes, I have shown that the problem is also present with the integrator interface. See: https://github.com/SebastianM-C/Benchmarks/blob/e808d3d6f3e049627f0fa117795fa1cf4aff1823/integ_vs_solve.ipynb
The relevant part from the above:
@btime solve($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false);
@btime solve($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false);
@btime solve($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false);
# 789.911 μs (38974 allocations: 1.21 MiB)
# 355.283 μs (17439 allocations: 463.05 KiB)
# 1.438 ms (108191 allocations: 2.76 MiB)
function integ_benchmark(prob; args...)
    integ = init(prob; args...)
    while integ.t < prob.tspan[2]
        step!(integ)
    end
end
@btime integ_benchmark($prob1, alg=Vern9(), abstol=1e-14, reltol=1e-14)
@btime integ_benchmark($prob2, alg=DPRKN12(), abstol=1e-14, reltol=1e-14)
@btime integ_benchmark($prob2, alg=KahanLi8(), dt=1e-2, maxiters=1e10)
# 897.680 μs (40428 allocations: 1.55 MiB)
# 379.758 μs (17909 allocations: 536.55 KiB)
# 1.563 ms (111196 allocations: 3.06 MiB)
tspan = (0., 100.)
prob1 = ODEProblem(ż, z0, tspan, p)
prob2 = DynamicalODEProblem(ṗ, q̇, p0, q0, tspan, p)
@btime solve($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false);
@btime solve($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false);
@btime solve($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false);
# 7.940 ms (386014 allocations: 11.80 MiB)
# 3.480 ms (173907 allocations: 4.43 MiB)
# 17.513 ms (1080083 allocations: 27.47 MiB)
@btime integ_benchmark($prob1, alg=Vern9(), abstol=1e-14, reltol=1e-14)
@btime integ_benchmark($prob2, alg=DPRKN12(), abstol=1e-14, reltol=1e-14)
@btime integ_benchmark($prob2, alg=KahanLi8(), dt=1e-2, maxiters=1e10)
# 8.980 ms (400491 allocations: 15.50 MiB)
# 3.749 ms (178553 allocations: 4.83 MiB)
# 18.589 ms (1110082 allocations: 29.61 MiB)
Note that these benchmarks were done on a different (slower) machine than the first ones.
julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
I am not sure why the benchmarks for the integrator interface give slower timings than the ones for the solve interface. I hope that the benchmark function did not introduce any (grave) performance problems.
Edit: updated the link to point to the relevant file version. I tried some modifications (see master), but I am not sure if I got it right.
I opened Julia with `--track-allocation=user --inline=no` to analyze this issue; however, the only allocation that I found was produced by the derivative function itself.
julia> @timev ż(z0, p, 1.);
0.000006 seconds (5 allocations: 208 bytes)
elapsed time (ns): 6038
bytes allocated: 208
pool allocs: 5
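Note that with `--inline=no` even an otherwise non-allocating `SVector` function will report some allocations, since the static-array operations cannot be inlined away, so a small count here is expected. Under normal inlining the same call can be checked with BenchmarkTools (a sketch, not from the original thread):

```julia
using BenchmarkTools, StaticArrays

# With inlining enabled, an out-of-place SVector derivative is expected to
# report zero allocations when the arguments are interpolated into @btime:
@btime ż($z0, $p, 1.0)
```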
If changing the timespan doesn't change the allocations in `init` when doing `save_everystep=false`, then I don't know if there's anything we can point to? We should check this.
I added another set of benchmarks with the `init` and `step!` phases separated. I am not sure whether I made any mistakes, since I get some strange results.
See https://github.com/SebastianM-C/Benchmarks/blob/694db1bef9340c12b292714691ce74d8425dac42/integ_vs_solve.ipynb
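The `step_integ` helper is not shown in this thread; a minimal sketch consistent with the calls below (my assumption, not necessarily the notebook's exact code) would be:

```julia
# Hypothetical helper: advance an already-initialized integrator until its
# time reaches tf, so that init and stepping can be benchmarked separately.
function step_integ(integ, tf)
    while integ.t < tf
        step!(integ)
    end
end
```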
With `tspan = (0., 10.)` I get
@btime init($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false)
@btime step_integ(integ1, $tspan[2]) setup=(integ1=init($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false))
@btime init($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false)
@btime step_integ(integ2, $tspan[2]) setup=(integ2=init($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false))
@btime init($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false)
@btime step_integ(integ3, $tspan[2]) setup=(integ3=init($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false))
# 4.335 μs (88 allocations: 18.45 KiB)
# 3.354 μs (223 allocations: 6.98 KiB)
# 4.258 μs (95 allocations: 11.13 KiB)
# 1.114 μs (69 allocations: 1.80 KiB)
# 2.825 μs (79 allocations: 6.03 KiB)
# 120.168 μs (10810 allocations: 281.53 KiB)
and when I increase it to `tspan = (0., 100.)`
@btime init($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false)
@btime step_integ(integ1, $tspan[2]) setup=(integ1=init($prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false))
@btime init($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false)
@btime step_integ(integ2, $tspan[2]) setup=(integ2=init($prob2, DPRKN12(), abstol=1e-14, reltol=1e-14, save_everystep=false))
@btime init($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false)
@btime step_integ(integ3, $tspan[2]) setup=(integ3=init($prob2, KahanLi8(), dt=1e-2, maxiters=1e10, save_everystep=false))
# 4.384 μs (86 allocations: 18.34 KiB)
# 5.858 ms (385920 allocations: 11.78 MiB)
# 4.300 μs (94 allocations: 11.03 KiB)
# 308.408 μs (19312 allocations: 502.92 KiB)
# 2.825 μs (78 allocations: 5.94 KiB)
# 14.674 ms (1080000 allocations: 27.47 MiB)
What I find strange is that in the first case the timings are suspiciously small compared with the rest of the benchmarks, while in the second case they explode for `Vern9`.
Okay, that rules out the possibility of something wrong in init. We'd need to check the stepping and the derivative function. There is a possibility that stepping has a problem like https://github.com/JuliaLang/julia/issues/22255
using OrdinaryDiffEq
using StaticArrays
using BenchmarkTools
using Profile
@inline function ż(z, p, t)
    @inbounds begin
        @assert z isa SVector
        A, B, D = p
        p₀, p₂ = z[1], z[2]
        q₀, q₂ = z[3], z[4]
        return SVector{4}(
            -A * q₀ - 3 * B / √2 * (q₂^2 - q₀^2) - D * q₀ * (q₀^2 + q₂^2),
            -q₂ * (A + 3 * √2 * B * q₀ + D * (q₀^2 + q₂^2)),
            A * p₀,
            A * p₂
        )
    end
end
q0 = SVector{2}([0.0, -4.363920590485035])
p0 = SVector{2}([10.923918825236079, -5.393598858645495])
z0 = vcat(p0, q0)
p = (A=1,B=0.55,D=0.4)
tspan = (0., 1000.)
prob1 = ODEProblem(ż, z0, tspan, p)
solve(prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false);
@timev solve(prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false);
Profile.clear_malloc_data()
solve(prob1, Vern9(), abstol=1e-14, reltol=1e-14, save_everystep=false);
exit()
When I set `tspan = (0., 100.)` I got the allocation info
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 276)
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 277)
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 281)
Coverage.MallocInfo(32, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 278)
Coverage.MallocInfo(80, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 174)
Coverage.MallocInfo(80, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 404)
Coverage.MallocInfo(96, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 378)
Coverage.MallocInfo(240, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 147)
Coverage.MallocInfo(320, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 116)
Coverage.MallocInfo(336, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 178)
Coverage.MallocInfo(400, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 6)
Coverage.MallocInfo(480, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/integrators/type.jl.mem", 2)
Coverage.MallocInfo(624, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/perform_step/verner_rk_perform_step.jl.mem", 640)
Coverage.MallocInfo(848, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 62)
Coverage.MallocInfo(3264, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/alg_utils.jl.mem", 361)
Coverage.MallocInfo(3328, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/interp_func.jl.mem", 4)
Coverage.MallocInfo(7232, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/integrators/type.jl.mem", 130)
and with `tspan = (0., 1000.)` I got
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 276)
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 277)
Coverage.MallocInfo(16, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 281)
Coverage.MallocInfo(32, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 278)
Coverage.MallocInfo(80, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 174)
Coverage.MallocInfo(80, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 404)
Coverage.MallocInfo(96, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 378)
Coverage.MallocInfo(240, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 147)
Coverage.MallocInfo(320, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 116)
Coverage.MallocInfo(336, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 178)
Coverage.MallocInfo(400, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 6)
Coverage.MallocInfo(480, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/integrators/type.jl.mem", 2)
Coverage.MallocInfo(624, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/perform_step/verner_rk_perform_step.jl.mem", 640)
Coverage.MallocInfo(848, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/solve.jl.mem", 62)
Coverage.MallocInfo(3264, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/alg_utils.jl.mem", 361)
Coverage.MallocInfo(3328, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/interp_func.jl.mem", 4)
Coverage.MallocInfo(7232, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/integrators/type.jl.mem", 130)
They are exactly identical.
So, there isn't anything extra allocated with a longer time span within the OrdinaryDiffEq.jl package. Closing.
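For anyone reproducing this: the `Coverage.MallocInfo` lists above can be generated with Coverage.jl's `analyze_malloc` after running the script under `--track-allocation=user` (a sketch; the path is taken from the listings above):

```julia
using Coverage

# Summarize the .mem files produced by --track-allocation=user; the result
# is a Vector{MallocInfo} sorted by allocated bytes per line.
mallocs = analyze_malloc(expanduser("~/.julia/dev/OrdinaryDiffEq/src"))
foreach(println, mallocs)
```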
How are the allocations in `Coverage.MallocInfo(624, "/home/scheme/.julia/dev/OrdinaryDiffEq/src/perform_step/verner_rk_perform_step.jl.mem", 640)` not dependent on the number of steps?
@YingboMa, if there are allocations in `perform_step!`, how are they not dependent on the number of steps?
I started investigating this again and I observed something quite strange. For a simpler user function (lorenz) the problem disappears. MWE:
using OrdinaryDiffEq
using BenchmarkTools
using StaticArrays
function lorenz(u,p,t)
    du1 = 10.0*(u[2]-u[1])
    du2 = u[1]*(28.0-u[3]) - u[2]
    du3 = u[1]*u[2] - (8/3)*u[3]
    return SVector{3}(du1,du2,du3)
end
@inbounds @inline function ż(z, p, t)
    A, B, D = p
    p₀, p₂ = z[SVector{2}(1:2)]
    q₀, q₂ = z[SVector{2}(3:4)]
    return SVector{4}(
        -A * q₀ - 3 * B / √2 * (q₂^2 - q₀^2) - D * q₀ * (q₀^2 + q₂^2),
        -q₂ * (A + 3 * √2 * B * q₀ + D * (q₀^2 + q₂^2)),
        A * p₀,
        A * p₂
    )
end
u0 = @SVector [1.0,0.0,0.0]
u = vcat(u0,u0)
p = (A=1,B=0.55,D=0.4)
q0 = SVector{2}([0.0, -4])
p0 = SVector{2}([10, -5])
z0 = vcat(p0, q0)
tspan1 = (0.0,10.0)
prob1_ok = ODEProblem(lorenz,u0,tspan1)
prob1_notok = ODEProblem(ż,z0,tspan1,p)
@btime solve($prob1_ok, Vern9(), save_everystep=false) # 49.999 μs (40 allocations: 9.89 KiB)
@btime solve($prob1_notok, Vern9(), save_everystep=false) # 58.199 μs (3170 allocations: 107.92 KiB)
tspan2 = (0.0,100.0)
prob2_ok = ODEProblem(lorenz,u0,tspan2)
prob2_notok = ODEProblem(ż,z0,tspan2,p)
@btime solve($prob2_ok, Vern9(), save_everystep=false) # 543.900 μs (40 allocations: 9.89 KiB)
@btime solve($prob2_notok, Vern9(), save_everystep=false) # 450.700 μs (25810 allocations: 815.42 KiB)
I tried a couple of other functions and it looks like it's somehow related to how complicated the user function is (number of operations?). The Henon system is not sufficient to trigger the problem
henon(z, p, t) = SVector(
    -z[3] * (1 + 2z[4]),
    -z[4] - (z[3]^2 - z[4]^2),
    z[1],
    z[2]
)
but if I extend lorenz or henon by writing the equations twice, I can reproduce the problem. For extending I used something like this
function extend2(f, ::SVector{M}) where M
    @inbounds function eom(u, p, t)
        idx1 = SVector{M}(1:M)
        idx2 = SVector{M}(M+1:2M)
        vcat(f(u[idx1], p, t),
             f(u[idx2], p, t))
    end
end
and I don't think it introduces problems.
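For example, using the definitions above (this usage is my assumption, not taken from the notebook):

```julia
# Doubled Henon system: z0 is an SVector{4}, so M = 4 and the
# extended state has 8 components.
henon2 = extend2(henon, z0)
u8 = vcat(z0, z0)
prob_h2 = ODEProblem(henon2, u8, (0.0, 100.0))
```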
Indeed, extending doesn't impact allocations. Using
function lorenz2(u,p,t)
    du1 = 10.0*(u[2]-u[1])
    du2 = u[1]*(28.0-u[3]) - u[2]
    du3 = u[1]*u[2] - (8/3)*u[3]
    du4 = 10.0*(u[2+3]-u[1+3])
    du5 = u[1+3]*(28.0-u[3+3]) - u[2+3]
    du6 = u[1+3]*u[2+3] - (8/3)*u[3+3]
    return SVector{6}(du1,du2,du3,du4,du5,du6)
end
yields the same number of allocations (and reproduces the problem).
Since the `henon` system doesn't present the problem, but the one for `ż` does, I am inclined to say that the problem is not related to the number of equations (or the length of the `SVector`), but to something like the number of operations.
I tried `@inline` and `@noinline` with `lorenz2` and it doesn't influence allocations
u0 = @SVector [1.0,0.0,0.0]
u = vcat(u0,u0)
tspan1 = (0.0,10.0)
prob1_2lm = ODEProblem(lorenz2,u,tspan1)
@btime solve($prob1_2lm, Vern9(), save_everystep=false)
# @inline 49.900 μs (200 allocations: 16.69 KiB)
# @noinline 53.000 μs (200 allocations: 16.69 KiB)
tspan2 = (0.0,100.0)
prob2_2lm = ODEProblem(lorenz2,u,tspan2)
@btime solve($prob2_2lm, Vern9(), save_everystep=false)
# @inline 536.101 μs (1796 allocations: 79.03 KiB)
# @noinline 571.701 μs (1796 allocations: 79.03 KiB)
@YingboMa I updated your script above to
using StaticArrays
using Profile
using BenchmarkTools
using OrdinaryDiffEq
function lorenz(u,p,t)
    du1 = 10.0*(u[2]-u[1])
    du2 = u[1]*(28.0-u[3]) - u[2]
    du3 = u[1]*u[2] - (8/3)*u[3]
    return SVector{3}(du1,du2,du3)
end
function lorenz2(u,p,t)
    du1 = 10.0*(u[2]-u[1])
    du2 = u[1]*(28.0-u[3]) - u[2]
    du3 = u[1]*u[2] - (8/3)*u[3]
    du4 = 10.0*(u[2+3]-u[1+3])
    du5 = u[1+3]*(28.0-u[3+3]) - u[2+3]
    du6 = u[1+3]*u[2+3] - (8/3)*u[3+3]
    return SVector{6}(du1,du2,du3,du4,du5,du6)
end
const u0 = @SVector [1.0,0.0,0.0]
const u = vcat(u0,u0)
const tspan = (0.0,10.0)
# const prob = ODEProblem(lorenz,u0,tspan)
const prob = ODEProblem(lorenz2,u,tspan)
@time solve(prob, Tsit5(), save_everystep=false);
@time solve(prob, Tsit5(), save_everystep=false);
Profile.clear_malloc_data()
@timev solve(prob, Tsit5(), save_everystep=false);
exit()
and tried to debug with `--track-allocation=user`, but it seems quite tricky due to inlining.
I wrote a very detailed log here: https://gist.github.com/SebastianM-C/578e46a7ccff39dabd4aeeac964c099c
The linear increase of allocations is worrisome. Where are all of the other allocs? That gist only displays 144 out of roughly 700.
I used the Coverage script and checked the top 5 memory allocation spots and I think I found something. The error estimator has a linear increase in allocations.
I was confused at first that in `solve!` the `perform_step!` function showed 0 allocations, but actually in the `.mem` file you can see them.
I updated my findings here: https://github.com/SebastianM-C/Benchmarks/blob/7687fcddde85f908d238e9347b1e4c470e4ef3f1/debug_log_518.md
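As a cross-check of the `.mem` files themselves, the nonzero lines can be listed with plain Julia (a sketch; the path is illustrative):

```julia
# Walk the source tree and print every .mem line whose allocation count
# (the leading number on each line) is nonzero.
memdir = expanduser("~/.julia/dev/OrdinaryDiffEq/src")
for (root, _, files) in walkdir(memdir)
    for f in filter(x -> endswith(x, ".mem"), files)
        for (i, line) in enumerate(eachline(joinpath(root, f)))
            m = match(r"^\s*(\d+)", line)
            m !== nothing && parse(Int, m.captures[1]) > 0 &&
                println(f, ":", i, ": ", line)
        end
    end
end
```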
This contradicts @YingboMa's earlier post, so I would appreciate it if someone could try to replicate my findings, to double-check that I didn't miss something.
TLDR:
with tspan = (0.0,10.0)
- @muladd function perform_step!(integrator, cache::Tsit5ConstantCache, repeat_step=false)
0 @unpack t,dt,uprev,u,f,p = integrator
0 @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache
0 k1 = integrator.fsalfirst
0 a = dt*a21
0 k2 = f(uprev+a*k1, p, t+c1*dt)
0 k3 = f(uprev+dt*(a31*k1+a32*k2), p, t+c2*dt)
0 k4 = f(uprev+dt*(a41*k1+a42*k2+a43*k3), p, t+c3*dt)
0 k5 = f(uprev+dt*(a51*k1+a52*k2+a53*k3+a54*k4), p, t+c4*dt)
0 g6 = uprev+dt*(a61*k1+a62*k2+a63*k3+a64*k4+a65*k5)
0 k6 = f(g6, p, t+dt)
0 u = uprev+dt*(a71*k1+a72*k2+a73*k3+a74*k4+a75*k5+a76*k6)
0 integrator.fsallast = f(u, p, t+dt); k7 = integrator.fsallast
0 integrator.destats.nf += 6
- if typeof(integrator.alg) <: CompositeAlgorithm
- g7 = u
- # Hairer II, page 22
- integrator.eigen_est = integrator.opts.internalnorm(k7 - k6,t)/integrator.opts.internalnorm(g7 - g6,t)
- end
0 if integrator.opts.adaptive
0 utilde = dt*(btilde1*k1 + btilde2*k2 + btilde3*k3 + btilde4*k4 + btilde5*k5 + btilde6*k6 + btilde7*k7)
0 atmp = calculate_residuals(utilde, uprev, u, integrator.opts.abstol, integrator.opts.reltol,integrator.opts.internalnorm,t)
1648 integrator.EEst = integrator.opts.internalnorm(atmp,t)
- end
0 integrator.k[1] = k1
0 integrator.k[2] = k2
0 integrator.k[3] = k3
0 integrator.k[4] = k4
0 integrator.k[5] = k5
0 integrator.k[6] = k6
0 integrator.k[7] = k7
0 integrator.u = u
- end
with tspan = (0.0,20.0)
- @muladd function perform_step!(integrator, cache::Tsit5ConstantCache, repeat_step=false)
0 @unpack t,dt,uprev,u,f,p = integrator
0 @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache
0 k1 = integrator.fsalfirst
0 a = dt*a21
0 k2 = f(uprev+a*k1, p, t+c1*dt)
0 k3 = f(uprev+dt*(a31*k1+a32*k2), p, t+c2*dt)
0 k4 = f(uprev+dt*(a41*k1+a42*k2+a43*k3), p, t+c3*dt)
0 k5 = f(uprev+dt*(a51*k1+a52*k2+a53*k3+a54*k4), p, t+c4*dt)
0 g6 = uprev+dt*(a61*k1+a62*k2+a63*k3+a64*k4+a65*k5)
0 k6 = f(g6, p, t+dt)
0 u = uprev+dt*(a71*k1+a72*k2+a73*k3+a74*k4+a75*k5+a76*k6)
0 integrator.fsallast = f(u, p, t+dt); k7 = integrator.fsallast
0 integrator.destats.nf += 6
- if typeof(integrator.alg) <: CompositeAlgorithm
- g7 = u
- # Hairer II, page 22
- integrator.eigen_est = integrator.opts.internalnorm(k7 - k6,t)/integrator.opts.internalnorm(g7 - g6,t)
- end
0 if integrator.opts.adaptive
0 utilde = dt*(btilde1*k1 + btilde2*k2 + btilde3*k3 + btilde4*k4 + btilde5*k5 + btilde6*k6 + btilde7*k7)
0 atmp = calculate_residuals(utilde, uprev, u, integrator.opts.abstol, integrator.opts.reltol,integrator.opts.internalnorm,t)
3616 integrator.EEst = integrator.opts.internalnorm(atmp,t)
- end
0 integrator.k[1] = k1
0 integrator.k[2] = k2
0 integrator.k[3] = k3
0 integrator.k[4] = k4
0 integrator.k[5] = k5
0 integrator.k[6] = k6
0 integrator.k[7] = k7
0 integrator.u = u
- end
Looks like it's fixed by https://github.com/JuliaDiffEq/DiffEqBase.jl/pull/348
I noticed that with out-of-place integration with `SVector`s, the allocations increase with the integration time when using `save_everystep=false`. MWE:

and increasing the integration time:

I also included the timings for the full solution for comparison. (See http://nbviewer.jupyter.org/github/SebastianM-C/Benchmarks/blob/master/parallel.ipynb?flush_cache=true for more details.)