Closed: cortner closed this 1 year ago
@wcwitt -- I had to write a manual "scheduler" to assemble the lsq blocks in the right order. This becomes an issue when a dataset has structures of very different sizes. I don't know how pmap works, but with @threads this strange fix was needed. I wonder whether the bad performance in the distributed assembly for me is something similar.
@tjjarvinen --- my implementation in this PR sometimes hangs at the very end of the LSQ system assembly. Would you be willing to take a quick look and see if something jumps out at you? When I CTRL-C interrupt it, then it seems it got stuck waiting for the lock to become free.
(also is there a more elegant way to achieve this?)
Follow-up: I think I may have found the bug. Testing now ... I'll let you know if I can't fix it after all.
I tried this MT implementation and I find it much more pleasant to use than the current `ACEfit.assemble`. It does not have the overhead in `Pkg.instantiate` or the OOM problem. Could we have this as an experimental feature and merge it even if it is not user-facing yet?
CC @cortner @wcwitt
I'm ok with merging as is, but alternatively also to first integrate it into the user-interface.
@wcwitt wasn't too keen on this in the first place, but I think it is sufficiently useful for us that we can ask him again to consider it.
In the near future we should merge the distributed and MT assembly into a single framework, as discussed in other issues.
Letting you know that I've been working on this today, partly in response to @CheukHinHoJerry's experience.
> an overhead in `Pkg.instantiate`
I don't understand how/why this would happen - can you elaborate?
> OOM problem
Recently, I have been seeing this on large datasets with both the distributed and the multithreaded assembly. Not sure what changed, but it feels like a memory leak. Adding `GC.gc()` after assembling each block alleviates things somewhat, although I don't know what it does to the performance.
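As a minimal sketch of that workaround (the block list and the assembly step here are stand-ins, not the actual ACEfit internals):

```julia
assembled = Int[]              # stand-in accumulator for assembled blocks
for b in 1:4                   # stand-in for the list of LSQ blocks
    push!(assembled, b)        # ... assemble block `b` into the system here ...
    GC.gc()                    # force a full collection after each block
end
```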
> I don't understand how/why this would happen - can you elaborate?
when you work interactively as we do, you first add the workers, then have to instantiate the environment on each worker. Then you might as well cycle to the next coffee shop and take a break before you can continue since that operation can take a long time.
> Recently, I have been seeing this on large datasets with both the distributed and multithreaded. Not sure what changed, but it feels like a memory leak.
If GC helps, then I don't think it can be a memory leak? It is possible that during multi-threading the GC doesn't turn itself on as often. I haven't done any reading on this, but that has been my impression during benchmarking. Maybe I'm completely wrong.
@tjjarvinen can you please comment? Also in general, what are typical reasons to get OOMs in Julia? How is this even possible?
The main reason for OOM is that you allocate too much data. GC helps a little in these cases, because it clears out some unneeded data.
To me this issue sounds like Julia is starting to use swap, which results in a slowdown. Once the swap is used up too, it will cause an OOM.
Have you looked at how much memory Julia is using (plus memory per process) and what the swap usage is?
Also, what is the estimated memory usage for the assembly?
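For what it's worth, a quick way to check the first of those numbers from within Julia (these are standard `Base.Sys` functions):

```julia
# total and free physical memory, converted to GiB
total_gib = Sys.total_memory() / 2^30
free_gib  = Sys.free_memory() / 2^30
println("total = $(round(total_gib, digits=1)) GiB, free = $(round(free_gib, digits=1)) GiB")
```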
> when you work interactively as we do, you first add the workers, then have to instantiate the environment on each worker. Then you might as well cycle to the next coffee shop and take a break before you can continue since that operation can take a long time
I believe you, but I still don't understand. This rarely takes more than a few seconds for me interactively (and the `exeflags` thing shouldn't be necessary since 1.9) - when/why is an instantiate required?
```julia
using Distributed
using ACE1pack
addprocs(50, exeflags="--project=$(Base.active_project())")
@everywhere using ACE1pack
```
> `addprocs(50, exeflags="--project=$(Base.active_project())")`
this is news to me, I was unaware of this.
Maybe we should just test this again more carefully. In principle, if firing up 50 workers is more or less instantaneous, then we can discuss dropping the MT again.
> when/why is an instantiate required?
Instantiate is not needed if you are using the same node, or a Julia install that has access to the same local data (the `.julia` folder). If you have different nodes that do not have access to the same storage space, then you need to instantiate on each worker.
That wasn't my experience. I don't know why, but simply `using` did not work for me. Maybe things have changed in the past few versions of Julia and I missed those changes.
So maybe one of you can just put together a script for us so we can try using multiple workers interactively? We will try it, and if it works fine we continue from there?
You've convinced me the MT is worthwhile ... as soon as I'm done experimenting with it we can merge. Otherwise, at this point I'm just making sure I understand your workflow - like whether your workers are somehow on another machine.
More importantly, I now understand the timing of this OOM stuff. I used to have this line

```julia
GC.gc() # not sure if this is necessary?
```

which I removed recently during some rearranging (54b7b2ed0a5ccb66820a799e43353f5979d68cfc). I'll put it back. In principle, I don't think it should be necessary, but the forums are full of people complaining about parallel-related memory issues, such that it's a little hard to figure out what is current/relevant.
I looked at the code in this PR and

```julia
while next <= length(packets)
    # retrieve the next packet
    if next > length(packets)
        break
    end
    lock(_lock)
    cur = next
    next += 1
    unlock(_lock)
    if cur > length(packets)
        break
    end
    p = packets[cur]
```
This part is attempting to reimplement Channels. Rather than trying to rediscover Channels, just use the existing ones. It makes the code better in every possible way.
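For illustration, here is a minimal sketch of the same work distribution built on a `Channel` instead of a lock-protected counter (the packet list and the processing step are stand-ins):

```julia
using Base.Threads

packets = collect(1:20)                  # stand-in for the real packet list
ch = Channel{Int}(length(packets))
foreach(i -> put!(ch, i), eachindex(packets))
close(ch)                                # iteration below stops once drained

results = zeros(Int, length(packets))
@sync for _ in 1:nthreads()
    @spawn for cur in ch                 # take! is synchronized internally
        results[cur] = 2 * packets[cur]  # stand-in for processing a packet
    end
end
```

The channel handles all the locking and the termination condition, so the manual index bookkeeping disappears entirely.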
On the MT, I'm close to something I'm happy with (influenced by @tjjarvinen's recommendation of Folds.jl). Can we just pause that discussion until I'm done, and then you can critique it from there?
> that wasn't my experience. I don't know why but simply `using` did not work for me. Maybe things have changed in the past few versions of Julia and I missed those changes.
When you spawn new processes, they get the project dir from the host. But every process needs to load all the packages separately, so you need to start with

```julia
using Distributed
addprocs(20)
@everywhere using List_of_packages
```
> So maybe one of you can just put together a script for us so we can try using multiple workers interactively.
Our docs are actually relatively decent here - the "script" approach should work interactively. I don't say this to be annoying - if they aren't sufficient I will improve them. https://acesuit.github.io/ACE1pack.jl/dev/gettingstarted/parallel-fitting/
> Our docs are actually relatively decent here ...
I've never seen this before. My mistake, sorry.
@tjjarvinen -- I wrote the part above that you quote. I agree with you, of course. But remember this was a temporary hack, and for me it was faster to write this than to learn about Channels. Let's see what Cheuk comes up with and then discuss.
@wcwitt --- just to confirm that with your instructions above it becomes more convenient to assemble the LSQ system distributed instead of multi-threaded. The barrier is still slightly higher, but not nearly as bad as it used to be. So from my end, we can make this low priority.
(also I can confirm that putting the `GC.gc()` back into the distributed assembly prevented some OOMs for me just now. I tried with versions of the package before and after...)
Thanks - I'm glad, but we're deep enough into this now that I'm going to try to finish it off.
Would someone please look at -- or possibly try -- this example? It requires the `wcw/gc-test` branch of ACEfit, but I have merged this branch into that one. The Si PRX data are available here, and I started Julia with `julia -p 9 -t 10`. See this diff for the relevant changes to the assembly.
```julia
using Distributed
@everywhere using ACE1pack

dataset = "gp_iter6_sparse9k.xml.xyz"
data_keys = [:energy_key => "dft_energy",
             :force_key => "dft_force",
             :virial_key => "dft_virial"]
model = acemodel(elements = [:Si],
                 Eref = [:Si => -158.54496821],
                 rcut = 5.5,
                 order = 4,
                 totaldegree = 16)
data = [ AtomsData(at; data_keys..., v_ref = model.Vref)
         for at in read_extxyz(dataset) ]

# distributed
ACEfit.assemble(data, model.basis)  # to force compilation
time_mp_gc = @elapsed ACEfit.assemble(data, model.basis; do_gc = true)
time_mp_no = @elapsed ACEfit.assemble(data, model.basis; do_gc = false)

# threaded
ACEfit.mt_assemble(data, model.basis)  # to force compilation
time_mt_gc = @elapsed ACEfit.mt_assemble(data, model.basis; do_gc = true)
time_mt_no = @elapsed ACEfit.mt_assemble(data, model.basis; do_gc = false)

@show time_mp_gc, time_mp_no
@show time_mt_gc, time_mt_no
```
For me, the results after starting Julia with `julia -p 9 -t 10` are

```
(time_mp_gc, time_mp_no) = (283.623841182, 255.898352052)
(time_mt_gc, time_mt_no) = (848.516291594, 253.484540468)
```

indicating a huge slowdown when I garbage collect from each thread.
interesting - your script managed to crash my laptop all the way to reboot ...
Independent of my problem - can you look at Julia 1.10? I read somewhere that it has a multi-threaded GC. I wonder whether that solves your problem.
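If it helps with testing: 1.10 runs the GC on multiple threads by default (tunable via the `--gcthreads` command-line flag), and the count can be queried at runtime:

```julia
# Threads.ngcthreads() is only available from Julia 1.10 onward
if VERSION >= v"1.10"
    @show Threads.ngcthreads()
end
```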
Here are my times on Julia 1.9:

```
(time_mp_gc, time_mp_no) = (275.686350292, 279.586481542)
(time_mt_gc, time_mt_no) = (535.976733791, 236.405905)
```

and on Julia 1.10:

```
(time_mp_gc, time_mp_no) = (262.532686458, 220.518536)
(time_mt_gc, time_mt_no) = (372.88618475, 386.81745025)
```
So interestingly, the distributed version is hands-down the faster one. The GC could be kept on by default, but it would be good to have an option to turn it off when the user wants it?
I think all things considered - it's maybe good to keep distributed the default. Especially as we move towards multi-threaded acceleration of model evaluation.
Thank you very much for taking the time to do this. Naturally, I like the distributed and I would be happy for it to be default, but I remain unsettled by this: https://github.com/ACEsuit/ACEfit.jl/issues/49#issuecomment-1611453321.
Your observation that the threading performs better for very small configs seems correct, and I won't be satisfied with default-distributed until I've managed to resolve that part.
Did you rerun the test above with the latest ACEfit / ACE1pack on Julia 1.9 and 1.10?
Another thing we can do is feed training structures to processes in batches.
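If we go that route, note that `pmap` already supports this via its `batch_size` keyword; a minimal sketch with a stand-in workload (no workers added here, so it runs on the master process):

```julia
using Distributed
# addprocs(4)                  # uncomment to actually distribute the work

@everywhere process(x) = x^2   # stand-in for assembling one structure

# ship work to the workers in batches of 10 instead of one item at a time
results = pmap(process, 1:100; batch_size = 10)
```

Batching amortizes the per-task communication overhead, which matters most when the individual structures are small.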
@wcwitt -- is this now obsolete - can we close it?
I'd rather keep it open for a bit longer, if that's okay.
I'm fine to close this now, if you still want to. Linking #49 for reference.
So is the MT assembly now in the package, or was it removed again because of the poor performance?
To answer your question, it's not in the package, but it does live on in a branch. Eventually I will solve the performance dilemma.
For now this is just an experiment so we can see what it might look like; the interface probably needs to change.