Closed: cortner closed this 1 year ago
@wcwitt -- I had to write a manual "scheduler" to assemble the lsq blocks in the right order. This becomes an issue when a dataset has structures of very different sizes. I don't know how pmap works, but with @threads this strange fix was needed. I wonder whether the bad performance in the distributed assembly for me is something similar.
@tjjarvinen --- my implementation in this PR sometimes hangs at the very end of the LSQ system assembly. Would you be willing to take a quick look and see if something jumps out at you? When I CTRL-C interrupt it, then it seems it got stuck waiting for the lock to become free.
(also is there a more elegant way to achieve this?)
Follow-up: I think I may have found the bug. Testing now ... I'll let you know if I can't fix it after all.
I tried this MT implementation and I find it much more pleasant to use than the current `ACEfit.assemble`. It does not have the overhead in `Pkg.instantiate` or the OOM problem. Could we have this as an experimental feature and merge it even if it is not user-facing yet?
CC @cortner @wcwitt
I'm ok with merging as is, but alternatively also to first integrate it into the user-interface.
@wcwitt wasn't too keen on this in the first place, but I think it is sufficiently useful for us that we can ask him again to consider it.
In the near future we should merge the distributed and MT assembly into a single framework, as discussed in other issues.
Letting you know that I've been working on this today, partly in response to @CheukHinHoJerry's experience.
> an overhead in `Pkg.instantiate`
I don't understand how/why this would happen - can you elaborate?
> OOM problem
Recently, I have been seeing this on large datasets with both the distributed and the multithreaded assembly. Not sure what changed, but it feels like a memory leak. Adding `GC.gc()` after assembling each block alleviates things somewhat, although I don't know what it does to the performance.
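As a minimal sketch of that workaround (the block list and the assembly step here are stand-ins, not the actual ACEfit internals):

```julia
assembled = Int[]              # stand-in accumulator for assembled blocks
for b in 1:4                   # stand-in for the list of LSQ blocks
    push!(assembled, b)        # ... assemble block `b` into the system here ...
    GC.gc()                    # force a full collection after each block
end
```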
> I don't understand how/why this would happen - can you elaborate?
when you work interactively as we do, you first add the workers, then have to instantiate the environment on each worker. Then you might as well cycle to the next coffee shop and take a break before you can continue since that operation can take a long time.
> Recently, I have been seeing this on large datasets with both the distributed and multithreaded. Not sure what changed, but it feels like a memory leak.
If GC helps, then I don't think it can be a memory leak? It is possible that during multi-threading the GC doesn't turn itself on as often. I haven't done any reading on this, but that has been my impression during benchmarking. Maybe I'm completely wrong.
@tjjarvinen can you please comment? Also in general, what are typical reasons to get OOMs in Julia? How is this even possible?
The main reason for OOM is that you allocate too much data. GC helps a little in these cases, because it clears out some unneeded data.
To me this issue sounds like Julia is starting to use swap, which results in a slowdown. Once the swap is used up too, it will cause an OOM.
Have you looked at how much memory Julia is using (plus memory per process) and what the swap usage is?
Also, what is the estimated memory usage for the assembly?
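For what it's worth, a quick way to check the first of those numbers from within Julia (these are standard `Base.Sys` functions):

```julia
# total and free physical memory, converted to GiB
total_gib = Sys.total_memory() / 2^30
free_gib  = Sys.free_memory() / 2^30
println("total = $(round(total_gib, digits=1)) GiB, free = $(round(free_gib, digits=1)) GiB")
```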
> when you work interactively as we do, you first add the workers, then have to instantiate the environment on each worker. Then you might as well cycle to the next coffee shop and take a break before you can continue since that operation can take a long time
I believe you, but I still don't understand. This rarely takes more than a few seconds for me interactively (and the `exeflags` thing shouldn't be necessary since 1.9) - when/why is an instantiate required?
```julia
using Distributed
using ACE1pack
addprocs(50, exeflags="--project=$(Base.active_project())")
@everywhere using ACE1pack
```
> `addprocs(50, exeflags="--project=$(Base.active_project())")`
this is news to me, I was unaware of this.
Maybe we should just test this again more carefully. In principle, if firing up 50 workers is more or less instantaneous, then we can discuss dropping the MT again.
> when/why is an instantiate required?
Instantiate is not needed if you are using the same node, or a Julia install that has access to the same local data (the `.julia` folder). If you have different nodes that do not have access to the same storage space, then you need to instantiate on each worker.
That wasn't my experience. I don't know why, but simply `using` did not work for me. Maybe things have changed in the past few versions of Julia and I missed those changes.
So maybe one of you can just put together a script for us so we can try using multiple workers interactively? We will try it, and if it works fine we continue from there?
You've convinced me the MT is worthwhile ... as soon as I'm done experimenting with it we can merge. Otherwise, at this point I'm just making sure I understand your workflow - like whether your workers are somehow on another machine.
More importantly, I now understand the timing of this OOM stuff. I used to have this line

```julia
GC.gc() # not sure if this is necessary?
```

which I removed recently during some rearranging (54b7b2ed0a5ccb66820a799e43353f5979d68cfc). I'll put it back. In principle, I don't think it should be necessary, but the forums are full of people complaining about parallel-related memory issues, such that it's a little hard to figure out what is current/relevant.
I looked at the code in this PR and

```julia
while next <= length(packets)
    # retrieve the next packet
    if next > length(packets)
        break
    end
    lock(_lock)
    cur = next
    next += 1
    unlock(_lock)
    if cur > length(packets)
        break
    end
    p = packets[cur]
```
This part is attempting to reimplement Channels. Rather than trying to rediscover Channels, just use the existing ones. It makes the code better in every possible way.
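For illustration, here is a minimal sketch of the same work distribution built on a `Channel` instead of a lock-protected counter (the packet list and the processing step are stand-ins):

```julia
using Base.Threads

packets = collect(1:20)                  # stand-in for the real packet list
ch = Channel{Int}(length(packets))
foreach(i -> put!(ch, i), eachindex(packets))
close(ch)                                # iteration below stops once drained

results = zeros(Int, length(packets))
@sync for _ in 1:nthreads()
    @spawn for cur in ch                 # take! is synchronized internally
        results[cur] = 2 * packets[cur]  # stand-in for processing a packet
    end
end
```

The channel handles all the locking and the termination condition, so the manual index bookkeeping disappears entirely.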
On the MT, I'm close to something I'm happy with (influenced by @tjjarvinen's recommendation of Folds.jl). Can we just pause that discussion until I'm done, and then you can critique it from there?
> that wasn't my experience. I don't know why but simply `using` did not work for me. Maybe things have changed in the past few versions of Julia and I missed those changes.
When you spawn new processes, they get the project dir from the host. But every process needs to load all the packages separately, so you need to start with

```julia
using Distributed
addprocs(20)
@everywhere using List_of_packages
```
> So maybe one of you can just put together a script for us so we can try using multiple workers interactively.
Our docs are actually relatively decent here - the "script" approach should work interactively. I don't say this to be annoying - if they aren't sufficient I will improve them. https://acesuit.github.io/ACE1pack.jl/dev/gettingstarted/parallel-fitting/
> Our docs are actually relatively decent here ...
I've never seen this before. My mistake, sorry.
@tjjarvinen -- I wrote the part above that you quote. I agree with you, of course. But remember this was a temporary hack, and for me it was faster to write this than to learn about Channels. Let's see what Cheuk comes up with and then discuss.
@wcwitt --- just to confirm that with your instructions above it becomes more convenient to assemble the LSQ system distributed instead of multi-threaded. The barrier is still slightly higher, but not nearly as bad as it used to be. So from my end, we can make this low priority.
(also I can confirm that putting the `GC.gc()` back into the distributed assembly prevented some OOMs for me just now. I tried with versions of the package before and after...)
Thanks - I'm glad, but we're deep enough into this now that I'm going to try to finish it off.
Would someone please look at -- or possibly try -- this example? It requires the `wcw/gc-test` branch of ACEfit, but I have merged this branch into that one. The Si PRX data are available here, and I started Julia with `julia -p 9 -t 10`. See this diff for the relevant changes to the assembly.
```julia
using Distributed
@everywhere using ACE1pack

dataset = "gp_iter6_sparse9k.xml.xyz"
data_keys = [:energy_key => "dft_energy",
             :force_key => "dft_force",
             :virial_key => "dft_virial"]
model = acemodel(elements = [:Si],
                 Eref = [:Si => -158.54496821],
                 rcut = 5.5,
                 order = 4,
                 totaldegree = 16)
data = [ AtomsData(at; data_keys..., v_ref = model.Vref)
         for at in read_extxyz(dataset) ]

# distributed
ACEfit.assemble(data, model.basis)  # to force compilation
time_mp_gc = @elapsed ACEfit.assemble(data, model.basis; do_gc = true)
time_mp_no = @elapsed ACEfit.assemble(data, model.basis; do_gc = false)

# threaded
ACEfit.mt_assemble(data, model.basis)  # to force compilation
time_mt_gc = @elapsed ACEfit.mt_assemble(data, model.basis; do_gc = true)
time_mt_no = @elapsed ACEfit.mt_assemble(data, model.basis; do_gc = false)

@show time_mp_gc, time_mp_no
@show time_mt_gc, time_mt_no
```
For me, the results after starting Julia with `julia -p 9 -t 10` are

```
(time_mp_gc, time_mp_no) = (283.623841182, 255.898352052)
(time_mt_gc, time_mt_no) = (848.516291594, 253.484540468)
```

indicating a huge slowdown when I garbage collect from each thread.
interesting - your script managed to crash my laptop all the way to reboot ...
Independent of my problem - can you look at Julia 1.10? I read somewhere that it has a multi-threaded GC. I wonder whether that solves your problem.
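If it helps with testing: 1.10 runs the GC on multiple threads by default (tunable via the `--gcthreads` command-line flag), and the count can be queried at runtime:

```julia
# Threads.ngcthreads() is only available from Julia 1.10 onward
if VERSION >= v"1.10"
    @show Threads.ngcthreads()
end
```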
Here are my times on Julia 1.9:

```
(time_mp_gc, time_mp_no) = (275.686350292, 279.586481542)
(time_mt_gc, time_mt_no) = (535.976733791, 236.405905)
```

and on Julia 1.10:

```
(time_mp_gc, time_mp_no) = (262.532686458, 220.518536)
(time_mt_gc, time_mt_no) = (372.88618475, 386.81745025)
```
So interestingly, the distributed version is hands-down the faster one. The GC could be kept on by default, but it would be good to have an option to turn it off when the user wants it?
I think all things considered - it's maybe good to keep distributed the default. Especially as we move towards multi-threaded acceleration of model evaluation.
Thank you very much for taking the time to do this. Naturally, I like the distributed and I would be happy for it to be default, but I remain unsettled by this: https://github.com/ACEsuit/ACEfit.jl/issues/49#issuecomment-1611453321.
Your observation that the threading performs better for very small configs seems correct, and I won't be satisfied with default-distributed until I've managed to resolve that part.
Did you rerun the test above with the latest ACEfit / ACE1pack on Julia 1.9 and 1.10?
Another thing we can do is feed training structures to processes in batches.
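If we go that route, note that `pmap` already supports this via its `batch_size` keyword; a minimal sketch with a stand-in workload (no workers added here, so it runs on the master process):

```julia
using Distributed
# addprocs(4)                  # uncomment to actually distribute the work

@everywhere process(x) = x^2   # stand-in for assembling one structure

# ship work to the workers in batches of 10 instead of one item at a time
results = pmap(process, 1:100; batch_size = 10)
```

Batching amortizes the per-task communication overhead, which matters most when the individual structures are small.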
@wcwitt -- is this now obsolete - can we close it?
I'd rather keep it open for a bit longer, if that's okay.
I'm fine to close this now, if you still want to. Linking #49 for reference.
So is the MT assembly now in the package, or was it removed again because of the poor performance?
To answer your question, it's not in the package, but it does live on in a branch. Eventually I will solve the performance dilemma.
For now this is just an experiment so we can see what it might look like; the interface probably needs to change.