ACEsuit / ACEfit.jl

Generic Codes for Fitting ACE models
MIT License

Alternative Assembly Routine #56

Open cortner opened 1 year ago

cortner commented 1 year ago

This is to split off the alternative assembly suggestion of @tjjarvinen in #55 from the sendto issue.

cortner commented 1 year ago

From @tjjarvinen :

I would like to make changes to the whole parallel assembly.

A better way to do this would be to use pipeline-style parallelization.

To do this, you need to do something like this:

using Distributed   # provides RemoteChannel, remote_do, workers, nworkers

# jobs and results are RemoteChannels
# f is a function that is meant to be run in parallel
function parallel_work(jobs, results, f)
    while true
        id, input = take!(jobs)
        x = f(input...)
        put!(results, (id, x))
    end
end

# Create RemoteChannels for the work
jobs = RemoteChannel(() -> Channel(2 * nworkers()))
results = RemoteChannel(() -> Channel(2 * nworkers()))

# Create an input-feeding task.
# As this is async, it runs in the background
# and execution continues.
@async for (id, x) in enumerate(inputs)
    # the index serves as an id for the job
    put!(jobs, (id, x))
end

# This spawns parallel workers that will do the work
for p in workers()
    remote_do(parallel_work, p, jobs, results, function_to_execute)
end

# Collect results and assemble them together.
for _ in eachindex(inputs)
    id, result = take!(results)
    # assemble result into the correct place
end

This is what I have done earlier here, if you would like a reference.

You have two options for moving the basis. One is to have it move through the RemoteChannel. The other is to send it to all workers before the calculation. The latter is probably the best option here.

Also, in my opinion, the multithreaded version should do the same thing. Only change RemoteChannel to Channel and remote_do to @spawn; everything else can stay the same.
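For illustration, here is a minimal sketch of that multithreaded variant (not code from ACEfit): plain Channels and Threads.@spawn replace the RemoteChannels and remote_do, `inputs` and the `sum` work function are placeholders, and closing the jobs channel is added so the sketch runs to completion.

# the same pipeline, but with Channels and task-based threading instead of
# RemoteChannels and remote_do
function threaded_work(jobs, results, f)
    for (id, input) in jobs                 # iterates until `jobs` is closed and drained
        put!(results, (id, f(input...)))
    end
end

inputs = [(i,) for i in 1:8]                # placeholder work items

jobs    = Channel{Tuple{Int,Tuple}}(2 * Threads.nthreads())
results = Channel{Tuple{Int,Any}}(2 * Threads.nthreads())

# feed the jobs in the background and close the channel so the workers terminate
Threads.@spawn begin
    for (id, x) in enumerate(inputs)
        put!(jobs, (id, x))
    end
    close(jobs)
end

# one worker task per thread; `sum` stands in for the real work function
worker_tasks = [Threads.@spawn threaded_work(jobs, results, sum) for _ in 1:Threads.nthreads()]

# collect and assemble the results on the main task
for _ in eachindex(inputs)
    id, result = take!(results)
    # place `result` at position `id` of the assembled output
end

foreach(wait, worker_tasks)                 # surface any worker errors

The structure mirrors the distributed sketch above, so in principle the two versions could share most of the pipeline logic.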

cortner commented 1 year ago

From @wcwitt :

Open to being persuaded, but I'm pretty reluctant to make this change without a compelling reason.

- Main process spawns an async task to fill a RemoteChannel with job items
- Spawned workers listen to the job-item RemoteChannel for work
- A worker takes a job, calculates it, and deposits the result into a results RemoteChannel
- The main process listens to the results RemoteChannel and assembles the results together

This is functionally identical to the current approach (right?), but with 2-3x more code, and it is less readable for non-experts.

You have two options for moving the basis. One is to have it move through the RemoteChannel. The other is to send it to all workers before the calculation. The latter is probably the best option here.

The latter is what we do currently, but it fails because of a Julia bug for some basis types. Hence this issue.

Again, I'm open to being persuaded - and if there are performance issues we need to address them - but I spent a fair bit of time weighing the alternatives before choosing this pattern and I don't (yet) see how this change could improve performance.

cortner commented 1 year ago

To me - what Teemu suggests looks natural and I like the idea of having an implementation that can naturally run distributed or multithreaded. But let's not rush this change then.

cortner commented 1 year ago

Something to keep in mind when we do this please: In the past with multi-threaded assembly we sometimes had the problem that writing into the LSQ matrix ended up being a huge overhead (and bottleneck!). We thought this might be related to memory layout but never fully explored it.

... and I'm noticing this again with the implementation in ACEfit - the processor load drops to around 100% (i.e. ~ 1 thread) even though all threads are still active...

cortner commented 1 year ago

I think the pmap implementation has a different potential problem: the training structures seem to get assigned randomly to the processes. This is potentially very bad. For performance, the training structures need to be assigned to processes by size (largest first). But it is still much better than the naive multi-threaded version, and I suspect this issue is why the performance gap between the two is so small.

julia> @everywhere f(n) = (sleep(1); (@info n));

julia> pmap(f, 1:16);
      From worker 5:    [ Info: 2
      From worker 2:    [ Info: 1
      From worker 3:    [ Info: 3
      From worker 4:    [ Info: 4
      From worker 3:    [ Info: 8
      From worker 4:    [ Info: 6
      From worker 5:    [ Info: 7
      From worker 2:    [ Info: 5
      From worker 4:    [ Info: 10
      From worker 3:    [ Info: 9
      From worker 2:    [ Info: 12
      From worker 5:    [ Info: 11
      From worker 3:    [ Info: 13
      From worker 4:    [ Info: 14
      From worker 5:    [ Info: 16

cortner commented 1 year ago

Let's look at #49 first

wcwitt commented 1 year ago

To me - what Teemu suggests looks natural and I like the idea of having an implementation that can naturally run distributed or multithreaded.

It's not that I find it unnatural - the basic pattern is quite sensible! My point is that it's nearly the same as an asynchronous map, for which both distributed and multithreaded implementations are available (although I haven't tested the latter much). And I prefer canonical routines, if feasible, over asking non-Julia experts to wrap their brains around Channels.

The only substantive difference, as far as I can tell, is in this last part:

The main process listens to the results RemoteChannel and assembles the results together

But do we really want to transmit all the matrix data back to the main process/thread for writing? Better, in my view, to ask the workers to do the writing themselves. No two processes/threads should ever touch the same memory, so data races aren't a concern (although I acknowledge the problems you've seen with the threaded version).
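Not the actual ACEfit code, but a minimal sketch of that pattern, assuming a SharedArray-backed design matrix and precomputed, disjoint row ranges; `compute_block` is a hypothetical stand-in for the per-structure computation.

using Distributed
addprocs(4)
@everywhere using SharedArrays

nrows, ncols = 1000, 50
A = SharedArray{Float64}(nrows, ncols)      # shared between local worker processes

# disjoint row blocks, one per training structure (assumed precomputed)
row_ranges = [(1 + (i - 1) * 100):(i * 100) for i in 1:10]

@everywhere compute_block(rows, ncols) = rand(length(rows), ncols)   # stand-in for the real work

pmap(eachindex(row_ranges)) do i
    rows = row_ranges[i]
    A[rows, :] = compute_block(rows, size(A, 2))   # each worker writes only its own rows
    nothing
end

Because the row ranges are disjoint, each worker only ever writes to its own rows, so no locking is needed and nothing is shipped back to the main process for writing.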

Something to keep in mind when we do this please: In the past with multi-threaded assembly we sometimes had the problem that writing into the LSQ matrix ended up being a huge overhead (and bottleneck!). We thought this might be related to memory layout but never fully explored it.

Definitely agree this is something to keep in mind.

I think the pmap implementation has a different potential problem: the training structures seem to get assigned randomly to the processes. This is potentially very bad. For performance, the training structures need to be assigned to processes by size (largest first).

But in recent versions we sort by size to address this? I don't love that solution, actually, because the cost seems to vary nonlinearly with structure size, meaning the estimated time remaining is nearly always an overestimate, but I don't have a better idea. (This is also why the latter phase of the assembly goes so quickly - the workers are zooming through the small structures.)
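For concreteness, a minimal sketch of the sort-then-distribute idea (again, not ACEfit's actual code): `structures`, `natoms`, and `assemble_rows` are hypothetical stand-ins, with `natoms` serving as the imperfect, possibly nonlinear cost proxy mentioned above.

using Distributed

# dispatch the most expensive structures first so the slowest jobs start early
# and the small structures fill in the gaps at the end
order = sortperm(natoms.(structures); rev = true)

blocks = pmap(order) do i
    assemble_rows(structures[i])    # hypothetical per-structure assembly
end

# blocks[k] belongs to structures[order[k]]; undo the permutation when
# writing the blocks into the design matrix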

cortner commented 1 year ago

Hi Chuck - ok, no worries. Since you are clearly not too thrilled with this proposal, let's put it on ice for now. Let's explore my performance issues a bit first and then see if we need to come back to it.