ACEsuit / ACEfit.jl

Generic Codes for Fitting ACE models
MIT License

Multi-threaded parameter estimation #49

Open cortner opened 1 year ago

cortner commented 1 year ago

Distributed has a lot of overhead. I think we should return to also providing multi-threaded assembly of linear systems.

wcwitt commented 1 year ago

Very open to this, but I would prefer to see some scalability tests first. Could be something for a student to do.

cortner commented 1 year ago

I was actually going to try myself because I'm struggling with the distributed assembly.

cortner commented 1 year ago

I wrote a first version of this - not yet exported. My distributed assembly problems are immediately gone, and I can now assemble larger datasets while fully utilising all the cores I have. I'll make a PR so you can take a look.

There is a potential issue with how slices of the design matrix are written into the shared array using a lock. But something like that will always have to be done - distributed or not. This problem probably only arises when there are many very small structures, so then maybe one needs to write in batches...
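
For reference, a minimal sketch of the pattern being described here - threads compute their rows locally and a lock serializes the writes into the shared design matrix. The names compute_rows, configs, and row_ranges are hypothetical placeholders, not the actual ACEfit internals:

using Base.Threads

function assemble_threaded(configs, row_ranges, ncols)
    A = zeros(sum(length, row_ranges), ncols)
    lk = ReentrantLock()
    @threads for i in eachindex(configs)
        rows = compute_rows(configs[i])    # thread-local work (placeholder)
        lock(lk) do                        # serialize writes into the shared matrix
            A[row_ranges[i], :] .= rows
        end
    end
    return A
end

With many tiny structures the lock is contended on essentially every iteration, which is why batching the writes could help.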

wcwitt commented 1 year ago

This problem probably only arises when there are many very small structures

This makes me wonder if adjusting the pmap batch_size argument might also fix your problem. I'm happy to merge a PR if it improves things, although if a one-size-fits-all distributed option can work, I slightly prefer that.
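
For context, pmap does take a batch_size keyword (default 1), so grouping many small structures per message is a one-line change - a sketch, with assemble_row and configs as placeholders:

using Distributed

# Each message to a worker now carries 20 structures instead of 1,
# amortizing the communication overhead of tiny configurations.
rows = pmap(assemble_row, configs; batch_size = 20)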

cortner commented 1 year ago

I appreciate that, but the other thing to keep in mind is that for people like me who work not with clusters but with large shared-memory nodes, the overhead of working with processes is pretty annoying.

wcwitt commented 1 year ago

the overhead of working with processes is pretty annoying

Is it starting up the processes that is annoying? Or just the throughput? If it's the throughput, I think we might be able to address that by increasing the batch size to reduce communication latency. I also work on single nodes most of the time.

But by all means do submit the PR if it's important.

cortner commented 1 year ago

Starting up really is the issue - and for some reason Julia seems to want to precompile everything again on every process. Otherwise I wouldn't mind too much.

wcwitt commented 1 year ago

I started to dig into this today. The first things I wanted to test are drop-in threaded replacements for map from ThreadedIterables.jl and ThreadsX.jl. Some experiments in that direction are here and here.
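
For the record, both are essentially one-line swaps for a serial map - a sketch, with assemble_row and configs as placeholders:

using ThreadsX
# using ThreadedIterables   # alternative: tmap(assemble_row, configs)

# drop-in threaded replacement for map(assemble_row, configs)
rows = ThreadsX.map(assemble_row, configs)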

I'm running some scalability tests now and will report the results here. Happy to do the same for your solution, and whatever else we come up with.

cortner commented 1 year ago

Make sure to test with datasets that have a lot of variation in structure size. I believe that's where we get the worst performance. A few huge structures and many unit cells.

wcwitt commented 1 year ago

Here's what I could manage today. It's for the PRX Silicon dataset (2475 configs of somewhat varying size), with a (4, 16) basis (528978 × 1445 design matrix). The threaded implementation is @cortner's ... the others I tried were comparable but slower.

Both plots are for the matrix assembly. The first plot includes Julia startup and compilation time (to look out for any extra distributed overhead). The second plot shows the time for a subsequent assembly (so, not affected by startup or compilation).

The main takeaways are how similar they are and that both hit diminishing returns fairly quickly. The distributed version does seem to have a marginally higher startup cost, but then slightly outperforms the threaded one when that aspect is removed.

[Plots: assembly time vs. number of workers - first including Julia startup and compilation, second for a subsequent assembly only]

wcwitt commented 1 year ago

Do you want me to try one of your datasets? If the trend there is the same, then I suppose we need to look for system-dependent explanations?

cortner commented 1 year ago

Thank you for looking into this. I'll have to ask whether I can share the dataset I'm using.

tjjarvinen commented 1 year ago

Starting up really is the issue - and for some reason Julia seems to want to precompile everything again on every process. Otherwise I wouldn't mind too much.

Because precompilation does not work (for P4ML), Julia will compile everything for every process.

From what I read about performance above and here, it seems that performance is limited by memory bandwidth - most likely by random writes. This means that there is no difference between multithreading and multiprocessing.

In fact, if there is an SVD, that takes a lot of time too. The options are multithreading with normal LAPACK or multiprocessing with ScaLAPACK (which also means MPI). Having only multithreading is probably just easier.

cortner commented 1 year ago

seems that performance is limited by memory bandwidth

I don't think this is always true, and I would in fact be surprised if it were. I think this varies a lot between datasets.

cortner commented 1 year ago

In fact, if there is svd that takes a lot of time also

The SVD comes later. This discussion is only about assembling the design matrix. For most of our problems the SVD is much faster than the design matrix assembly.

cortner commented 1 year ago

Do you want me to try one of your datasets?

It was public after all. Would you mind trying this one?

https://archive.materialscloud.org/record/2022.102

If the performance is the same again, then I'll rerun your tests on my machine.

wcwitt commented 1 year ago

Sure - can you give me rough basis settings? Or even just the matrix size?

tjjarvinen commented 1 year ago

We should add proper benchmarking using PkgBenchmark.jl. I can add the basics for it, if you give me an example input. After that you can add more tests to the file.
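
For reference, PkgBenchmark.jl looks for a benchmark/benchmarks.jl that defines a SUITE; a minimal sketch, assuming an assemble(data, basis) entry point and with data and basis as placeholders for the example input:

# benchmark/benchmarks.jl
using BenchmarkTools
using ACEfit

const SUITE = BenchmarkGroup()
# data and basis would come from the example input mentioned above
SUITE["assembly"] = @benchmarkable ACEfit.assemble($data, $basis)

Running it would then be: using PkgBenchmark; benchmarkpkg("ACEfit").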

cortner commented 1 year ago
using ACE1x   # provides acemodel

r_cut = 5.5   # cutoff radius; not defined in the original snippet - set to your dataset's value

TDEGS_4 = [
      [ 18, 14, 10, 6 ],  # 183
      [ 20, 16, 12, 8 ],  # 314
      [ 22, 18, 14, 10 ], # 539
      [ 24, 20, 16, 12 ], # 929
      [ 26, 22, 18, 14 ], # 1609
      [ 28, 24, 20, 16 ], # 2773
      [ 30, 26, 22, 18 ], # 4789
   ]

make_model(tdeg;
         transform = (:agnesi, 2, 4),
         pair_transform = (:agnesi, 1, 4),
         pure = false, ) =
   ACE1x.acemodel(;
         elements = [:Fe],
         order = length(tdeg),
         totaldegree = tdeg,
         transform = transform,
         pair_transform = pair_transform,
         pure = pure,
         rcut = r_cut,
         Eref = Dict(:Fe => -3455.6995339), )

make_model(TDEGS_4[6])

cortner commented 1 year ago

That's the model that should work.

The one that failed is

make_model(TDEGS_4[3]; pure=true)

Don't do this with a level larger than 3, since generating the model then takes a very, very long time.

tjjarvinen commented 1 year ago

I started to dig into this today. The first thing I wanted to test are drop-in threaded replacements for map from ThreadedIterables.jl and ThreadsX.jl.

Use Folds.jl rather than ThreadsX.jl. Both use Transducers.jl in the background, but Folds.jl is newer and can also use distributed execution.
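
For illustration, the switch is again a one-liner, and the executor argument selects the backend - a sketch with assemble_row and configs as placeholders:

using Folds
using Transducers: DistributedEx

rows = Folds.map(assemble_row, configs)                   # threaded (default)
rows = Folds.map(assemble_row, configs, DistributedEx())  # distributed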

wcwitt commented 1 year ago

Ah, thank you! I was wondering about that, actually.

wcwitt commented 1 year ago

@cortner, I can confirm your distributed vs. threading observation on the iron dataset, with one caveat: I can't find the original data (the earlier paper's links to the data appear broken), so I am only fitting to the "extensions" in the paper you linked (2306 configs, matrix size 36725 × 2773). Here are plots analogous to the ones above. I can't really explain them at this point.

[Plots: same two plots for the iron dataset - assembly time vs. number of workers, with and without startup/compilation]

wcwitt commented 1 year ago

We should add proper benchmarking using PkgBenchmark.jl.

I'm open to this, but don't have time to clean things up at the moment

cortner commented 1 year ago

That's much, much smaller than the full dataset I'm working with, but perfectly fine if it reproduces the problem :)

wcwitt commented 1 year ago

I spent most of last night and today (without much success) trying to improve the parallel scaling of assembly, looking at both multiprocess and multithreaded. In particular, I set a goal to reduce assembly time for the silicon dataset (same model parameters as in the plots above) to below one minute using 50 threads and/or processes. I tried a lot of things, but I never got below 100 seconds.

A few observations:

Altogether, I think I will add a batch_size option to the main assembly routine, but I'm running out of ideas for dramatic improvement.

Edit: I should also try with a newer Julia version that has multi-threaded GC.

tjjarvinen commented 1 year ago

That means that the reason for the bad performance is in the feature_matrix, target_vector, and/or weight_vector execution. They do calculate energies, forces, and virials? So it might be that those are what cause the issues.

Do those energy, forces and virial calls go back to JuLIP?

cortner commented 1 year ago

They do - and in the past we ensured that the JuLIP calculators switch to single-threaded mode. But maybe we need to double-check this here.
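
One related knob worth double-checking (an aside, not necessarily the culprit here): BLAS keeps its own thread pool, so nested parallelism can oversubscribe the node unless it is pinned to one thread per worker:

using LinearAlgebra

BLAS.set_num_threads(1)   # avoid BLAS threads competing with Julia threads/workers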

tjjarvinen commented 1 year ago

Also, with JuLIP's retirement incoming, we will need to reimplement them at some point. So we could consider doing it now.

cortner commented 1 year ago

That's a very risky and huge change. We cannot do this before publishing. The problem is that JuLIP.Atoms deeply permeates ACE1.jl, ACE1x.jl, and ACE1pack.jl.

Off-topic: ... but please do come up with a plan for how this transition could be managed.

wcwitt commented 1 year ago

That means that the reason for the bad performance is in feature_matrix, target_vector and/or weight_vector execution.

Agree - at one point I confirmed it was basically all feature_matrix. If I were going to push this further (not likely in the short term), I would probably try to reduce the allocations there.