performance vs CellListMap

lmiq commented 4 months ago

I'm opening this issue to perhaps provide some information concerning the discussion that we might have on Jul 7.

I've tested quickly here the computation of a nb list with NeighbourLists and CellListMap, and this is what I get:

julia> import CellListMap

julia> using NeighbourLists

julia> Threads.nthreads()
20

julia> x, box = CellListMap.xatomic(10^4); # sample coordinates with liquid water atomic density

julia> using Chairmarks

julia> @b PairList($x, $(box.cutoff), $(box.input_unit_cell.matrix), (true, true, true))
1.411 s (80186 allocs: 635.531 MiB, 1.21% gc time, without a warmup)

julia> @b CellListMap.neighborlist($x, $(box.cutoff); unitcell=$(box.input_unit_cell.matrix))
21.213 ms (6698 allocs: 236.658 MiB)

Am I'm doing something wrong?

Here, we have:

julia> box.cutoff
12.0

julia> box.input_unit_cell.matrix
3×3 StaticArraysCore.SMatrix{3, 3, Float64, 9} with indices SOneTo(3)×SOneTo(3):
 46.4159   0.0      0.0
  0.0     46.4159   0.0
  0.0      0.0     46.4159

The coordinates of the example are random coordinates within that box, and the density is the atomic density of liquid water.

cortner commented 4 months ago

I'm not entirely surprised.

my list is not actually multi-threaded
the list of pairs is not lazy but actually computed in memory; this is far more expensive than producing the cell list
my list does not assume min-image, nor any restrictions of cell shape, this creates a lot of overhead, some of that might be fixable

lmiq commented 4 months ago

I was unable to run now some comparison to matscipy. Do you think it is on the same ballpark there?

In this case CellListMap.neighborlist also returns the list of pairs in memory (it isn't lazy either). The difference here is that it returns a list of Tuple{Int,Int,Float64} with the indices and distances between the non-redundant pairs (that its, only for i > j). Concerning the cell shape, there aren't restrictions on the cell shape here either (except, in fact, some related to the size of the cell relative to the cutoff, the cutoff must not be to large relative to the cell sizes).

oh, but now I see what probably mean: you can compute pairlists for periodic systems at multiple images (such for instance for crystals many cells away). Indeed, that CellListMap does not do, it is mostly thought for amorphous systems. That's different indeed.

Single-threaded this is what I get:

julia> Threads.nthreads()
1

julia> x, box = CellListMap.xatomic(10^4); # sample coordinates with liquid water atomic density

julia> using Chairmarks

julia> @b PairList($x, $(box.cutoff), $(box.input_unit_cell.matrix), (true, true, true))
1.453 s (80091 allocs: 635.597 MiB, 5.99% gc time, without a warmup)

julia> @b CellListMap.neighborlist($x, $(box.cutoff); unitcell=$(box.input_unit_cell.matrix))
208.961 ms (721 allocs: 197.067 MiB, 36.59% gc time)

cortner commented 4 months ago

Let's keep it open. I think that there is scope to improve things at my end, and I think there is scope to share code or at least an interface that selects your cell-list if min-image and mine if not. And maybe that can be merged into a single package. I have no strong views on any of this.

My list is slightly slower than the C++ implementation James Kermode.

Teemu and some others are talking about re-implementing neighbourlists from scratch. I don't have a strong view on this either.

My only perspective is that I hate code-duplication and maintaining multiple packages doing the same thing - especially when the community is as tiny as JuliaMolSim. Apart from that I'm open to anything.

lmiq commented 4 months ago

Teemu and some others are talking about re-implementing neighbourlists from scratch. I don't have a strong view on this either.

Yes, exactly because of that I ended checking this. I'm interested in knowing what they are about (I guessed this was the implementation they were starting to develop). I have spent quite some time optimizing what's done in CellListMap because I need that for ComplexMixtures.jl, Packmol.jl, etc. CellListMap is more or less on the same ballpark as what NAMD does (but not scaling as effectively) on CPUs. I did not do anything in the sense of trying to compute these on GPUs, though. Anything that accelerates even further these calculations would be useful for me.

cortner commented 4 months ago

I probably won't be there at the meeting on 7 June, so here are a few comment:

I don't think there is a plan to rewrite or continue to develop NeighbourLists.jl. @tjjarvinen and others are writing a new code from scratch that is supposedly GPU-friendly. I believe the gist is to only do the Cell list and then compute the actual neighbours lazily.
I asked that the new nlist can produce atomic environments in batches, which can help speed up ML potentials
I'd be more than happy to give up this repo as an interface to different implementations, it seems to have the canonical naming for that.
I think NeighbourLists as implemented right now is too different from CellListMap to merge them. Maybe better to give the package a new name or possibly kept as a reference implementation for testing purposes.

EDIT : I don't think it would be too hard to multi-thread it. Somebody who knows what they are doing could probably do it in 1-2 days.

tjjarvinen commented 4 months ago

There is a need for GPU neighbourlist, by Molly team, CESMIX and we in Vancouver could use it too. This combined to that non of the current neighbourlist implementations work with GPUs, nor have their design taken into account GPU limitations (and strengths).

Also different potentials need to access neighbourlist differently and this causes some issue. We cannot e.g. use CellListMap to implement ACE potentials, as it is now, because of this.

The plan of the meeting is to go trough on what we would need for the neighbourlist, what limitations GPU bring to neighbourlist implementation and finally present an idea on how to implement performant GPU neighbourlist.

In practice it is an opportunity discuss how to collaborate on neighbourlist implementations and how to get started with GPU neighbourlist.

cortner commented 4 months ago

Also different potentials need to access neighbourlist differently and this causes some issue. We cannot e.g. use CellListMap to implement ACE potentials, as it is now, because of this.

I'll have to look into the code though and I suspect this is more than fixable...

how to collaborate on neighbourlist implementations

I think that part is important. Let's not go from maintaining 3 to 10 neighbourlists..

lmiq commented 4 months ago

When implementing CellListMap I did some research but find close to none information on how they are implemented in the MD packages. I think one needs to somewhat go through their codes to understand what do. Amber has a particularly fast GPU implementation. I hope you have better luck and talent than me.

tjjarvinen commented 4 months ago

There is even less information on GPU neighbourlist implementations. This is the main one I have been looking https://doi.org/10.1063/5.0018516

lmiq commented 4 months ago

To be true, upon reading that paper, it didn't become clear to me that they build the pair lists on the GPU.

tjjarvinen commented 4 months ago

That is true. The article (and the referenced articles) only mention some ideas about how to implement GPU neighbourlist. No true description on how to actually implement it. Thats why I said "even less information". But, atleast there are something.

lmiq commented 4 months ago

My impression is that they don't build the pair lists on the GPU, but discuss the way to update the lists on the GPU. The initial pair list, I think, is only built once, on the CPU.

That approach (which is a sort of sophisticated implementation of Verlet lists) is useful in the MD context because the positions of the particles do not change much from step to step. This is one kind of application, but not the most general one - for instance to analyze trajectories, where steps not very correlated structurally, that may not be useful, and the pair lists have to be reconstructed from scratch each time.

(I would take a look on what OpenMM does, some insights here, maybe: http://docs.openmm.org/latest/developerguide/06_opencl_platform.html#reordering-of-particles).

tjjarvinen commented 4 months ago

I am not suprised at all, if they do the intial cell list on CPU, as it is very problemtic with GPU. But, I have very good idea on how to do the initial cell list on GPU. Updating the cell list for consecutive MD steps is easy. But, if this is done then MD program needs to adopt velocities and forces to the cell list structure also, which is doable.

In any case I have an idea on how this all can be pulled together, and we can then discuss shall we go for it or will we do something else.

JuliaMolSim / NeighbourLists.jl

performance vs CellListMap #28