Open rkurchin opened 3 weeks ago
I'm pro a canonical implementation that covers most use cases. I would only want it to be in the interface repo if it had negligible include time though. I don't see the harm in it being in a separate package.
The challenge would be agreeing on some minimal required properties, and yes I think there needs to be some way to store arbitrary data. But there needs to be at least some required properties, otherwise we end up defining an empty struct.
I also think a canonical implementation outside AtomsBase would be good. For me important would be:
My feeling is that convenience (e.g. mutability, annotation of bond properties etc.) and efficiency/suitability for AD might be contradicting. So maybe we should in the long run even have two implementations, one flexible, one fast ? That makes it easy to convert to/from these things whenever coding generic algorithms like GeometryOptimization does ? Would of course be good if this fragmentation can be avoided.
+1 for arbitrary properties. This has been critical to the success of ase.Atoms
.
Apparently Ask is not that happy with it :)
I'm somewhat late for the discussion, but maybe I have something to add. I have been recently implementing the reading o mmCIF files in PDBTools.jl because I needed to store and read data for simulation files of 70M atoms. For that, apart from having to optimize the reading of the file, I had to remove a custom dictionary from my Atom
data structure, because itself was too large to be stored in memory for every atom, even if empty:
julia> d = Dict{Symbol,Any}()
Dict{Symbol, Any}()
julia> Base.summarysize(d) * 70 * 10^6 / 1024 / 1024
30441.2841796875
(that would be 30GB just to store empty dicts for custom properties).
The result was that my Atom
type, which previously always contained an empty Dict
, had to be become a parametric Atom{CustomType}
which defaults to Atom{Nothing}
where the custom
field of the struct is instantiated to nothing
. With that I can hold in memory the data for the 70M atoms.
As a side note, I kept my Atom
implementation mutable.
A major discussion point at the workshop yesterday was the idea of having an "official" or "canonical" implementation of the AtomsBase interface (somewhat analogous to
ASE.Atoms
in Python). Advantages would include:Some logistical points to discuss on this:
Implementation discussion points:
Going to explicitly ping @cortner, @jgreener64, and @jameskermode here as they either very actively participated in discussion or implemented something that came up in said discussion :) but anyone should chime in with thoughts!