Probe-Particle / ppafm

Classical force field model for simulating atomic force microscopy images.
MIT License

Suggest cleanup of geometry handling in the CPU version #198

Open mondracek opened 1 year ago

mondracek commented 1 year ago

As I mentioned in my comment to #52, the way the CLI version of our code handles the geometry parameters (params['gridA'], ... params['gridN'], lvec, nDim etc.) could use some cleanup. For starters, I propose we discuss this issue:

  1. What do we expect the role of the gridA, gridB, gridC, and gridN parameters in the code to be? Do we want them to be just an optional input (in case the geometry is not specified by other means, in particular in a *.xsf or other input file)? In that case, they can be safely forgotten (I do not mean literally deleted, just ignored) after the initialization stage, and the actual lattice vectors and grid size should be consistently carried in variables like lvec and nDim throughout the actual computation. Alternatively, do we want params['gridA'], ..., params['gridN'] to serve as global variables? Then we should consistently work with those; in the most extreme version of this choice, we would not even need to introduce the lvec and nDim variables at all. I can understand that a compromise might be desirable: perhaps we want some functions to also be usable as stand-alone entities, without being called from the main scripts. A moderately experienced user might want to do something in an interactive Python session. The loadGeometry function is one of those we may want to be able to call this way. Then it makes sense for such a function to take lvec, nDim etc. as arguments and store them as its local variables. On the other hand, some other functions may be intended as mere subprograms of the main executable scripts and need access to global variables, since passing everything they need through arguments would be inefficient and would generate overly long argument lists. I suspect this compromise was what @ProkopHapala had in mind when originally writing the code. Prokop, is that so? I am actually in favor of the compromise solution, provided that we decide very carefully which functions should be independent entities, taking geometry parameters as their arguments and perhaps returning them as their values, which functions should in contrast access the global variables, and why the chosen option is needed in each case. If we are not clear about this, we will inevitably end up with the mess we currently have. (A sketch of the fallback pattern I have in mind follows after this list.)
  2. As a second point, I would like to make sure that the code reads the origin of the unit cell from the *.xsf and *.cube files every time, stores it, and handles it accordingly. That is, the code should not rely on the origin being zero. We might even introduce an optional parameter params['gridOrigin'] in params.ini, which would be used with *.xyz files. That should solve the problem with negative coordinates of atoms in non-periodic systems, which are currently effectively forbidden.
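
To make point 1 concrete, here is a minimal sketch of the fallback pattern I have in mind, extended with the gridOrigin parameter from point 2. The function name resolve_geometry and the dict-style access to params are mine for illustration only; just gridA/gridB/gridC/gridN (and the proposed gridOrigin) come from the existing code:

```python
import numpy as np

def resolve_geometry(params, lvec=None, nDim=None, origin=None):
    """Return (lvec, nDim, origin), falling back to the global
    parameters only when the caller did not supply explicit values."""
    if lvec is None:
        lvec = np.array([params["gridA"], params["gridB"], params["gridC"]])
    if nDim is None:
        nDim = np.array(params["gridN"], dtype=int)
    if origin is None:
        # hypothetical params['gridOrigin'], defaulting to zero as the code effectively assumes now
        origin = np.array(params.get("gridOrigin", (0.0, 0.0, 0.0)))
    return lvec, nDim, origin
```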
mondracek commented 1 year ago

I failed to reproduce what I thought was the most severe issue motivating this proposal for a thorough cleanup of how the geometry is handled. I was under the impression that sometimes the grid/lattice parameters read from params.ini were taken into account even though they should have been overridden by a *.cube input. I do not know what happened; probably there was an error in one of my *.cube files that disappeared after I had recalculated everything, but now it looks as though the problem was not in the ppafm code. Anyway, there are still several open issues related to geometry in one way or another, like the recent #201, #200, #199, the older #190 and #6, or even #52, though I have recently created a pull request associated with the last one. I would still prefer forcing some consistent logic onto the geometry handling while solving those issues, rather than just closing them with ad hoc patches that would somehow work until they came back to bite us with the next modification of the code.

mondracek commented 1 year ago

I began reviewing the code, and it now seems to me that what I will advocate with respect to the geometry need not be any big revolution after all. I have looked into core.py and I think I gather the logic followed there. The primary source of information on lattice vectors is typically an argument called cell, an analogue of lvec from HighLevel.py, common.py or io.py, but the global parameters params['gridA'], params['gridB'], params['gridC'] may be used as a backup source if cell is undefined. I believe this is a good way to do things, so I will try to keep this logic consistent.

I have a question about the function setFF, defined here: https://github.com/Probe-Particle/ppafm/blob/2344a7b01000bbf8ce3d2ddbf92abf3b9fe917c9/ppafm/core.py#L70 Do we even need it? It seems to be used only in some obscure examples in examples/_bak/. Do we want to keep these as part of the ppafm package? If not, I suggest we remove those examples along with the function setFF. Otherwise, if we want to keep it, I suggest the following small cleanup: setFF reaches for the global grid parameters twice if cell is None; first directly in its body, and a second time (which never actually happens, because by then cell has already been defined by the first case) through setFF_shape, which in turn calls setGridCell. I would remove the test for cell is None and the subsequent assignment of gridA, gridB, gridC from setFF, relying on this being done in setGridCell.

Moreover, if we use gridA, gridB, and gridC as the default for the lattice vectors in core.py, I believe we should also use gridN as the default for the grid size there (probably in the setGridN function). This enhancement should in fact not change the behavior of our code at this moment, because the flow of the code is currently such that the core.py functions are always called for some pre-established FF array and learn the grid size from the shape of this array. That is fine and should remain that way, but I still think setGridN should check its n argument for being None and use gridN as the fallback in such a case, just to make the code more robust (a sketch follows below).
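
A minimal sketch of the fallback I have in mind, assuming a setGridN roughly like the one in core.py; the stub _c_setGridN stands in for the actual call into the C library, and the params dict is a stand-in for ppafm's global parameters:

```python
import numpy as np

params = {"gridN": (100, 100, 50)}       # stand-in for ppafm's global params

def _c_setGridN(n):                      # hypothetical stub for the C call
    print("grid size passed to C:", n)

def setGridN(n=None):
    # fall back to the global parameter when no explicit size is given,
    # mirroring what setGridCell already does with gridA/gridB/gridC
    if n is None:
        n = params["gridN"]
    _c_setGridN(np.array(n, dtype=np.int32))
```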

mondracek commented 1 year ago

Just to clarify how I expect the work on this issue to be done: I will do it myself, of course, but only after hearing the opinions of others.

mondracek commented 1 year ago

Another candidate for deletion: wrapAtomsCell in common.py. It seems to be used nowhere. As far as I can tell, the function shifts the positions of atoms by [da + 10, db + 10] and then folds them back into the first unit cell. It might perhaps be useful, but at the very least, if we are going to keep and use it, the artificial and mysterious shift by 10 should be removed.
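
If we do keep it, I imagine the cleaned-up version would look roughly like this (a sketch only; the name and the assumption that the coordinates are fractional in-plane coordinates are mine):

```python
import numpy as np

def wrap_atoms_cell(frac_ab, da=0.0, db=0.0):
    """Shift fractional in-plane coordinates by (da, db) and fold them
    back into the first unit cell."""
    shifted = frac_ab + np.array([da, db])
    # numpy's % already maps negative values into [0, 1), so the extra
    # "+10" offset in the current implementation is unnecessary
    return shifted % 1.0
```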

ProkopHapala commented 1 year ago

@mondracek

I suspect this compromise was what @ProkopHapala had in mind when originally writing the code.

Yes, it is true I was aiming for such a compromise. But it is also true that some structures in the code are there just for historical reasons (to minimize the effort of rewriting parts of the code that already existed).

I would still prefer forcing some consistent logic onto the geometry handling while solving those issues, rather than just closing them with ad hoc patches that would somehow work until they came back to bite us with the next modification of the code.

I'm very much pro. If you have ideas, please go ahead and propose them. I'm very open to it. I'm just not systematic enough myself, and it is also not my priority at the moment.

ad setFF

What do you mean it is not used? OK, maybe setFF() is not used, but setFF_shape() and setFF_pointer() are used a lot. Without them you cannot share the force field between the Python and C++ parts.

So yes, this is also perhaps more historical. First I wrote setFF(), then I realized that I often need finer control (to set the shape and data of the grid force field independently). So I factored out setFF_shape() and setFF_pointer(). So yes, perhaps setFF() is almost unused now.

ad wrapAtomsCell - I don't really remember what I used it for. Perhaps I used it just once for some debugging.

mondracek commented 1 year ago

@ProkopHapala, thanks for the reply. As for the setFF() function: yes, I mean that setFF_shape() and setFF_Fpointer() or setFF_Epointer() are called directly wherever one might want to call setFF(). So I suggest we either delete the definition of setFF() or actually use it where appropriate. Perhaps we can unite setFF() and prepareArrays() into one function and use that one (a sketch follows after the list below). As I understand it, both functions do similar jobs, but

  1. prepareArrays() does not call setFF_shape(); however, we always call setFF_shape() shortly after prepareArrays().
  2. setFF() calls setFF_shape() but, in contrast to prepareArrays(), it cannot create a new array if the corresponding array argument is None.
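
A sketch of what the merged function could look like, assuming setFF_shape, setFF_Fpointer and setFF_Epointer can be imported from ppafm.core and take roughly these arguments (their exact signatures may differ; the merged function itself and its name are hypothetical):

```python
import numpy as np
from ppafm.core import setFF_shape, setFF_Fpointer, setFF_Epointer

def prepare_ff(n_dim, cell, FF=None, E=None):
    if FF is None:
        FF = np.zeros((*n_dim, 3))   # allocate the force grid if not supplied
    if E is None:
        E = np.zeros(n_dim)          # allocate the energy grid if not supplied
    setFF_shape(n_dim, cell)         # register the grid geometry with the C side
    setFF_Fpointer(FF)               # share the force array with C++
    setFF_Epointer(E)                # share the energy array with C++
    return FF, E
```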
mondracek commented 1 year ago

Now I see that the C functions setGridN and setGridCell are defined in both ppafm/cpp/GridUtils.cpp and ppafm/cpp/ProbeParticle.cpp. The definitions are identical, so the duplication should not cause any problems until we happen to change one of them, but still, it does not feel right.

ProkopHapala commented 1 year ago

Now I see that the C functions setGridN and setGridCell are defined in both ppafm/cpp/GridUtils.cpp and ppafm/cpp/ProbeParticle.cpp. The definitions are identical, so the duplication should not cause any problems until we happen to change one of them, but still, it does not feel right.

I agree that it is not nice from the point of view of code structure (modularity etc.). But the point is that GridUtils and ProbeParticle each compile into an independent binary (.so) and they do not depend on each other.

But yes, it would make sense to factor out these duplicated things into a single header file (.h). But I'm not sure how it works if the function is inside extern "C" {} (is it a problem to have extern "C" {} in a header file?)
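
For reference, extern "C" in a header is standard practice; the usual pattern guards it so the same header also compiles as plain C. A sketch (the file name and the signatures of setGridN/setGridCell here are illustrative, not copied from the actual .cpp files):

```cpp
// GridShared.h -- hypothetical shared header for the duplicated functions
#ifndef GRID_SHARED_H
#define GRID_SHARED_H

#ifdef __cplusplus
extern "C" {
#endif

// declarations only -- the single shared definition can live in one .cpp
// compiled into both libraries
void setGridN   ( int*    n    );
void setGridCell( double* cell );

#ifdef __cplusplus
}
#endif

#endif // GRID_SHARED_H
```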

ProkopHapala commented 1 year ago
  1. prepareArrays() does not call setFF_shape(); however, we always call setFF_shape() shortly after prepareArrays().
  2. setFF() calls setFF_shape() but, in contrast to prepareArrays(), it cannot create a new array if the corresponding array argument is None.

OK, to be honest, I don't remember exactly the code structure, the rationale, or the historical path that led to this. If you find it nicer that way, feel free to change it as you say.

mondracek commented 1 year ago

Good. Concerning the duplication of the setGridN() and setGridCell() functions, I'd rather leave it as it is for now, because I am also not sure how exactly the C linking works, and I might actually break it by trying to repair it. I guess moving the duplicate into a header should work in principle, although header files might not be exactly intended for such a purpose: including a header file in a *.cpp file through #include is (as far as I understand it) done by the preprocessor essentially by pasting the plain text of the *.h file into the *.cpp file before compilation. But I am not quite sure. If any of you knows how to handle this in C properly, you are welcome to propose a solution.

Otherwise, I see now that the cleanup I wanted to make will not require as radical changes to the code structure as I thought when I opened this issue. So I will rather pick up more specific tasks concerning the handling of geometry (and force-field grids) and do the cleaning along the way. One of those specific issues I am planning to address is introducing proper handling of the origin as recorded in *.xsf and *.cube files; see the second point in my first comment in this thread. Do you agree that this should be done? Other such issues are those referenced in my second comment here (#6, #190, #199, #200, #52).

NikoOinonen commented 1 year ago

@ProkopHapala

I think some troubles with molecular geometry and lattice would be reduced if we defined a class which packs them together (including the origin of the lattice). I know people were suggesting to use ASE, but to me it seems overkill (introducing a dependence on a huge 3rd-party package into the core of ppafm - that I don't like). I would rather make something like the class AtomicSystem which I have in my other project (it encapsulates reading and loading, and keeps the lattice vectors, origin, atomic types, positions, parameters, etc.). Then we can pass instances of that class into functions (rather than passing arrays independently), which simplifies the interface.

I'm also very much in support of packing things into a class object. And I already did something in this direction in the OCL API with the DataGrid (except without the atoms): https://github.com/Probe-Particle/ppafm/blob/4e04d47cefac340ab734a2ef4000cc901793c5b0/ppafm/ocl/field.py#L102-L112 It could be nice to somehow interface this with the CPU part, maybe make DataGrid a subclass of whatever class we create for the core code, to avoid repeated code and maybe help somehow unify the two parts.

As for ASE, I think it would be a pretty good option for handling things. I get that too many dependencies are not desirable, but ASE is a pretty well-established and stable package, and probably many people who use ppafm already have ASE in their environment. Also, I already used ASE as an optional dependency in the GUI for the geometry viewer.

Making our own solution is fine too, but it's a lot more work.

ProkopHapala commented 1 year ago

I'm also very much in support of packing things into a class object. And I already did something in this direction in the OCL API with the DataGrid (except without the atoms). It could be nice to somehow interface this with the CPU part, maybe make DataGrid a subclass of whatever class we create for the core code

Yes, I like your DataGrid class. Maybe we should discuss more what ideas you have about the class hierarchy or relations? You mention CPU vs GPU. But I would perhaps first speak about geometry (i.e. atomic coordinates and types) vs grid (3D array) vs lattice vectors.

  1. One option is to make one dynamic class which has optional fields (e.g. initialized to None): it may or may not have a data grid, it may or may not have atoms or lattice vectors.
  2. Another option would be to make a hierarchy, e.g. starting from an Atoms class, then adding lattice vectors (AtomsPBC) and then adding a grid (array) ... or in another order?
  3. Another option is to make 2-3 independent classes, e.g. (i) Atoms, which stores atomic positions and types, and (ii) Grid, which stores lattice vectors and an optional data array; and then pack them together in e.g. an AtomGrid class.

I'm not sure which is the best approach. Because Python is very dynamic, unlike C++, I would maybe slightly prefer (1), i.e. having optional dynamic fields. The disadvantage is that it would require putting checks everywhere for whether some field is initialized, like if( lvec is None ):

I have to say that one thing I hate about Python is that, due to the dynamic initialization of class properties, it is often not clear what properties a class is supposed to have when reading the code (at least compared to C++). This was actually the main reason why I considered classes in Python pointless for quite a long time. Do you know some good practices for making the declaration of class properties in Python more explicit?

Also, I already used ASE as an optional dependency in the GUI for the geometry viewer.

Don't get me wrong. I'm very much pro integration with ASE as an optional dependency. I also use ASE and I like it. But I don't like the idea of depending on it for such core functionality as loading and managing atomic coordinates (i.e. not being able to run ppafm at all without ASE).

NikoOinonen commented 1 year ago
  1. One option is to make one dynamic class which has optional fields (e.g. initialized to None): it may or may not have a data grid, it may or may not have atoms or lattice vectors.
  2. Another option would be to make a hierarchy, e.g. starting from an Atoms class, then adding lattice vectors (AtomsPBC) and then adding a grid (array) ... or in another order?
  3. Another option is to make 2-3 independent classes, e.g. (i) Atoms, which stores atomic positions and types, and (ii) Grid, which stores lattice vectors and an optional data array; and then pack them together in e.g. an AtomGrid class.

I would avoid deep class hierarchies (2). They become very rigid and difficult to change when new needs arise, and it's often not clear at which level of the hierarchy some piece of data should exist.

So I would opt for composition instead (1, 3). I see 1 and 3 as structurally the same: whether we store the lvec and grid as np.arrays or as custom classes of our own, in either case they exist as properties of the storing class, and that internal structure can be changed flexibly (see the sketch below).
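
For concreteness, a minimal sketch of what the composition in option 3 could look like (all names and fields are placeholders, not a proposal for the final API); as a side benefit, dataclasses also make the expected attributes explicit, which touches on the question about declaring class properties:

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class Atoms:
    positions: np.ndarray                  # (N, 3) coordinates
    types: np.ndarray                      # (N,) atomic numbers

@dataclass
class Grid:
    lvec: np.ndarray                       # (3, 3) lattice/grid vectors
    origin: np.ndarray = field(default_factory=lambda: np.zeros(3))
    data: Optional[np.ndarray] = None      # optional 3D data array

@dataclass
class AtomGrid:                            # composition, not inheritance
    atoms: Atoms
    grid: Grid
```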

I have to say that one thing I hate about Python is that, due to the dynamic initialization of class properties, it is often not clear what properties a class is supposed to have when reading the code (at least compared to C++). This was actually the main reason why I considered classes in Python pointless for quite a long time. Do you know some good practices for making the declaration of class properties in Python more explicit?

One way to do this is to make all of the internal variables private (prefix with an underscore _) and then make an interface using the @property decorator on class methods. For example, in the DataGrid I store the host numpy array like this: https://github.com/Probe-Particle/ppafm/blob/4e04d47cefac340ab734a2ef4000cc901793c5b0/ppafm/ocl/field.py#L118 but it can also sometimes be None, because the data exists on the GPU instead. Then I create a property method that returns the array after checking that it's fine: https://github.com/Probe-Particle/ppafm/blob/4e04d47cefac340ab734a2ef4000cc901793c5b0/ppafm/ocl/field.py#L152-L158

The @property decorator makes it so that you can access the method as if it's just an attribute, e.g.

array = datagrid_instance.array

With the doc string in the method, it also makes for nice documentation, so the API is clear: https://ppafm.readthedocs.io/en/latest/ppafm.ocl.html#ppafm.ocl.field.DataGrid.array
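
A minimal sketch of the pattern, simplified from the DataGrid linked above (not a copy of it):

```python
import numpy as np

class DataGridSketch:
    """Simplified illustration of the private-attribute + @property pattern."""

    def __init__(self, array=None):
        self._array = array        # private; may be None while the data is on the GPU

    @property
    def array(self):
        """Host-side copy of the grid data."""
        if self._array is None:
            raise RuntimeError("Array does not exist on the host.")
        return self._array

grid = DataGridSketch(np.zeros((4, 4, 4)))
arr = grid.array                   # attribute-style access via the property
```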

NikoOinonen commented 1 year ago

I'm not sure which is the best approach. Because Python is very dynamic, unlike C++, I would maybe slightly prefer (1), i.e. having optional dynamic fields. The disadvantage is that it would require putting checks everywhere for whether some field is initialized, like if( lvec is None ):

We can probably make this simpler for ourselves by insisting that some of the fields always exist. The lvec, for example, we can read from the default parameters if it is not set, as Martin suggested above.

I think this consideration also encourages splitting the functionality into multiple classes (option 3), because then if you only need the grid, for example, you can just use the Grid class instead of the AtomGrid.

mondracek commented 1 year ago
ProkopHapala commented 1 year ago

we pretend they are the same. I am not saying we need to change this approach right now, but we should keep in mind that we might need to differentiate between the grid vectors and lattice vectors in the future.

Not sure I understand.

mondracek commented 1 year ago

@ProkopHapala: Both, sort of.

I have now started working on the problem of properly defining the grid origin, as needed to solve #223. On that occasion, I noticed there are parameters called FFgrid0, FFgridA, FFgridB, and FFgridC, whose purpose is exactly to define a grid geometry different from the periodic lattice vectors, but only for the OCL version (see the discussion under #139). I do not really understand how the OCL version works, but can we perhaps adapt the solution used there for the CLI version too?

ProkopHapala commented 1 year ago

@ProkopHapala

the user may be interested in a relatively small scanning region that does not cover the whole system.

I see, makes sense.

Why bother splitting the code into several scripts unless the output from each of the scripts can stand on its own? If the user wishes to view and analyse these force fields separately, it would be desirable that the XSF files include information on the positions of atoms and on the lattice vectors.

I agree; I was also considering .xsf files as an intermediate result which allows easy debugging and eventually some way to analyze the results. ... Maybe I just don't understand the proper format of .xsf files. I did not see any problem there. If there is such a problem, it makes sense to correct it.

mondracek commented 1 year ago

@ProkopHapala: As for the problems with the XSF format:

Let me explain here how (to my understanding) the logic with the extra points on the grid cell boundary is supposed to work in the XSF files. First, if a grid vector is a = (ax, ay, az) and there are Na points along that vector, the elementary grid step along that vector is da = (ax, ay, az)/(Na-1). This is always true: it is part of the definition of the XSF format, and it is non-obvious; one can imagine a different format that would have Na instead of Na-1 in the denominator of that definition. For a general grid whose dimensions have nothing to do with the periodic unit cell, this is the end of the story. However, there is a little problem when we want the grid to cover exactly the unit cell. Imagine that a = (ax, ay, az) is now the lattice vector. We want to sample the cell by Na points along this vector, with the grid points being (of course) equally spaced and, for starters, unique: they do not repeat themselves because of the periodicity. Under this condition, the spacing between the points should of course be da = (ax, ay, az)/Na. So what should the corresponding grid vector be? Considering the formula that defines the relation between the grid vector and the grid spacing, it has to be a' = a*(Na-1)/Na.

But this is inelegant; we would prefer to have a' = a, and this is what most codes (but not FHI-AIMS!) generating XSF files for periodic structures choose to enforce. The price to pay is adding a last point along every axis that is an identical copy of the first point by the lattice periodicity. So now we have Na+1 points instead of Na, and the last point is somewhat superfluous in terms of information content, but the lattice vectors and grid vectors are the same, which makes us happy, because we can immediately see (looking at the XSF file header) that the grid indeed covers the whole unit cell. So, to sum up
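
To make the two conventions concrete, here is a small numerical example (the toy values are mine, not from any actual file):

```python
a, Na = 10.0, 5                  # toy lattice vector length and point count

# (i) unique points only: shrink the grid vector to a' = a*(Na-1)/Na,
# so that the XSF step a'/(Na-1) equals the desired spacing a/Na
a_prime = a * (Na - 1) / Na      # 8.0
step_i  = a_prime / (Na - 1)     # 2.0  == a/Na

# (ii) keep a' = a but store Na+1 points, the last being a periodic copy
# of the first; the XSF step a/((Na+1)-1) then gives the same spacing
step_ii = a / ((Na + 1) - 1)     # 2.0

print(step_i, step_ii)           # 2.0 2.0
```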

mondracek commented 1 year ago

@ProkopHapala Is there any good reason why the order of indices for the grid array is FF[iz,iy,ix] rather than FF[ix,iy,iz] in the CLI version? I used to think this was needed for the C++ parts of the code to be efficient, but now, as I look deeper into it, that does not seem to be the case.

ProkopHapala commented 1 year ago

Is there any good reason why the order of indices for the grid array is FF[iz,iy,ix] rather than FF[ix,iy,iz] in the CLI version? I used to think this was needed for the C++ parts of the code to be efficient, but now, as I look deeper into it, that does not seem to be the case.

I think efficiency-wise it is rather the opposite (we compute each z-stroke independently => better if z is the fastest axis in the array).

I think the reason for the FF[iz,iy,ix] ordering was perhaps plotting? (z-slices), but that is not a very good motivation (matplotlib can plot FF[:,:,iz] as well as FF[iz,:,:]). Maybe it is somehow useful in the Fz->df conversion (but perhaps not), or for saving the data to .xsf? I don't know. Eventually we may change the order ... but it would require changes in many places in the code ... so unless there is a good reason ...?


mondracek commented 1 year ago

@ProkopHapala, thanks for the explanation with respect to the array ordering. As for the Fz->df conversion and saving *.xsf files, I happen to know something about that stuff, and no, the reversed index order is not needed for it. In the Fz to df conversion it does not matter: the conversion is in the end just a convolution of Fz with the appropriate kernel along the z axis, and we would literally only have to change axis=0 to axis=2 in an argument to an np.apply_along_axis call at one place in the code in order to adapt the Fz2df function to the natural ordering (see the sketch below).

In reading and saving *.xsf files, the reversed ordering actually complicates things, because there is a difference between the CLI and OCL versions in the array ordering, so the loadXSF and loadCUBE functions need an xyz_order argument which tells them whether they should transpose the array or not. Now I see that saveXSF does not distinguish between the CLI and OCL versions but rather always assumes the reversed order. I did not notice this last fact before. I do not know how it works in the OCL version; perhaps the ordering is really not consistent: arrays we need to read from XSF are ordered one way, arrays we need to save to XSF the other way, and we hope there is nothing we want to read, modify without doing the transposition, and then save. I was about to volunteer to change the ordering so that it would always be FF[ix,iy,iz], but I thought it would not require touching the OCL part. Now that I see the issue affects both the CLI and OCL versions, I am not sure I want to do it.
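
A toy sketch of that one-line change (the function body and the kernel here are placeholders standing in for the actual Fz2df implementation):

```python
import numpy as np

def fz_to_df(Fz, kernel, z_axis):
    # apply a 1D convolution along the z axis of the grid; only the
    # axis argument changes between the two orderings
    return np.apply_along_axis(
        lambda f: np.convolve(f, kernel, mode="valid"), z_axis, Fz
    )

Fz = np.random.rand(50, 8, 8)       # toy force grid in [iz, iy, ix] order
kernel = np.ones(5) / 5             # placeholder weight function

df_zyx = fz_to_df(Fz, kernel, z_axis=0)                     # current ordering
df_xyz = fz_to_df(Fz.transpose(2, 1, 0), kernel, z_axis=2)  # proposed ordering
```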

NikoOinonen commented 1 year ago

@mondracek As far as I understood, the zyx ordering in the IO functions was mostly because that is how the XSF file format is defined. I specifically added those xyz_order arguments to the functions so that I could get the arrays in the xyz order to use in the OCL code.

It should not be a big change. I think the only places in the OCL code that do IO related things are these two methods: https://github.com/Probe-Particle/ppafm/blob/9f009aec25d8d895f4168f061501211a37a9dd1c/ppafm/ocl/field.py#L192-L195 https://github.com/Probe-Particle/ppafm/blob/9f009aec25d8d895f4168f061501211a37a9dd1c/ppafm/ocl/field.py#L227-L230 The to_file method in particular transposes the array before saving.

mondracek commented 1 year ago

@NikoOinonen, thanks for the explanation. So @ProkopHapala was right after all that the likely original motivation for the zyx order in the CLI code was reading and saving XSF files. The native ordering in an XSF file is the opposite of what we'd need, in either CLI or OCL. It seems that historically, the approach in the CLI was to accept that ordering and interpret the array indices as ZYX, while a transpose was performed for OCL. Okay, so we would not have to touch the OCL version at all in order to introduce consistent usage of the XYZ order, but thorough changes would be needed in the CLI. Once done, the I/O functions would not have to differentiate between CLI and OCL, and the CLI code would be a bit easier to understand, but there is a big danger of introducing errors into the CLI code while implementing the change. Though the errors should be easy to spot by testing, since a wrong data order on the grid would destroy the resulting figures completely. What do you think, is it worth the effort? I will not push the change at this moment myself, but if anyone else would like it to be done, I am still ready to implement it.

mondracek commented 1 year ago

As for the OutFz.xsf files I mentioned earlier in this thread, I see that the inclusion of the sample's atomic structure in those files has already been implemented. I was probably looking at files produced by an old version of the code when writing my earlier comments. But the extra points at the scanning-region boundary are probably still there.

ProkopHapala commented 1 year ago

What do you think, is it worth the effort?

I think it is worth it.

I will not push the change at this moment myself, but if anyone else would like it to be done, I am still ready to implement it.

If you implement it, I'm pro :-) I think it is worth it, but perhaps it does not have high priority.

Though the errors should be easy to spot by testing,

Copy that