"verdi data structure export -F cif" prints a weird CIF

danieleongari commented 5 years ago

When using verdi data structure export -F cif {pk}, using a structure imported as:

structure = CifData(file="path/to/file.cif").get_structure()

or with other methods, the output is:

\
##########################################################################
#               Crystallographic Information Format file
#               Produced by PyCifRW module
#
#  This is a CIF file.  CIF has been adopted by the International
#  Union of Crystallography as the standard for data archiving and
#  transmission.
#
#  For information on this file format, follow the CIF links at
#  http://www.iucr.org
##########################################################################

data_0

loop_
  _atom_site_label
  _atom_site_fract_x
  _atom_site_fract_y
  _atom_site_fract_z
  _atom_site_type_symbol
         Cu1       0.6622700000000066  0.4919400000000052  0.9315999999999788  Cu       
         Cu2       0.06541999999999913           0.9895999999999744  0.5822199999999982 Cu       
         Cu3       0.05868999999999878           0.5088899999999885  0.07581999999999815          Cu       
         Cu4       0.659620000000003   0.0129099999999998  0.4293500000000015  Cu       
         Cu5       0.5664800000000144  0.5118599999999801  0.9333299999999778  Cu       
[...]
         O36       0.11567999999999903           0.7710799999999893  0.9415500000000252 O        
         O37       0.610409999999993   0.23086999999999813           0.057639999999999504         O        
         O38       0.0864299999999989  0.26857999999999793           0.4269500000000004 O        
         O39       0.6108600000000034  0.27705999999999786           0.5627399999999987 O        
         O40       0.047789999999999624          0.25854999999999817           0.745750000000006  O
_cell_angle_alpha                       89.9669
_cell_angle_beta                        91.05799999999998
_cell_angle_gamma                       91.35589999999999
_cell_length_a                          26.662000000000095
_cell_length_b                          31.92000000000033
_cell_length_c                          16.2887
loop_
  _symmetry_equiv_pos_as_xyz
         'x, y, z' 
_symmetry_int_tables_number             1
_symmetry_space_group_name_H-M          'P 1'

I see two problems with this: 1) the cell is printed after the coordinates, which is unconventional and it will make the parser of the most of the programs fail. 2) some unwanted indexing is appended to the element name of _atom_site_label.

danieleongari commented 5 years ago

@ltalirz @yakutovicha , I want to raise the attention on this problem of converting StructureData to CifData, where the inner code is something like:

CifData(ase=structuredataobject.get_ase())

which print such as unconventional CIF format, using PyCifRW. Since we use CifData objects a lot, can we fix this?

mpougin commented 3 years ago

@ltalirz @yakutovicha can you tell me if somebody looked into this issue in the past? I want to run a raspa calculation, where I define different H-atom-types in my force field, i.e. I have to set RemoveAtomNumberCodeFromLabel to false. This results in a ValueError:

ValueError: invalid attribute value extra keys not allowed @ data['RemoveAtomNumberCodeFromLabel']

I assume, the setting of this parameter was switched off due to the issue described by Daniele.

ltalirz commented 3 years ago

@danieleongari can you briefly elaborate why you are linking to this issue from https://github.com/lsmo-epfl/aiida-lsmo/blob/0999ccec3e445cfd0dfd37a65ab013299a5f7d51/aiida_lsmo/workchains/sim_annealing.py#L154

What exactly would be needed in the CIF output to be able to set RemoveAtomNumberCodeFromLabel to false?

danieleongari commented 3 years ago

I linked to this issue to show an example on how the handling of cif/structure data in AiiDA 1 by PyCifRW was automatically assigning some sorted index as _atom_site_label, which is later making Raspa create one atom type for every label. Even if the value of the single atom force field is fine according to the "element+underscore" convention of Raspa, but considering that all the combinations of parameters for atom-atom are printed in the output file, this could make each file >1MB bigger because of cumbersome lines.

In general, I believe one should be very careful in using different atom types for framework's atoms in these work chains, because it is likely that the label of the atom type gets altered somewhere. In our use case there was not any need for such a feature and I did never fully investigated into that: setting RemoveAtomNumberCodeFromLabel=False was a quick and effective way to avoid any problem.

ltalirz commented 3 years ago

thanks for the quick explanation!

ltalirz commented 3 years ago

So, to summarize my understanding:

The current CIF output generated by get_structure() results in different _atom_site_label values for each atom (index appended)
This triggers raspa to create a separate atom type for each atom, which unnecessarily complicates both the interaction computation and the raspa output
RemoveAtomNumberCodeFromLabel=True is a workaround that avoids this problem by removing the the appended index before passing the file to raspa. Of course, this would also affect a user-defined index like in the case of @mpougin , resulting (incorrectly) in the same atom type

A quick workaround for @mpougin could be to use a fantasy site label with a different letter (?)

In the medium term, we may look into making it possible to tell the CIF output not to append the indices, which would then allow us to avoid any postprocessing of the atom labels passed to raspa.

Does that sound about right?

ltalirz commented 3 years ago

@mpougin that does sound like a different problem - perhaps open a separate issue and include the full stack trace of the error, including information on versions of aiida-lsmo, aiida-core, python, etc.

aiidateam / aiida-core

"verdi data structure export -F cif" prints a weird CIF #3304