database: database fields

ixsluo commented 1 year ago

Description & Motivation

calydb database, rawcol table

MongoDB fields

{
"_id":               ...,
"material_id":       "caly-{index}",         # str
"source": {
    "name": "calypso",      # calypso or materials project, etc.
    "index": 0,             # int index in this source, or None
},
"elements":          ["H", "O"],         # list of str
"nelements":         2,                  # int
"elemcount":         [2, 1],             # list of int
"species":           ["H", "H", "O"],    # list of str, species of each atom
"formula":           "H2O",              # str, metal and alphabet order
"reduced_formula":   "H2O",              # str, metal and alphabet order
"natoms":            3,                  # int
"cell":              3*3,
"positions":         natoms*3,
"scaled_positions":  natoms*3,
"forces":            natoms*3,
"enthalpy":          0.0,
"enthalpy_per_atom": 0.0,
"volume":            0.0,
"volume_per_atom":   0.0,
"density":           0.0,
"clospack_volume":          0.0,    # float, A^3
"clospack_volume_per_atom": 0.0,    # float, A^3
"clospack_density":         0.0,    # float, g/cm^3
"pressure":                 0.0,    # float, GPa
"pressure_range": {             # each pressure is set to a bin
    "mid":          "0.0",      # starts=-0.1, width=0.2
    "length":       "0.2",      # e.g. 10 -> (9.9, 10.1]
    "closed":       "right",    # default left-open right-closed
},                              # for group structures

"trajectory": {
    "nframes":           0,                  # int
    "cell":              nframes*3*3,        # np.ndarray, angstrom
    "positions":         nframes*natoms*3,   # np.ndarray, angstrom
    "scaled_positions":  nframes*natoms*3,   # np.ndarray
    "forces":            nframes*natoms*3,   # np.ndarray
    "volume":            nframes,            # np.ndarray, A^3
    "enthalpy":          nframes,            # np.ndarray, eV
    "enthalpy_per_atom": nframes,            # np.ndarray, eV
    "source":                   [],     # source path of each frame, for dev
    "source_idx":               [],     # index in each source, for dev
    "source_dir":               "",     # source dir, for dev
},

"calyconfig": {
    "version": "legacy",                 # input.dat
    "icode": 1,    
    ...
},
"dftconfig":                  [incar1, ...],       # list of str, for multi optimization
"pseudopotential":            [head1, ...],        # list of str, for each element
"symmetry": {                                      # str, symprec, %.0e
    "1e-1": {"number": 187, "symbol": "P -6 m 2",},
    "1e-2": {...},
    "1e-5": {...},
},
"donator":            {"name": "", "email": ""},
"deprecated":         False,                      # bool
"deprecated_reason":  "...",                      # str
"last_updated_utc":   datetime.utcnow(),
}
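
The binning described under "pressure_range" (bins starting at -0.1 GPa, width 0.2 GPa, left-open and right-closed) can be sketched as below. This is only an illustration; the function name `pressure_bin` and its parameters are hypothetical and not taken from the project.

```python
import math


def pressure_bin(pressure, start=-0.1, width=0.2):
    """Return the "pressure_range" entry for the (lower, upper] bin holding `pressure`.

    Hypothetical helper: bins are (start + k*width, start + (k+1)*width].
    """
    # Index k of the left-open, right-closed bin that contains the pressure.
    k = math.ceil((pressure - start) / width) - 1
    # Round away float noise; "+ 0.0" turns a possible -0.0 into 0.0.
    mid = round(start + (k + 0.5) * width, 10) + 0.0
    # One decimal is enough for the default width of 0.2 (an assumption).
    return {"mid": f"{mid:.1f}", "length": f"{width:.1f}", "closed": "right"}
```

With the defaults, a pressure of 10 GPa falls in the bin (9.9, 10.1], whose "mid" is stored as "10.0".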

Pitch

No response

Alternatives

No response

Additional context

To be updated as needed

ixsluo commented 1 year ago

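Detecting and recording duplicate structures

Method: pymatgen StructureMatcher (find each group of matching structures and the lowest-energy structure in it) (details to be added)

The records are kept separate from rawcol and stored in their own uniquecol table, but each record still stores the corresponding _id values in rawcol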


{
"_id":                    ...,             # same _id as the lowest-enthalpy structure in rawcol
"formula":                ...,             # str
"elements":               [...],
"nelements":              2,
"natoms":                 3,
"pressure":               0.0,
"enthalpy_per_atom":      0.0,             # lowest energy, for convenient querying
"volume":                 0.0,             # for convenient querying
...                                        # (other properties to be added)
"cell":                   3x3,             # for extracting the structure directly
"positions":              ndarray,
"scaled_positions":       ndarray,
"symmetry": {                              # a single space group entry is enough
    "symprec": 1e-2,
    "number": 187,
    "symbol": "P-6m2",
},
"rawid": [                                 # _id of every rawcol structure matching this one
    ObjectId(...),
    ObjectId(...),
]
}

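As a rough illustration of how one uniquecol record could be assembled from a set of rawcol documents already judged to be the same structure, here is a minimal sketch. The function name `build_unique_doc` and the list of copied keys are assumptions for illustration only; the final fields are defined by the project, not by this snippet.

```python
def build_unique_doc(raw_docs, symprec="1e-2"):
    """Collapse equivalent rawcol documents into one uniquecol document.

    Hypothetical helper: `raw_docs` is a non-empty list of dicts that
    follow the rawcol layout sketched in this thread.
    """
    # The representative is the structure with the lowest enthalpy per atom.
    best = min(raw_docs, key=lambda d: d["enthalpy_per_atom"])
    copied = ("_id", "formula", "elements", "nelements", "natoms", "pressure",
              "enthalpy_per_atom", "volume", "cell", "positions",
              "scaled_positions")
    doc = {key: best[key] for key in copied if key in best}
    # Keep only one space group entry, taken at the chosen symprec.
    sym = best.get("symmetry", {}).get(symprec)
    if sym is not None:
        doc["symmetry"] = {"symprec": float(symprec), **sym}
    # Remember every rawcol _id that maps onto this unique structure.
    doc["rawid"] = [d["_id"] for d in raw_docs]
    return doc
```
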
ixsluo commented 1 year ago

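To keep the rawdata fields consistent with the fields of a single structure, multi-frame data goes under the traj prefix, and the final-step structure is stored separately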

wangzyphysics commented 1 year ago

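Approach for judging structure similarity

Split the structures into segments by space group and energy, and compare within each segment
Define a function whose input is a batch of structures {s1,s2,...} and whose output is a dict, {s1:{'origin': True/False, 'same_list': [s2, s5]}}

The origin key marks which structure is picked from each pile of similar structures. same_list lets several different ini structures map to the same opt when building the ini-opt mapping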
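
The output format above could be produced from already-grouped structures roughly as follows. This is a sketch under assumptions: the name `describe_groups` is hypothetical, structures are referred to by id, and the lowest-energy member is taken as the origin, which the comment above does not specify.

```python
def describe_groups(groups, energies):
    """Build {id: {'origin': bool, 'same_list': [...]}} from similarity groups.

    Hypothetical helper: `groups` is a list of lists of structure ids that
    were judged similar; `energies` maps each id to its energy.
    """
    result = {}
    for group in groups:
        # Assumption: the lowest-energy member represents the group.
        origin = min(group, key=energies.__getitem__)
        for sid in group:
            result[sid] = {
                "origin": sid == origin,
                "same_list": [other for other in group if other != sid],
            }
    return result
```
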

ixsluo commented 1 year ago

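Approach for judging structure similarity

Split the structures into segments by space group and energy, and compare within each segment

Define a function whose input is a batch of structures {s1,s2,...} and whose output is a dict, {s1:{'origin': True/False, 'same_list': [s2, s5]}}

The origin key marks which structure is picked from each pile of similar structures. same_list lets several different ini structures map to the same opt when building the ini-opt mapping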

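Group by pressure range and formula
Then sort by energy and split by energy gap, i.e. treat neighbors whose energy difference exceeds δ (e.g. 10 meV) as definitely different, to reduce computation
Then group by space group (1e-2), to reduce computation
Within each group, run the pymatgen match on every pair to determine connectivity
Take the lowest-energy structure in each connected component (disconnected subgraph) as the unique structure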
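
The last three steps can be sketched as below, assuming the input already shares one pressure bin, formula and space group. The names `split_by_energy_gap`, `unique_ids` and the `is_match` callable are hypothetical; with pymatgen, `is_match` could wrap `StructureMatcher().fit(s1, s2)`, but any pairwise predicate works. Energies are assumed to be in eV, so 10 meV is 0.010.

```python
from itertools import combinations


def split_by_energy_gap(items, delta=0.010):
    """Sort (id, energy) pairs and cut wherever neighbors differ by more than delta."""
    chunks = []
    for item in sorted(items, key=lambda pair: pair[1]):
        if chunks and item[1] - chunks[-1][-1][1] <= delta:
            chunks[-1].append(item)
        else:
            chunks.append([item])
    return chunks


def unique_ids(items, is_match, delta=0.010):
    """Return the lowest-energy id of every connected component."""
    uniques = []
    for chunk in split_by_energy_gap(items, delta):
        parent = {sid: sid for sid, _ in chunk}

        def find(x):
            # Union-find lookup with path halving.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        # Pairwise matching inside the chunk defines the graph edges.
        for (a, _), (b, _) in combinations(chunk, 2):
            if is_match(a, b):
                parent[find(a)] = find(b)

        # The chunk is energy-sorted, so the first id seen per root is the lowest.
        seen = set()
        for sid, _ in chunk:
            root = find(sid)
            if root not in seen:
                seen.add(root)
                uniques.append(sid)
    return uniques
```

Using connected components rather than direct matches means that if a matches b and b matches c, all three collapse into one unique structure even when a and c do not match directly.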

ixsluo commented 1 year ago

See the README for the final fields

ICCMS-CALYPSO / CALYPSO-kit