Andrew-S-Rosen / QMOF

The QMOF Database: A database of quantum-mechanical properties for metal-organic frameworks.
MIT License

process the MOF_data to get the file of opt-geometries.xyz #32

Closed fei-wang-1314 closed 1 year ago

fei-wang-1314 commented 1 year ago

Hi, how do I process the MOF_data to get the opt-geometries.xyz file? Thank you.

Andrew-S-Rosen commented 1 year ago

@fei-wang-1314: Thanks for asking. Since the original release of the QMOF Database, I have updated the files to be in a more accessible format (JSON and CIF). If you'd like to process the data to make a single .xyz file, simply do


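import os
from ase.io import read, write

p = '/path/to/cifs'
mofs = []
# Sort so the structure order is reproducible; skip any non-CIF files
for cif in sorted(os.listdir(p)):
    if cif.endswith('.cif'):
        mofs.append(read(os.path.join(p, cif)))
# Write all structures to a single multi-frame .xyz file
write('opt-geometries.xyz', mofs)

fei-wang-1314 commented 1 year ago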

@arosen93 Hi, thank you very much. Your reply helped me a lot. I am not a chemistry researcher; I major in computer science and plan to design a new AI model for this dataset. I have a few questions I would like to ask you. In the repo https://github.com/usccolumbia/deeperGATGNN, the author preprocesses QMOF as follows:

  import ase
  import os
  from ase.io import read
  import numpy as np
  import csv

  mofs = ase.io.read('opt-geometries.xyz',index=':')
  refcodes = np.genfromtxt('opt-refcodes.csv',delimiter=',',dtype=str)
  properties = np.genfromtxt('opt-bandgaps.csv',delimiter=',',dtype=str)
  print(len(mofs), len(refcodes), properties.shape)

  if not os.path.exists('MOF_data'):
    os.mkdir('MOF_data')
  count=0
  targets=[]
  for i in range(0, len(refcodes)):
    ase.io.write(os.path.join('MOF_data',str(refcodes[i])+'.json'), mofs[i])
    targets.append([str(refcodes[i]), properties[i+1,1]])
    count=count+1
  with open(os.path.join('MOF_data',"targets.csv"), 'w', newline='') as f:
    wr = csv.writer(f)
    wr.writerows(targets)        
  print(count)

As shown in these three lines:

  mofs = ase.io.read('opt-geometries.xyz',index=':')
  refcodes = np.genfromtxt('opt-refcodes.csv',delimiter=',',dtype=str)
  properties = np.genfromtxt('opt-bandgaps.csv',delimiter=',',dtype=str)

The author does not provide the three files opt-geometries.xyz, opt-refcodes.csv, and opt-bandgaps.csv, so they must be generated before the QMOF dataset can be preprocessed. How should I process the QMOF dataset to get these three files? Could you provide some Python code for this? Your time and effort are highly appreciated.

Thank you!

Andrew-S-Rosen commented 1 year ago

The data can be readily obtained from the JSON file provided in the Figshare repository. Pull out the data you'd like and simply convert it to csv. Unfortunately, I can't help with using someone else's code.

Alternatively, use an older version of the Figshare repository, from when the data was still distributed as CSV files.
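For the JSON route mentioned above, a minimal sketch of pulling one property out and writing it to CSV might look like the following. The helper names `get_nested` and `json_to_csv` are hypothetical, and the default keys (`name`, `outputs.pbe.bandgap`) are assumptions about how the records are laid out; check them against the JSON file you actually downloaded before relying on them.

```python
import csv
import json


def get_nested(record, dotted_key):
    """Follow a dotted key path such as 'outputs.pbe.bandgap' through nested dicts."""
    value = record
    for part in dotted_key.split('.'):
        value = value[part]
    return value


def json_to_csv(json_path, csv_path,
                id_key='name',                     # assumed identifier key
                value_key='outputs.pbe.bandgap'):  # assumed property key
    """Write one (identifier, value) row per record in a JSON list of dicts."""
    with open(json_path) as f:
        records = json.load(f)
    rows = [[get_nested(r, id_key), get_nested(r, value_key)] for r in records]
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([id_key, value_key])  # header row
        writer.writerows(rows)
    return rows
```

The header row is deliberate: the preprocessing script quoted earlier in this thread reads `properties[i+1,1]`, which skips the first row and takes the second column. A record that lacks one of the keys raises a `KeyError` here rather than being silently skipped.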

fei-wang-1314 commented 1 year ago

@arosen93 Thank you very much. Following your advice, I found the band gap file at 'https://figshare.com/articles/dataset/Ensemble_Band_Gap_Data/8295503?file=15544181'. The two files read here:

  mofs = ase.io.read('opt-geometries.xyz',index=':')
  refcodes = np.genfromtxt('opt-refcodes.csv',delimiter=',',dtype=str)

can be generated using the cifs_to_xyz.py file in this repo.
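
Because the preprocessing script pairs `mofs[i]` with `refcodes[i]` by position, the refcode list has to follow the same order as the structures in the .xyz file. The sketch below is one way to write a matching opt-refcodes.csv. It is not the contents of cifs_to_xyz.py; `write_refcodes` is a hypothetical helper, and it assumes that each refcode is the CIF filename without its extension and that the .xyz file was written by iterating over the CIF files in sorted filename order.

```python
import csv
from pathlib import Path


def write_refcodes(cif_dir, out_csv='opt-refcodes.csv'):
    """Write one CIF file stem per row, in sorted-filename order."""
    # Sort on the full filename so the order matches sorted(os.listdir(cif_dir))
    cif_paths = sorted(Path(cif_dir).glob('*.cif'), key=lambda p: p.name)
    refcodes = [p.stem for p in cif_paths]  # assumption: refcode == filename stem
    with open(out_csv, 'w', newline='') as f:
        csv.writer(f).writerows([code] for code in refcodes)
    return refcodes
```

Comparing `len(refcodes)` with the number of structures read from the .xyz file is a cheap check that the two files still line up.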