FAIR-Chem / fairchem

FAIR Chemistry's library of machine learning methods for chemistry
https://opencatalystproject.org/

Convert extxyz data into lmdb #572

Closed: Anandu07 closed this issue 11 months ago

Anandu07 commented 1 year ago

I've downloaded a dataset for specific adsorbates from DATASET_PER_ADSORBATE.md because I need to train a model specifically for these adsorbates. To proceed with training, I'm trying to convert this dataset into LMDB format with a script. However, the code I used only produces .mdb files, which throw an error when I try to train the model. I would greatly appreciate any suggestions or a code snippet to help with this.

Below is the code I used to convert to LMDB format:


```python
import lzma
import os
import pickle

import lmdb
from ase.io import read

# folder_path (directory holding the .extxyz.xz files) and lmdb_path are set elsewhere
env = lmdb.open(lmdb_path, map_size=int(1e12))  # ~1 TB map size, passed as an integer

with env.begin(write=True) as txn:
    for filename in os.listdir(folder_path):
        if filename.endswith('.extxyz.xz'):
            # decompress the .xz file
            with lzma.open(os.path.join(folder_path, filename)) as f:
                decompressed_data = f.read()

            # write the decompressed data to a .extxyz file
            with open(os.path.join(folder_path, filename[:-3]), 'wb') as f:
                f.write(decompressed_data)

            # read the .extxyz file using ASE
            # (without an index argument, read() returns only the last frame)
            atoms = read(os.path.join(folder_path, filename[:-3]))

            # serialize the atoms object and store it in the lmdb file
            txn.put(filename[:-3].encode(), pickle.dumps(atoms))
```

mshuaibii commented 1 year ago

Hey -

Assuming you're using the LmdbDataset to read the LMDB, you're getting an error because our trainers expect the data in a specific format rather than as ASE objects.

Are you trying to train an S2EF or an IS2RE model? The data is expected in a different format depending on which one. I can provide a sample script once I have a better idea of the problem.

Anandu07 commented 1 year ago

Thanks for the response. I'm trying to retrain/fine-tune an S2EF model (SCN/Equiformer) on specific adsorbate data. Later I want to experiment with IS2RE models as well :).

mshuaibii commented 1 year ago

Yup. Take a look at the end sections of the tutorial at https://github.com/Open-Catalyst-Project/ocp/blob/main/tutorials/OCP_Tutorial.ipynb, under "(Optional) Creating your own LMDBs for use in the OCP repository". This should help you get set up for creating S2EF and IS2RE datasets. Replace system_paths with the paths to your extxyz files; this code works for any ASE-parseable data format.
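
For reference, a rough sketch of what that tutorial section does for S2EF data, as far as I recall it; check it against the notebook before relying on it. The helper names (`lmdb_records`, `write_single_file_lmdb`, `extxyz_to_graphs`) are made up for illustration, and the `AtomsToGraphs` arguments and the `sid`/`fid`/`tags` assignments are assumptions based on the tutorial. With py-lmdb, `subdir=False` writes a single `.lmdb` file; the default `subdir=True` creates a directory containing `data.mdb` and `lock.mdb`.

```python
import pickle


def lmdb_records(samples):
    """Build (key, value) byte pairs: ASCII integer keys plus a 'length' entry."""
    samples = list(samples)
    records = [
        (str(i).encode("ascii"), pickle.dumps(s, protocol=-1))
        for i, s in enumerate(samples)
    ]
    records.append((b"length", pickle.dumps(len(samples), protocol=-1)))
    return records


def write_single_file_lmdb(path, records, map_size=2**40):
    """Write the records into one LMDB file (subdir=False), not a directory."""
    import lmdb  # third-party: py-lmdb

    env = lmdb.open(
        path, map_size=map_size, subdir=False, meminit=False, map_async=True
    )
    with env.begin(write=True) as txn:
        for key, value in records:
            txn.put(key, value)
    env.sync()
    env.close()


def extxyz_to_graphs(system_paths):
    """Convert every frame of every file into a graph object (assumed settings)."""
    import ase.io
    import torch
    # Newer releases expose this under fairchem.core.preprocessing instead.
    from ocpmodels.preprocessing import AtomsToGraphs

    a2g = AtomsToGraphs(
        max_neigh=50, radius=6, r_energy=True, r_forces=True,
        r_distances=False, r_fixed=True,
    )
    graphs = []
    for sid, path in enumerate(system_paths):
        frames = ase.io.read(path, index=":")  # all frames, not only the last
        for fid, data in enumerate(a2g.convert_all(frames, disable_tqdm=True)):
            data.sid = torch.LongTensor([sid])   # system id (assumed: one per file)
            data.fid = torch.LongTensor([fid])   # frame id within the system
            data.tags = torch.LongTensor(frames[fid].get_tags())
            graphs.append(data)
    return graphs
```

Usage would look something like `write_single_file_lmdb("train/data.lmdb", lmdb_records(extxyz_to_graphs(system_paths)))`. Everything is held in memory here, so for a large download it is probably better to write in batches.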

Let me know if you have any further questions.

@emsunshine If the ASE lmdb is easier to use here, maybe you can provide guidance on that.

emsunshine commented 1 year ago

I would definitely recommend using one of the ASE datasets in this scenario. If you have ASE-readable files or an ASE DB you can avoid dealing with LMDBs. Here is some more information.
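
The per-adsorbate downloads in this thread come as `.extxyz.xz` files. ASE can usually read `.xz`-compressed files directly, so this step may be optional, but if plain `.extxyz` files are more convenient, a minimal standard-library sketch for decompressing a folder could look like the following. The function name `decompress_xz_files` is hypothetical, and unlike the snippet in the question it streams each file instead of loading it fully into memory.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz_files(folder, suffix=".extxyz.xz"):
    """Stream each *.extxyz.xz file in `folder` to a sibling *.extxyz file."""
    written = []
    for src in sorted(Path(folder).glob(f"*{suffix}")):
        dst = src.with_suffix("")  # drops only the trailing ".xz"
        with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(dst)
    return written
```

The returned list of paths can then be passed to whatever ASE-based reader is used.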

github-actions[bot] commented 11 months ago

This issue has been marked as stale because it has been open for 30 days with no activity.