aimat-lab / 3DSC

Repo for the paper publishing the superconductor database with 3D crystal structures.

Details about generating the ICSD version #2

Open YanjunLiu2 opened 2 months ago

YanjunLiu2 commented 2 months ago

Hi,

I got the license to access the ICSD API, but it's a bit unclear what I should do to generate the 3DSC_ICSD dataset. Should I download all the CIFs myself and put them into a folder? Or do you have the download process already built in? Sorry, I'm not really good at reading code. Thank you!

Best wishes,

Yanjun

TimoSommer commented 2 months ago

Hey Yanjun,

Yes, the first step is to download all the CIFs into a directory. You can do this using the excellent repo https://github.com/simonverret/materials_data_api_scripts, or you can use the copy of that repo that is already included in the 3DSC repo under 3DSC/superconductors_3D/dataset_preparation/dataset_download/materials_data_api_scripts-master. I remember making some small changes to the code, so I would recommend trying the copy in my repo first, but if you get stuck, just check out Simon's original repo.

This code should then download all the cif files into a directory, and also give you a .csv file with information about all the downloaded cifs. You should then put the cifs under 3DSC/data/source/ICSD/raw/cifs/ and the .csv under 3DSC/data/source/ICSD/raw/0_all_data_ICSD.csv. You can then run the script generate_3DSC.py.

If you run into any issues, I would recommend running this script in a Python debugger rather than from the command line. That way, you can step through the code line by line and see exactly how it works. That's what I usually do when I want to understand code that is new to me. Both PyCharm and Spyder have good debuggers.

Let me know how it works!

Best regards, Timo

YanjunLiu2 commented 2 months ago

Hi Timo,

Thank you! I'm now trying to download the CIFs. I saw that download.py is missing from your icsd folder, so I copied the one from Simon's repo into the folder and ran it, but unfortunately I got:

(ICSD) yanjunliu@dhcp-vl2041-23489 materials_datasets % python icsd/download.py
Traceback (most recent call last):
  File "/Users/yanjunliu/Documents/Read_database/materials_data_api_scripts-master/materials_datasets/icsd/download.py", line 153, in <module>
    download_all(usrname, passwrd, ICSD_PKL)
  File "/Users/yanjunliu/Documents/Read_database/materials_data_api_scripts-master/materials_datasets/icsd/download.py", line 131, in download_all
    with ICSD_Session(loginid, password) as icsd:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yanjunliu/Documents/Read_database/materials_data_api_scripts-master/materials_datasets/icsd/download.py", line 25, in __init__
    self.login_token = self.login() # sets self.login_token
                       ^^^^^^^^^^^^
  File "/Users/yanjunliu/Documents/Read_database/materials_data_api_scripts-master/materials_datasets/icsd/download.py", line 52, in login
    raise ConnectionError(login_response.headers)
ConnectionError: {'Set-Cookie': 'ICSDCHECK=1727465301266; Path=/, JSESSIONID=7D1BF39A51CB269160AE2159789F9B26; Path=/; HttpOnly, FIZ-Cookie=238830221.16671.0000; path=/; Httponly; Secure', 'Content-Language': 'en', 'Content-Type': 'text/plain', 'Content-Length': '104', 'Date': 'Fri, 27 Sep 2024 19:28:20 GMT', 'Keep-Alive': 'timeout=20', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=16070400; includeSubDomains'}

I think this is most likely an issue with Simon's code, but I wanted to ask if you understand what this error is. If it's not immediately obvious, please feel free to ignore this message.

Best wishes, Yanjun

YanjunLiu2 commented 2 months ago

Hi Timo,

Since Simon hasn't replied to my issue, I tried another ICSD client: https://github.com/lrcfmd/ICSDClient. This one works correctly. However, the code fetches CIFs based on the collection code instead of the ICSD ID. Do you have the collection codes for the CIFs listed in the ICSD version of 3DSC? Thank you!

Best wishes, Yanjun

TimoSommer commented 1 month ago

Hey Yanjun,

the first error message looks to me like you haven't correctly set up your ICSD credentials. Have you double-checked them against the instructions in Simon's repo?

For the second question, it's very unfortunate that the ICSD has different collection codes and ICSD IDs, and I currently don't have access to the collection codes. Can you maybe just download all the CIFs and then check the ICSD ID in each downloaded file?

Best regards, Timo

YanjunLiu2 commented 1 month ago

Hi Timo,

I downloaded all the ICSD CIFs, and I can select the ones that are in the dataset. However, since I used a different script, I don't have the required .csv file. Could you describe what's in the .csv file? Or is it possible to share that csv file with me? Thank you!

Best wishes, Yanjun

TimoSommer commented 1 month ago

Hey Yanjun,

good work! I don't have access to the file currently, but it should be quite straightforward to see from the code which properties are needed. I'd recommend going through the code in the files _1_clean_cifs.py and _2_2_clean_ICSD.py and checking which properties are used. The function clean_ICSD() even has a list at the beginning of all the properties it needs to be defined in the csv. You can ignore the properties ending in _pymatgen, since those should be calculated automatically in the file _1_clean_cifs.py, but all properties starting with an _ are ICSD properties which you should find in each CIF. For these, just write a script that extracts them from the CIFs you downloaded and writes them to a csv file, and voilà.

From what I see right now, you should extract the following properties: ['_database_code_icsd', '_chemical_formula_sum', '_cell_measurement_temperature', '_diffrn_ambient_temperature', '_chemical_name_structure_type', '_exptl_crystal_density_diffrn', '_chemical_formula_weight', '_cell_length_a', '_cell_length_b', '_cell_length_c', '_cell_angle_alpha', '_cell_angle_beta', '_cell_angle_gamma', '_cell_volume', '_cell_formula_units_z', '_symmetry_space_group_name_H-M', '_space_group_IT_number', '_diffrn_ambient_pressure']

Additionally, there is one property called 'file_id', which should be an absolute path to each cif structure in the directory 3DSC/data/source/ICSD/raw/cifs/. You can see this at the beginning of the code of the function clean_cifs().
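A minimal sketch of such an extraction script is shown here, using only the Python standard library. It is an illustration under assumptions, not the code used for 3DSC: the names read_simple_tags, cifs_to_csv and WANTED_TAGS are hypothetical, the tag list is shortened, and only single-line `_tag value` entries are read (loops and multi-line values are skipped, and values such as 3.82(1) are kept as plain text). Tags are compared case-insensitively, since CIF data names are not case-sensitive.

```python
import csv
import re
from pathlib import Path

# Shortened example list; extend it with the full list of properties above.
WANTED_TAGS = [
    "_database_code_icsd",
    "_chemical_formula_sum",
    "_cell_length_a",
    "_cell_volume",
    "_space_group_IT_number",
]

# Matches simple single-line entries such as: _cell_length_a 3.82(1)
TAG_LINE = re.compile(r"^(_\S+)\s+(.+?)\s*$")


def read_simple_tags(cif_text):
    """Return {lower-cased tag: value} for single-line `_tag value` entries."""
    tags = {}
    for line in cif_text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            name, value = match.groups()
            tags[name.lower()] = value.strip("'\"")
    return tags


def cifs_to_csv(cif_dir, csv_path, wanted=WANTED_TAGS):
    """Write one CSV row per CIF; tags missing from a file stay empty."""
    columns = ["file_id"] + list(wanted)
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        for cif_path in sorted(Path(cif_dir).glob("*.cif")):
            tags = read_simple_tags(cif_path.read_text(errors="replace"))
            # file_id is the absolute path to the CIF, as described above.
            row = {"file_id": str(cif_path.resolve())}
            for tag in wanted:
                row[tag] = tags.get(tag.lower(), "")
            writer.writerow(row)
```

Calling cifs_to_csv("3DSC/data/source/ICSD/raw/cifs", "3DSC/data/source/ICSD/raw/0_all_data_ICSD.csv") would then write the table to the location mentioned earlier in this thread. A full CIF parser is the safer choice if the files contain multi-line values.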

From there, you can try to run the code and see whether any error about an unknown property comes up. I would recommend trying everything on a small sample of only 100 or so CIF files first to speed this up. That should be as easy as reducing the input csv to just the first 100 rows: the code looks up the paths in the csv and then reads in those CIFs, and I don't think it ever reads in all the CIFs in the directory.
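Cutting the csv down to a small sample can be sketched as follows. The function name write_csv_head is hypothetical, and the sketch assumes the first line of the csv is the header.

```python
import csv
from itertools import islice


def write_csv_head(src_csv, dst_csv, n_rows=100):
    """Copy the header and the first n_rows data rows of src_csv to dst_csv."""
    with open(src_csv, newline="") as src, open(dst_csv, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return 0  # empty input: nothing to copy
        writer.writerow(header)
        kept = 0
        for row in islice(reader, n_rows):
            writer.writerow(row)
            kept += 1
        return kept
```

One way to use it is to keep the full table under a different name and write the 100-row sample to 3DSC/data/source/ICSD/raw/0_all_data_ICSD.csv, so the pipeline picks up only the sample.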

Let me know how it goes!

Best regards, Timo

YanjunLiu2 commented 1 month ago

Hi Timo,

Thank you! This is very detailed. I checked some CIFs in the list, but not all of them have the complete set of important_cols. For example, in the attached CIF, for which the artificial doping will not even be applied, '_cell_measurement_temperature', '_diffrn_ambient_temperature', 'cif_pymatgen_path', '_chemical_formula_weight' and '_diffrn_ambient_pressure' are missing. Will this cause an error? Or can I just leave the related cells empty?

To be able to upload it, I converted it to .txt format.

Best wishes,

Yanjun

icsd_291700.txt

TimoSommer commented 1 month ago

Hey Yanjun,

the best way to treat missing entries is usually to set them to numpy.nan or pandas.NA. Pandas knows how to deal with these entries. Setting them to None is unusual. If you leave them unspecified when generating a pandas dataframe, pandas will automatically make them numpy.nan. You could also check what is done for these cases in Simon's code, but I bet that's what's happening.
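A small sketch of this behaviour, assuming pandas is installed. The two records and their values are made up for illustration and are not ICSD data.

```python
import io

import pandas as pd

# Two made-up records; the second one does not report a pressure.
records = [
    {"_database_code_icsd": 1, "_cell_volume": 173.2, "_diffrn_ambient_pressure": 101.3},
    {"_database_code_icsd": 2, "_cell_volume": 58.9},
]
df = pd.DataFrame(records)

# The key missing from the second record becomes NaN in that row.
missing = df["_diffrn_ambient_pressure"].isna().tolist()

# By default, to_csv writes NaN as an empty cell and read_csv reads an
# empty cell back as NaN, so the missing entry survives a round trip.
csv_text = df.to_csv(index=False)
round_trip = pd.read_csv(io.StringIO(csv_text))
still_missing = round_trip["_diffrn_ambient_pressure"].isna().tolist()
```

So leaving the cell empty in the csv and leaving the key out when building the dataframe end up as the same missing value once pandas has read the data.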

Btw, you should not set the cif_pymatgen_path yourself, it will be set in the file _1_clean_cifs.py on line 196. Sorry for mentioning this in my earlier comment by accident, I deleted it from the list there.

Best regards, Timo

YanjunLiu2 commented 1 month ago

1_all_data_ICSD_cifs_normalized.csv 0_all_data_ICSD.csv

Hey Timo,

The past three weeks were a bit crazy because of the March Meeting deadline, and now I finally have time to get back to generating the dataset 😂. I seem to have been able to extract 0_all_data_ICSD.csv and run _1_clean_cifs.py to get the CIFs in the cleaned folder and the new csv. I attached the two csv files I got. Could you take a look and see whether they look fine? One question: all the CIFs contain only 'space_group_name_H-M_alt' instead of '_symmetry_space_group_name_H-M'. Should I assign the 'space_group_name_H-M_alt' symbols from the CIFs to the '_symmetry_space_group_name_H-M' column in the csv?

Also, there seems to be another csv file needed, named ICSD_content_type.csv. Could you let me know what it is and how to generate it? Thank you!

Best regards, Yanjun

TimoSommer commented 3 weeks ago

Hey Yanjun,

the input files should have space_group_name_H-M_alt since this is the property in the ICSD. In the code, in file _2_2_clean_ICSD.py, line 81, this is renamed to _symmetry_space_group_name_H-M. This is because I had originally also tried the COD instead of the ICSD, and this was the name in the COD files for the same property, so I standardised the code to always use the COD name. So you don't have to do it yourself; it will be renamed automatically.

The ICSD_content_type.csv is another csv which contains information about whether each structure is experimental or theoretical. Unfortunately, right now I do not remember exactly where I got this from; somewhere in the ICSD, I think. This csv file has three columns: EXPERIMENTAL_INORGANIC, EXPERIMENTAL_METALORGANIC and THERORETICAL_STRUCTURES. Each column contains the file_id of each structure that is in this group (so, unintuitively, the three columns have different lengths). If a structure is in one of the first two groups it's experimental, otherwise theoretical. The code for this is in file _2_2_clean_ICSD.py, lines 152-158.

Also, in case you don't know where to get this file from, I would recommend you to just make a pseudo csv in which all structures are in the experimental groups. I think this information was not particularly relevant, it was just some more information I was hoping to maybe play with.
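Such a placeholder csv could be generated roughly as follows. This is a sketch under assumptions: the function name write_all_experimental is hypothetical, the column names are copied as written in this thread, and the ids are taken from the file_id column of 0_all_data_ICSD.csv. Both the exact column spelling and the expected id format should be checked against the lines of _2_2_clean_ICSD.py referenced above.

```python
import csv

# Column names as written in this thread; verify the exact spelling
# against _2_2_clean_ICSD.py before relying on them.
COLUMNS = [
    "EXPERIMENTAL_INORGANIC",
    "EXPERIMENTAL_METALORGANIC",
    "THERORETICAL_STRUCTURES",
]


def write_all_experimental(all_data_csv, out_csv, id_column="file_id"):
    """List every id from all_data_csv in the first experimental column."""
    with open(all_data_csv, newline="") as src:
        ids = [row[id_column] for row in csv.DictReader(src)]
    with open(out_csv, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(COLUMNS)
        for file_id in ids:
            # The other two groups stay empty, so nothing counts as theoretical.
            writer.writerow([file_id, "", ""])
    return len(ids)
```

Putting everything into EXPERIMENTAL_INORGANIC is an arbitrary choice. According to the description above, membership in either of the first two columns marks a structure as experimental.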

YanjunLiu2 commented 3 weeks ago

Hi Timo,

Thank you! I'm now able to run _2_2_clean_ICSD.py, and I tried to run python generate_3DSC.py -d ICSD -n 4. However, it seems that I'm missing the ICSD_subset.csv this time. Could you point me to where I should check?

Best wishes, Yanjun

TimoSommer commented 2 weeks ago

Hey Yanjun,

I think it's very difficult for me to debug this right now, since I don't have access to the files anymore. It would be very helpful if you could send me all the csv files that you have so far, plus a selection of 20 ICSD entries, 10 of which are mentioned in the file superconductors_3D/data/final/ICSD/3DSC_ICSD_only_IDs.csv and 10 of which are not. That way, I can debug this on my own and then write a full tutorial for you on how to do this.

Could you please send me these files to the email address you have from me?

TimoSommer commented 1 week ago

Hey Yanjun,

thank you very much for your help and the files you sent me. I have greatly simplified the installation and the run by making the generation of ML features optional. These features are not necessary for the dataset itself and were a major issue because they required large, difficult-to-install Python packages. However, the ML features will still be generated if the corresponding packages are installed.

I have also used the example data that you provided to showcase the structure of the input data and explained this in a little tutorial in the README. It should be pretty much plug & play now.

If you have any more questions, please let me know.