aimat-lab / 3DSC

Repo for the paper publishing the superconductor database with 3D crystal structures.

Details about generating the ICSD version #2

Open YanjunLiu2 opened 1 week ago

YanjunLiu2 commented 1 week ago

Hi,

I got a license to access the ICSD API, but it's a bit unclear to me what I should do to generate the 3DSC_ICSD dataset. Should I download all the cifs myself and put them into a folder, or is the download process already built in? Sorry, I'm not very good at reading code. Thank you!

Best wishes,

Yanjun

TimoSommer commented 1 week ago

Hey Yanjun,

Yes, the first step is to download all the cifs into a directory. You can do this using the excellent repo https://github.com/simonverret/materials_data_api_scripts, or you can use the copy of that code bundled in the 3DSC repo under 3DSC/superconductors_3D/dataset_preparation/dataset_download/materials_data_api_scripts-master. I remember making some small changes to the code, so I recommend trying the copy in my repo first; if you get stuck, check out Simon's original repo.

This code should then download all the cif files into a directory, and also give you a .csv file with information about all the downloaded cifs. You should then put the cifs under 3DSC/data/source/ICSD/raw/cifs/ and the .csv under 3DSC/data/source/ICSD/raw/0_all_data_ICSD.csv. You can then run the script generate_3DSC.py.
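Before running generate_3DSC.py, it can help to confirm that the files are where the paths above say they should be. The following is a minimal sketch, not part of the 3DSC code: the function name check_icsd_inputs is hypothetical, repo_root is assumed to be the path to your local 3DSC checkout, and the cifs are assumed to carry a .cif extension.

```python
from pathlib import Path


def check_icsd_inputs(repo_root):
    """Return a list of problems with the raw ICSD inputs (empty if none)."""
    # Paths taken from the comment above, relative to the 3DSC checkout.
    raw_dir = Path(repo_root) / "data" / "source" / "ICSD" / "raw"
    cif_dir = raw_dir / "cifs"
    csv_file = raw_dir / "0_all_data_ICSD.csv"

    problems = []
    if not cif_dir.is_dir():
        problems.append(f"missing directory: {cif_dir}")
    elif not any(cif_dir.glob("*.cif")):
        # Assumption: downloaded structures use the .cif extension.
        problems.append(f"no .cif files in: {cif_dir}")
    if not csv_file.is_file():
        problems.append(f"missing file: {csv_file}")
    return problems
```

Calling check_icsd_inputs("3DSC") from the directory that contains the checkout returns an empty list when both the cif directory and the .csv are in place.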

If you run into any issues, I recommend running this script in a Python debugger rather than from the command line. That way, you can step through the code line by line and see exactly how it works. That's what I usually do when I want to understand how unfamiliar code works. Both PyCharm and Spyder have good debuggers.

Let me know how it works!

Best regards, Timo

YanjunLiu2 commented 1 week ago

Hi Timo,

Thank you! I'm now trying to download the cifs. I saw that download.py is missing from your icsd folder, so I copied the one from Simon's repo into the folder and ran it, but unfortunately I got:

    (ICSD) yanjunliu@dhcp-vl2041-23489 materials_datasets % python icsd/download.py
    Traceback (most recent call last):
      File "/Users/yanjunliu/Documents/Read_database/materials_data_api_scripts-master/materials_datasets/icsd/download.py", line 153, in <module>
        download_all(usrname, passwrd, ICSD_PKL)
      File "/Users/yanjunliu/Documents/Read_database/materials_data_api_scripts-master/materials_datasets/icsd/download.py", line 131, in download_all
        with ICSD_Session(loginid, password) as icsd:
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/yanjunliu/Documents/Read_database/materials_data_api_scripts-master/materials_datasets/icsd/download.py", line 25, in __init__
        self.login_token = self.login() # sets self.login_token
                           ^^^^^^^^^^^^
      File "/Users/yanjunliu/Documents/Read_database/materials_data_api_scripts-master/materials_datasets/icsd/download.py", line 52, in login
        raise ConnectionError(login_response.headers)
    ConnectionError: {'Set-Cookie': 'ICSDCHECK=1727465301266; Path=/, JSESSIONID=7D1BF39A51CB269160AE2159789F9B26; Path=/; HttpOnly, FIZ-Cookie=238830221.16671.0000; path=/; Httponly; Secure', 'Content-Language': 'en', 'Content-Type': 'text/plain', 'Content-Length': '104', 'Date': 'Fri, 27 Sep 2024 19:28:20 GMT', 'Keep-Alive': 'timeout=20', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=16070400; includeSubDomains'}

I think this is most likely an issue with Simon's code, but I wanted to ask if you understand what this error is. If it's not immediately obvious, please feel free to ignore this message.

Best wishes, Yanjun

YanjunLiu2 commented 1 week ago

Hi Timo,

Since Simon hasn't replied to my issue, I tried another ICSD client: https://github.com/lrcfmd/ICSDClient. This one works correctly. However, it fetches cifs by collection code instead of by ICSD ID. Do you have the collection codes for the cifs listed in the ICSD version of 3DSC? Thank you!

Best wishes, Yanjun

TimoSommer commented 6 days ago

Hey Yanjun,

the first error message looks to me like you haven't set up your ICSD credentials correctly. Have you double-checked them against the instructions in Simon's repo?

For the second question, it's very unfortunate that the ICSD uses collection codes that differ from the ICSD IDs, and I currently don't have access to the collection codes. Could you maybe just download all of them and then check the ICSD ID in each downloaded cif?
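A minimal sketch of that check, using only the Python standard library. It assumes each cif stores its ID on a single line of the form `_database_code_ICSD <id>` and that the files end in .cif; the function names read_icsd_id and select_cifs are hypothetical, and wanted_ids is assumed to be a set of ID strings taken from the 3DSC table.

```python
import re
from pathlib import Path

# CIF data names are case-insensitive, so match the tag in any casing.
ICSD_ID_PATTERN = re.compile(
    r"^_database_code_icsd\s+(\S+)", re.IGNORECASE | re.MULTILINE
)


def read_icsd_id(cif_text):
    """Return the value of the _database_code_ICSD tag, or None if absent."""
    match = ICSD_ID_PATTERN.search(cif_text)
    return match.group(1) if match else None


def select_cifs(cif_dir, wanted_ids):
    """Map each wanted ICSD ID (as a string) to the cif file that declares it."""
    found = {}
    for path in sorted(Path(cif_dir).glob("*.cif")):
        icsd_id = read_icsd_id(path.read_text(errors="replace"))
        if icsd_id in wanted_ids:
            found[icsd_id] = path
    return found
```

IDs are compared as strings here, so the IDs from the 3DSC table would need to be converted to strings before being passed in.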

Best regards, Timo

YanjunLiu2 commented 5 days ago

Hi Timo,

I downloaded all the ICSD cifs, and I can select the ones that are in the dataset. However, since I used a different script, I don't have the required .csv file. Could you describe what's in the .csv file, or would it be possible to share it with me? Thank you!

Best wishes, Yanjun

TimoSommer commented 4 days ago

Hey Yanjun,

good work! I don't have access to the file currently, but it should be quite straightforward to see from the code which properties are needed. I'd recommend going through the files _1_clean_cifs.py and _2_2_clean_ICSD.py and checking which properties are used. The function clean_ICSD() even has a list at the beginning of all the properties it expects in the csv. You can ignore the properties ending in _pymatgen, since those should be calculated automatically in _1_clean_cifs.py, but all properties starting with an _ are ICSD properties that you should find in each cif. For these, just write a script that extracts them from the cifs you downloaded and writes them to a csv file, and voila.

From what I see right now, you should extract the following properties: ['_database_code_icsd', '_chemical_formula_sum', '_cell_measurement_temperature', '_diffrn_ambient_temperature', '_chemical_name_structure_type', '_exptl_crystal_density_diffrn', '_chemical_formula_weight', '_cell_length_a', '_cell_length_b', '_cell_length_c', '_cell_angle_alpha', '_cell_angle_beta', '_cell_angle_gamma', '_cell_volume', '_cell_formula_units_z', '_symmetry_space_group_name_H-M', '_space_group_IT_number', '_diffrn_ambient_pressure']

Additionally, there is one property called 'file_id', which should be an absolute path to each cif structure in the directory 3DSC/data/source/ICSD/raw/cifs/. You can see this at the beginning of the code of the function clean_cifs().
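Such an extraction script could look roughly like the sketch below. It uses only the standard library and makes several assumptions: each wanted tag sits on one line followed by its value (multi-line text fields and loop_ blocks are not handled), values are copied verbatim apart from surrounding quotes, the cifs end in .cif, and the csv column names are exactly the tag strings from the list above. The function names are hypothetical; check clean_ICSD() and clean_cifs() for what the code really expects.

```python
import csv
from pathlib import Path

# Tags copied from the list above; CIF tag names are case-insensitive.
ICSD_TAGS = [
    "_database_code_icsd", "_chemical_formula_sum",
    "_cell_measurement_temperature", "_diffrn_ambient_temperature",
    "_chemical_name_structure_type", "_exptl_crystal_density_diffrn",
    "_chemical_formula_weight", "_cell_length_a", "_cell_length_b",
    "_cell_length_c", "_cell_angle_alpha", "_cell_angle_beta",
    "_cell_angle_gamma", "_cell_volume", "_cell_formula_units_z",
    "_symmetry_space_group_name_H-M", "_space_group_IT_number",
    "_diffrn_ambient_pressure",
]


def parse_cif_tags(cif_text, tags=ICSD_TAGS):
    """Collect single-line 'tag value' pairs for the wanted tags."""
    wanted = {tag.lower(): tag for tag in tags}
    values = {}
    for line in cif_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].lower() in wanted:
            # Strip surrounding quotes, e.g. 'Ba2 Cu3 O7 Y1'.
            values[wanted[parts[0].lower()]] = parts[1].strip().strip("'\"")
    return values


def write_icsd_csv(cif_dir, csv_path, tags=ICSD_TAGS):
    """Write one csv row per cif; tags missing from a cif become empty cells."""
    fieldnames = ["file_id"] + list(tags)
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        for path in sorted(Path(cif_dir).glob("*.cif")):
            row = parse_cif_tags(path.read_text(errors="replace"), tags)
            row["file_id"] = str(path.resolve())  # absolute path to the cif
            writer.writerow(row)
```

Numeric values are not converted, so an entry such as 3.82(1) keeps its uncertainty in parentheses; whether the cleaning code accepts that form is something to verify on a small sample.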

From there, you can try to execute the code and see whether any error comes up about an unknown property. I recommend trying everything on a small sample of only 100 or so cif files first to speed up this process. That should be as easy as reducing the input csv to just the first 100 rows: as far as I know, the code looks up the paths in the csv and then reads in those cifs, and I don't think it ever reads in all the cifs in the directory.
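Cutting the csv down to a small sample could be done along these lines. This is a sketch with a hypothetical function name, assuming the csv has a single header row.

```python
import csv
from itertools import islice


def write_sample_csv(full_csv, sample_csv, n_rows=100):
    """Copy the header and the first n_rows data rows to a new csv file."""
    with open(full_csv, newline="") as src, \
            open(sample_csv, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # header row
        writer.writerows(islice(reader, n_rows))
```

Writing the sample to a separate file and keeping the full csv under another name makes it easy to switch back once the small run goes through.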

Let me know how it goes!

Best regards, Timo

YanjunLiu2 commented 3 days ago

Hi Timo,

Thank you! This is very detailed. I checked some of the cifs in the list, but evidently not all of them have the complete set of important_cols. For example, in the attached cif, to which artificial doping would not even be applied, the properties '_cell_measurement_temperature', '_diffrn_ambient_temperature', 'cif_pymatgen_path', '_chemical_formula_weight' and '_diffrn_ambient_pressure' are missing. Will this cause an error, or can I just leave the corresponding cells empty?

To be able to upload the file, I changed its extension to .txt.

Best wishes,

Yanjun

icsd_291700.txt

TimoSommer commented 3 days ago

Hey Yanjun,

the best way to treat missing entries is usually to set them to numpy.nan or pandas.NA. Pandas knows how to deal with these entries. Setting them to None is not the usual practice. If you leave them unspecified when generating a pandas dataframe, pandas will automatically make them numpy.nan. You could also check how Simon's code handles these cases, but I bet that's what's happening.
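A small illustration of that behaviour, assuming pandas is installed. The two records and their ID and temperature values are made up for the example.

```python
import pandas as pd

# Hypothetical records: the second cif has no measurement temperature.
records = [
    {"_database_code_icsd": 1, "_cell_measurement_temperature": 293.0},
    {"_database_code_icsd": 2},
]

df = pd.DataFrame.from_records(records)

# The key that was left out is filled with NaN automatically.
missing = df["_cell_measurement_temperature"].isna().tolist()
print(missing)

# With default settings, NaN is written to csv as an empty cell.
csv_text = df.to_csv(index=False)
print(csv_text)
```

So leaving a cell empty in the csv and leaving a key out when building the dataframe amount to the same thing once pandas has the data.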

Btw, you should not set the cif_pymatgen_path yourself, it will be set in the file _1_clean_cifs.py on line 196. Sorry for mentioning this in my earlier comment by accident, I deleted it from the list there.

Best regards, Timo