Missing image2d - Data of molecule images in the pretrain stage

545487677 commented 5 months ago

Hi, thank you for sharing such great work! However, I can't find the image2d file in the dataset. Could you tell me how I can get it? Thank you!

FZU-LW commented 3 months ago

@AI-HPC-Research-Team I would also like to know how to get the pretrain data. Could you provide these data?

Pengfei-Liu-SYSU commented 2 months ago

I apologize for not being able to upload the extensive pretrain image data to the platform. However, I can guide you through a simpler download process from PubChem:

1. Open the Biological Test Results classification at https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72 and download the compound CIDs.
2. Divide these CIDs into ten CSV files, each containing 500,000 entries.
3. Upload these CSV files at https://pubchem.ncbi.nlm.nih.gov/ under 'Upload ID List' to download the corresponding images.
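
The splitting step above could be scripted roughly as follows. This is only a minimal sketch, not the script used for GIT-Mol: the function name `split_cids`, the `cids_part_NN.csv` file naming, and the one-CID-per-line layout with no header are all assumptions, so check them against what the 'Upload ID List' form accepts before relying on it.

```python
from pathlib import Path


def split_cids(cids, out_dir, chunk_size=500_000, prefix="cids_part"):
    """Write `cids` into consecutive files of at most `chunk_size` IDs each.

    Hypothetical helper: one CID per line, no header row (assumed format).
    Returns the list of written file paths in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(cids), chunk_size):
        chunk = cids[start:start + chunk_size]
        # Files are numbered from 01 upward: cids_part_01.csv, cids_part_02.csv, ...
        path = out_dir / f"{prefix}_{start // chunk_size + 1:02d}.csv"
        with path.open("w") as f:
            for cid in chunk:
                f.write(f"{cid}\n")
        paths.append(path)
    return paths
```

For example, `split_cids(all_cids, "cid_chunks")` with a list of 5,000,000 CIDs would produce the ten files described in step 2; a shorter list simply produces fewer files, with the last one holding the remainder.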

For the pretrain text data, the Mol-Instructions dataset at https://huggingface.co/datasets/zjunlp/Mol-Instructions is a more comprehensive alternative: it is similar in origin to our data but more standardized.

AI-HPC-Research-Team / GIT-Mol #1