AI-HPC-Research-Team / GIT-Mol

A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text
MIT License

Missing image2d - molecule image data for the pretraining stage #1

Open 545487677 opened 5 months ago

545487677 commented 5 months ago

Hi, thank you for sharing such great work! However, I can't find the image2d file in the dataset. Can you tell me how I can get the dataset? Thank you!!

FZU-LW commented 3 months ago

@AI-HPC-Research-Team I would also like to know how to get the pretraining data. Could you provide it?

Pengfei-Liu-SYSU commented 2 months ago

I apologize that I am unable to upload the pretraining image data to the platform because of its size. However, you can download it from PubChem with the following steps:

  1. Open the Biological Test Results classification at https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72 and download the compound CIDs.
  2. Divide these CIDs into ten CSV files, each containing 500,000 entries.
  3. Upload these CSV files at https://pubchem.ncbi.nlm.nih.gov/ under 'Upload ID List' to download the corresponding images.
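
Step 2 can be scripted. The following is a minimal sketch, not the authors' own tooling: it assumes the downloaded CIDs are already in a Python list and that a CSV with one CID per line and no header is acceptable for 'Upload ID List'. The function names `chunk_cids` and `write_cid_chunks` and the `cids_part` file prefix are hypothetical.

```python
import csv
from pathlib import Path


def chunk_cids(cids, chunk_size=500_000):
    """Split a list of CIDs into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [cids[i:i + chunk_size] for i in range(0, len(cids), chunk_size)]


def write_cid_chunks(cids, out_dir, chunk_size=500_000, prefix="cids_part"):
    """Write each chunk to its own CSV file (one CID per row, no header).

    The one-column, header-free layout is an assumption about what the
    PubChem upload form accepts; adjust it if the upload is rejected.
    Returns the list of file paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for idx, chunk in enumerate(chunk_cids(cids, chunk_size), start=1):
        path = out_dir / f"{prefix}_{idx:02d}.csv"
        with path.open("w", newline="") as fh:
            writer = csv.writer(fh)
            for cid in chunk:
                writer.writerow([cid])
        paths.append(path)
    return paths
```

For example, `write_cid_chunks(all_cids, "cid_chunks")` would write `cids_part_01.csv`, `cids_part_02.csv`, and so on into a `cid_chunks` directory, where `all_cids` is whatever list you built from the PubChem download. With about five million CIDs and the default chunk size this gives the ten files mentioned in step 2.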

For the pretraining text data, the Mol-Instructions dataset available at https://huggingface.co/datasets/zjunlp/Mol-Instructions is comparable to ours, but more comprehensive and better standardized.