Clarification Needed in Preprocessing PubChem dataset

chao1224 / MoleculeSTM

Multi-modal Molecule Structure-text Model for Text-based Editing and Retrieval, Nat Mach Intell 2023 (https://www.nature.com/articles/s42256-023-00759-6)

https://chao1224.github.io/MoleculeSTM


Clarification Needed in Preprocessing PubChem dataset #29

Open Syzseisus opened 3 months ago

Syzseisus commented 3 months ago

Dear authors,

Thanks for the exciting work. While working with the PubChem preprocessing code, I came across a line that I find confusing. Could you please clarify its purpose?

File: ./preprocessing/PubChemstep_01_description_extraction.py
Line number: 160
Code on that line: assert description_data["TotalPages"] == total_page_num

I've observed that the variable total_page_num is set to 290, but the execution result shows description_data["TotalPages"] as 422. When I commented out this line, the code ran without any issues.

I'm not sure why this line is necessary and how it fits into the overall functionality of the script. Understanding its purpose would help me a lot in my current work and in contributing more effectively to the project.

Thank you for your assistance!

Best regards,

Syzseisus

Syzseisus commented 3 months ago

P.S. When I ran the code with that line commented out, the output was:

Total CID (with raw name) 242673
Total CID (with extracted name) 244717
Total CID 244889

chao1224 commented 3 months ago

Hi @Syzseisus, this is because we wrote the script in 2022. The PubChem group has kept updating the data since then, so TotalPages is larger now.
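As an illustration only, here is a minimal sketch of how a script could read the page count from the first response instead of asserting a hard-coded value. It assumes each page is a dict carrying a "TotalPages" key, as in the assert above. The names iter_description_pages, fetch_page and expected_total are hypothetical and are not part of the repository.

```python
import warnings
from typing import Callable, Dict, Iterator, Optional


def iter_description_pages(
    fetch_page: Callable[[int], Dict],
    expected_total: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield every description page, taking the page count from page 1.

    fetch_page is a hypothetical callable that returns the parsed data for
    one page number; it is assumed to include a "TotalPages" entry.
    """
    first = fetch_page(1)
    total = first["TotalPages"]

    # Warn instead of failing when the count differs from an older snapshot.
    if expected_total is not None and total != expected_total:
        warnings.warn(
            f"TotalPages is {total}, expected {expected_total}; "
            "the upstream data has probably changed since the snapshot."
        )

    yield first
    for page in range(2, total + 1):
        yield fetch_page(page)
```

With this shape, a changed page count produces a warning rather than an AssertionError, and the loop still covers every page that currently exists.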

BTW. In README, we mentioned this:

python step_01_description_extraction.py. This step extracts and merge all the textual descriptions into a single json file. We run this on May 30th, 2022. The APIs will keep updating, so you may have slightly different versions if you run this script yourself.

Syzseisus commented 3 months ago

Thank you for your quick response!

Syzseisus commented 4 days ago

Hello again. As I mentioned, TotalPages appears to be 422 as of October 31, 2024. However, this does not seem to be a problem that can be solved by simply changing total_page_num to 422 in ./preprocessing/PubChemstep_01_description_extraction.py. As the number of pages grows, the special cases handled in the clean_up_description function may also need to be updated.

However, I know that such an update would require a lot of expert time. Therefore, I would like to ask whether you could provide the files "CID2name_raw.json", "CID2name.json", "CID2text_raw.json", "CID2text.json", "CID2SMILES.csv", and "molecules.sdf" as pre-processed on May 30, 2022. Alternatively, could you provide the "281K chemical structure and text pairs" mentioned in the "Results" section on page 3 of the paper (among the many if-cases below line 243 in ./scripts/pretrain.py)?

Thank you again for your hard work and wonderful research.

Sincerely, Syzseisus

chao1224 commented 4 days ago

Hi @Syzseisus,

Four of the six files you mentioned have already been uploaded to this HuggingFace link.
The other two files (CID2text_raw.json and CID2text.json) cannot be released because of PubChem's policy.

The specific cases may differ now, but at least the special cases discussed in the paper can still be handled using these lines of the script.
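One possible way to stay close to the 2022 snapshot, sketched here under assumptions, is to keep only the freshly extracted descriptions whose CID also appears in the released files. The sketch assumes both sides can be loaded as collections keyed by CID; the function name restrict_to_snapshot and its arguments are hypothetical. The description texts themselves may have changed since 2022, so this only approximates the original data.

```python
from typing import Dict, Iterable, List, Tuple


def restrict_to_snapshot(
    fresh_cid2text: Dict[str, list],
    snapshot_cids: Iterable,
) -> Tuple[Dict[str, list], List[str], List[str]]:
    """Filter freshly extracted descriptions down to a known set of CIDs.

    Returns (kept, added, missing):
      kept    - entries whose CID is in the snapshot
      added   - CIDs present now but absent from the snapshot
      missing - snapshot CIDs with no description in the fresh extraction
    """
    # Normalise to strings, since JSON keys are strings but CSV values may be ints.
    keep = {str(cid) for cid in snapshot_cids}
    fresh_ids = {str(cid) for cid in fresh_cid2text}

    kept = {
        str(cid): text
        for cid, text in fresh_cid2text.items()
        if str(cid) in keep
    }
    added = sorted(fresh_ids - keep)
    missing = sorted(keep - fresh_ids)
    return kept, added, missing
```

The added list also gives a concrete set of new entries to inspect by hand if the clean-up rules need to be checked against newer data.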

Syzseisus commented 4 days ago

Thank you so much for the incredibly quick response.

I’m reopening this issue, even though the question might seem minor, because I am trying to reproduce the results from your paper.

I suspect that, because the clean_up_description function has not been updated to handle the additional data, performance could drop despite the increase in data.

Given the current situation, what preprocessing steps would you recommend to ensure I get results closer to those in the original paper?

chao1224 commented 4 days ago

Hi @Syzseisus,

Since the checkpoints have been reproduced, you should be able to reproduce the results on downstream tasks.

Syzseisus commented 4 days ago

I’m aware that you’ve provided a checkpoint for the pretrained model.

However, for my research, I’m looking to reproduce the pretraining process itself.

Thank you for your help.