Open Syzseisus opened 3 months ago
P.S. When I ran the code with that line commented out, the result was:
Total CID (with raw name) 242673
Total CID (with extracted name) 244717
Total CID 244889
Hi @Syzseisus, this is because we wrote the script in 2022. The PubChem group has been updating TotalPages since then, so this number should be larger now.
BTW, in the README we mentioned this:
python step_01_description_extraction.py. This step extracts and merge all the textual descriptions into a single json file. We run this on May 30th, 2022. The APIs will keep updating, so you may have slightly different versions if you run this script yourself.
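Since the page count changes over time, one option is to read TotalPages from the first response instead of hard-coding it. The sketch below is only an illustration and is not the repository's code: `fetch_page` is a hypothetical callable standing in for whatever request the script makes, and it is assumed to return a dictionary with a "TotalPages" key (as in the assertion discussed in this thread). The "Page" key in the fake data is also an assumption.

```python
from typing import Callable, Dict, Iterator


def iter_description_pages(fetch_page: Callable[[int], Dict]) -> Iterator[Dict]:
    """Yield every page, taking the page count from the first response."""
    first_page = fetch_page(1)
    total_pages = first_page["TotalPages"]
    yield first_page
    for page_index in range(2, total_pages + 1):
        page = fetch_page(page_index)
        # Stop if the reported page count changes in the middle of a crawl,
        # since the pages would then come from two different snapshots.
        if page["TotalPages"] != total_pages:
            raise RuntimeError(
                f"TotalPages changed from {total_pages} to {page['TotalPages']}"
            )
        yield page


def fake_fetch(page_index: int) -> Dict:
    """Hypothetical stand-in for the real request; returns made-up data."""
    return {"Page": page_index, "TotalPages": 3}


pages = list(iter_description_pages(fake_fetch))
print(len(pages))
```

With this approach the script follows whatever the API reports on the day it runs, so it may be worth recording the retrieval date and page count next to the output files.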
Thank you for your quick response!
Hello again. As I mentioned, TotalPages is 422 as of October 31, 2024. However, this does not seem to be a problem that can be solved by simply changing total_page_num in the file ./preprocessing/PubChemstep_01_description_extraction.py to 422. As total_page_num increases, the specific cases handled in the clean_up_description function might also need to be updated.
However, I know that this update will require a lot of expert time.
Therefore, I would like to ask you to provide the "CID2name_raw.json", "CID2name.json", "CID2text_raw.json", "CID2text.json", "CID2SMILES.csv", "molecules.sdf" files that have been pre-processed on May 30, 2022.
Alternatively, I would like to ask you to provide the "281K chemical structure and text pairs" themselves, as mentioned in the "Results" section on page 3 of the paper, among the many if-cases below line 243 in the file ./scripts/pretrain.py.
Thank you again for your hard work and wonderful research.
Sincerely, Syzseisus
Hi @Syzseisus,
The files CID2text_raw.json and CID2text.json cannot be released due to PubChem's policy. The specific cases could be different, but at least the special cases discussed in the paper can still be handled using these lines of scripts.
Thank you so much for the incredibly quick response.
I’m reopening an issue, even though the question might seem minor, because I am trying to reproduce the results from your paper.
I suspect that because the clean_up_description function has not been updated to handle the additional data, performance could drop despite the increase in data.
Given the current situation, what preprocessing steps would you recommend to ensure I get results closer to those in the original paper?
Hi @Syzseisus,
Since the checkpoints have been reproduced, you should be able to reproduce the results on downstream tasks.
I’m aware that you’ve provided a checkpoint for the pretrained model.
However, for my research, I’m looking to reproduce the pretraining process itself.
Thank you for your help.
Dear authors,
Thanks for the exciting work. While working with the code for preprocessing the PubChem dataset, I came across a specific line in a file that I find confusing. Could you please clarify its purpose?
File: ./preprocessing/PubChemstep_01_description_extraction.py
Line number: 160
Code on that line: assert description_data["TotalPages"] == total_page_num
I've observed that the variable total_page_num is set to 290, but the execution result shows description_data["TotalPages"] as 422. When I commented out this line, the code ran without any issues. I'm not sure why this line is necessary or how it fits into the overall functionality of the script. Understanding its purpose would help me a lot in my current work and in contributing more effectively to the project.
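For reference, an assertion of this kind usually acts as a consistency guard: it stops the run when the API reports a different number of pages than the script expects. Below is a minimal sketch of a softer variant that warns instead of aborting. It assumes description_data is a plain dictionary, and the helper name `check_total_pages` is hypothetical and not part of the repository.

```python
import warnings


def check_total_pages(description_data: dict, expected_total: int) -> int:
    """Warn (rather than fail) on a page-count mismatch; return the actual count."""
    actual_total = description_data["TotalPages"]
    if actual_total != expected_total:
        warnings.warn(
            f"TotalPages is {actual_total}, expected {expected_total}; "
            "the upstream data has probably been updated."
        )
    return actual_total


# Values taken from this report: the script expects 290, the API returns 422.
total_page_num = check_total_pages({"TotalPages": 422}, expected_total=290)
```

Returning the actual count lets the rest of the script loop over all pages that exist today, while the warning still records that the data differs from the original snapshot.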
Thank you for your assistance!
Best regards,
Syzseisus