chao1224 / MoleculeSTM

Multi-modal Molecule Structure-text Model for Text-based Editing and Retrieval, Nat Mach Intell 2023 (https://www.nature.com/articles/s42256-023-00759-6)
https://chao1224.github.io/MoleculeSTM

AssertionError in step 1 in preprocessing #13

Closed: toooooodo closed this issue 6 months ago

toooooodo commented 6 months ago

Hi, thank you very much for sharing this great work.

While preprocessing the PubChemSTM dataset, I ran python step_01_description_extraction.py (step 1) and got an assertion error:

Traceback (most recent call last):
  File "step_01_description_extraction.py", line 162, in <module>
    assert description_data["TotalPages"] == total_page_num
AssertionError

I opened https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/json?heading_type=Compound&heading=Record+Description&page=0 and found that "TotalPages" is now 386. Does this affect any other part of the data preprocessing code, and is there any plan to provide up-to-date code for processing the data?

chao1224 commented 6 months ago

Hi @toooooodo,

The preprocessing code was last updated around May 2022, when TotalPages was 290. The PubChem team keeps maintaining the database and adding textual descriptions, so TotalPages will keep changing.

For your own use, please update this number to the latest value at each preprocessing step. The other parts should work fine.
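
As a rough illustration of the suggestion above, the page count could be read from the API response instead of being hard-coded. This is only a sketch, not the repository's code: the names URL_TEMPLATE, read_total_pages and fetch_total_pages are hypothetical, the default page index of 1 is an assumption, and the sketch assumes "TotalPages" sits either at the top level of the JSON or under an "Annotations" key.

```python
import json
import urllib.request

# URL pattern copied from the issue text; only the page number varies.
URL_TEMPLATE = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/json"
    "?heading_type=Compound&heading=Record+Description&page={page}"
)


def read_total_pages(payload):
    """Return the TotalPages value from a decoded JSON response.

    Assumption: the field is either at the top level or nested under
    an "Annotations" key; adjust if the real response differs.
    """
    body = payload.get("Annotations", payload)
    return int(body["TotalPages"])


def fetch_total_pages(page=1, timeout=30):
    """Download one page of annotations and report the current page count.

    The default page index of 1 is an assumption of this sketch.
    """
    url = URL_TEMPLATE.format(page=page)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return read_total_pages(json.load(resp))
```

Calling fetch_total_pages() once before the download loop and assigning the result to total_page_num would keep the assertion on line 162 in step with the live database, at the cost of one extra request.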

toooooodo commented 6 months ago

Thank you for your prompt reply 😁