Closed by toooooodo 6 months ago
Hi @toooooodo,
The preprocessing code was current as of around May 2022, when TotalPages was 290. The PubChem team keeps maintaining the database and adding more textual descriptions, so TotalPages is certain to change over time.
For your own use, please update this value to the latest number at each preprocessing step. The other parts should work just fine.
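One way to avoid editing the number by hand is to read TotalPages from the API response before looping over pages. The following is only a minimal sketch, not the repository's code: the helper names (`build_page_url`, `extract_total_pages`, `fetch_total_pages`) are hypothetical, and it assumes the JSON response nests the count under `Annotations` → `TotalPages`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the URL quoted in this thread.
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/json"


def build_page_url(page: int) -> str:
    """Build the 'Record Description' annotations URL for one page."""
    query = urlencode({
        "heading_type": "Compound",
        "heading": "Record Description",  # urlencode turns the space into '+'
        "page": page,
    })
    return f"{BASE_URL}?{query}"


def extract_total_pages(payload: dict) -> int:
    """Read the page count from a decoded response.

    Assumption: the value sits under Annotations -> TotalPages.
    """
    return int(payload["Annotations"]["TotalPages"])


def fetch_total_pages(timeout: float = 30.0) -> int:
    """Query one page and return the current page count (needs network access)."""
    # page=0 mirrors the URL quoted in this thread.
    with urlopen(build_page_url(0), timeout=timeout) as response:
        payload = json.load(response)
    return extract_total_pages(payload)
```

Calling `fetch_total_pages()` once at the start of the script and using the result in place of the hardcoded value should keep the page loop in step with the database, provided the response layout matches the assumption above.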
Thank you for your prompt reply 😁
Hi, thank you very much for sharing this great work.
When I run
python step_01_description_extraction.py
during step 1 of preprocessing the PubChemSTM dataset, an assertion error occurs. I opened
https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/json?heading_type=Compound&heading=Record+Description&page=0
and found that "TotalPages" is now 386. Does this affect any other parts of the data preprocessing code, and is there any plan to provide up-to-date code for processing the data?