MolecularAI / ReinventCommunity

MIT License

How to use the data preparation demo more efficiently #4

Closed recherHE closed 3 years ago

recherHE commented 3 years ago

I am trying to use the data preparation demo on a 23 million compound .smi file, but it fails with an out-of-memory error: java.lang.OutOfMemoryError: GC overhead limit exceeded

Is there another way to pre-process the data, or how can I solve this problem? Thank you in advance for your help!

patronov commented 3 years ago

That is a large number of compounds, and a Jupyter notebook is not the appropriate entry point for a computation of this scale. The example here relies on Apache Spark, so you could adapt the code and run it from the console with PySpark. It would be something like:

spark-submit --driver-memory=32g --conf spark.driver.maxResultSize=16g input.py config.json

where input.py is the entry point of your script and config.json is the configuration file with the instructions that input.py expects.
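As a rough sketch of what such an entry point could look like (the file layout, the config keys `input_smiles_file` and `output_path`, and the pass-through processing step are all assumptions for illustration, not the actual ReinventCommunity code):

```python
#!/usr/bin/env python
"""Hypothetical input.py: a PySpark entry point for large-scale SMILES preparation."""
import json
import sys


def load_config(path):
    # Read the JSON config that spark-submit passes as the second argument.
    with open(path) as f:
        return json.load(f)


def main(config_path):
    config = load_config(config_path)
    # Import Spark lazily so the config helper above works without it installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("smiles-preparation")
             .getOrCreate())

    # One SMILES per line; Spark partitions the 23M rows across executors
    # instead of loading everything into a single process.
    smiles = spark.read.text(config["input_smiles_file"])

    # Adapt the notebook's filtering/standardization logic here, e.g. a
    # mapPartitions call that runs RDKit canonicalization per partition.
    smiles.write.mode("overwrite").text(config["output_path"])
    spark.stop()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

You would launch this with the spark-submit command above. Raising --driver-memory and spark.driver.maxResultSize lifts the memory ceiling, but the real win is that Spark streams partitions through executors rather than holding all 23 million molecules in one JVM heap, which is what triggers the GC overhead error in the notebook.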

patronov commented 3 years ago

We are also preparing a repo, to be released by the end of the month, with relevant code for processing large chunks of data. The new release of Reinvent is being prepared here as well; the correct instructions are yet to be provided, and the input configuration has changed a little. I'm planning to add some examples/updates in the coming days.