Closed: recherHE closed this issue 3 years ago.
That is a large number of compounds, and a Jupyter notebook won't be an appropriate entry point for a computation at this scale. The example we use here relies on Apache Spark; you could adapt the code and execute it as a console command with the help of PySpark. It would be something like:
spark-submit --driver-memory=32g --conf spark.driver.maxResultSize=16g input.py config.json
where input.py is the entry point to your script and config.json is the configuration file with the instructions that the input.py program expects.
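A minimal sketch of what such an input.py could look like. Note the config keys (`input_smiles`, `output_dir`) are hypothetical, not the actual Reinvent schema, and the Spark calls (shown in comments, since they need a running Spark installation) follow the standard PySpark API:

```python
import json
import sys


def load_config(path):
    """Load the JSON config file that input.py expects (keys are hypothetical)."""
    with open(path) as f:
        return json.load(f)


def main(config_path):
    cfg = load_config(config_path)
    # Hypothetical keys; the real config schema may differ.
    input_path = cfg["input_smiles"]
    output_dir = cfg["output_dir"]
    # The Spark work would go here, roughly (requires pyspark):
    #   from pyspark.sql import SparkSession
    #   spark = SparkSession.builder.appName("smiles-prep").getOrCreate()
    #   df = spark.read.text(input_path)
    #   ... standardisation / filtering steps ...
    #   df.write.text(output_dir)
    return input_path, output_dir


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

The `--driver-memory` and `spark.driver.maxResultSize` flags in the command above raise the driver's heap and result-size limits, which is usually the first thing to tune when the driver runs out of memory.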
We are also preparing a repo, to be released by the end of the month, with some relevant code for processing large chunks of data. The new release of Reinvent is being prepared here as well, though the correct instructions are yet to be provided and the input configuration has changed a little. I'm planning to add some examples/updates in the following days.
I am trying to use the data preparation demo on a SMILES file with 23 million compounds, but it fails with a memory error: java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there another way to pre-process the data, or a way to fix this error? Thank you in advance for your help!
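One generic way to keep memory bounded when pre-processing a file this large, independent of Spark, is to stream it in fixed-size chunks instead of loading all 23 million lines at once. A minimal sketch (the file path and chunk size are placeholders):

```python
from itertools import islice


def iter_smiles_chunks(path, chunk_size=100_000):
    """Yield lists of at most chunk_size SMILES lines, reading lazily
    so peak memory stays proportional to chunk_size, not file size."""
    with open(path) as f:
        while True:
            chunk = [line.rstrip("\n") for line in islice(f, chunk_size)]
            if not chunk:
                break
            yield chunk


# Each chunk can be standardised and written out before the next one
# is read, so the whole file never has to fit in memory at once.
```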