I tried to process a data set of 1.4M molecules with a small Pipeline looking like this:
This leads to RAM issues because Molpipeline tries to hold the RDKit data structures for all 1.4M molecules in RAM at the same time. This happens because Molpipeline splits the pipeline elements into syncing and non-syncing parts for instance-based processing.
In the constructor of MolToDescriptorPipelineElement, _requires_fitting is set to True whenever the standardizer is not None:
    if self._standardizer is not None:
        self._requires_fitting = True
The RAM issues can be avoided by doing this:
It would be better if the standardization were handled in a way that does not lead to RAM issues.
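To illustrate why a fitting requirement can pull the whole data set into memory, here is a minimal toy model in plain Python. It does not use Molpipeline or RDKit, and it is not how Molpipeline is implemented: the names Element, run_pipeline and peak_buffered are hypothetical, integers stand in for molecules, and the assumption that an element with a fitting requirement acts as a sync point is taken from the behaviour described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass
class Element:
    """Hypothetical stand-in for a pipeline element (not a Molpipeline class)."""

    name: str
    transform: Callable[[int], int]
    requires_fitting: bool = False


def run_pipeline(
    elements: List[Element], inputs: Iterable[int], stats: Dict[str, int]
) -> Iterator[int]:
    """Chain the elements lazily, buffering only where fitting is required."""
    stream: Iterator[int] = iter(inputs)
    for element in elements:
        if element.requires_fitting:
            # Sync point: all upstream results must exist before fitting,
            # so every intermediate object is held in memory at once.
            buffered = list(stream)
            stats["peak_buffered"] = max(stats["peak_buffered"], len(buffered))
            stream = iter(buffered)
        # Non-syncing step: instances flow through one at a time.
        stream = map(element.transform, stream)
    return stream


def peak_buffered(elements: List[Element], n: int) -> Tuple[int, int]:
    """Return (largest buffer size, number of results) for n toy instances."""
    stats = {"peak_buffered": 0}
    total = sum(1 for _ in run_pipeline(elements, range(n), stats))
    return stats["peak_buffered"], total


parse = Element("parse", lambda x: x + 1)
describe_sync = Element("describe", lambda x: x * 2, requires_fitting=True)
describe_stream = Element("describe", lambda x: x * 2, requires_fitting=False)

peak_sync, n_sync = peak_buffered([parse, describe_sync], 1000)
peak_stream, n_stream = peak_buffered([parse, describe_stream], 1000)
print(peak_sync, peak_stream)  # prints: 1000 0
```

In this toy model both variants produce the same 1000 results, but the element that requires fitting forces all 1000 intermediates to exist at once, while the other variant never buffers anything. Whether the flag can safely stay unset in Molpipeline depends on whether the standardizer actually needs to be fitted.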