basf / MolPipeline

RAM issue in MolToDescriptorPipelineElement when standardizer not None #23

Open JochenSiegWork opened 2 months ago

I tried to process a data set of 1.4M molecules with a small Pipeline looking like this:

pipeline = Pipeline(
    [
        ("smi2mol", SmilesToMol()),
        ("net_charge_element", MolToNetCharge()),  # MolToNetCharge inherits from MolToDescriptorPipelineElement
    ]
)

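This leads to RAM issues because MolPipeline tries to hold the RDKit data structures for all 1.4M molecules in RAM simultaneously. This happens because MolPipeline splits the pipeline elements into syncing and non-syncing parts for its instance-based processing. An element that requires fitting acts as a syncing point, so every molecule produced by the preceding steps has to be kept in memory before that element can run.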

In the constructor of MolToDescriptorPipelineElement, _requires_fitting is set to True whenever the standardizer is not None, which is the case with the default arguments used above:

if self._standardizer is not None:
    self._requires_fitting = True
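
To make the effect of this flag concrete, here is a minimal toy model of the splitting. It is not MolPipeline's actual implementation: `ToyStep` and `split_into_blocks` are hypothetical names, and the only assumption taken from this issue is that a step with `requires_fitting=True` cannot be processed per instance together with its neighbours.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class ToyStep:
    """Hypothetical stand-in for a pipeline element (not a MolPipeline class)."""

    name: str
    requires_fitting: bool = False


def split_into_blocks(steps):
    """Group consecutive steps by whether they need fitting.

    Blocks flagged False could be run one molecule at a time.
    Blocks flagged True need all inputs at once, so everything
    produced before them has to be kept in memory.
    """
    return [
        (needs_sync, [step.name for step in group])
        for needs_sync, group in groupby(steps, key=lambda s: s.requires_fitting)
    ]


# Default case from this issue: the standardizer is set, so the flag is True.
default_steps = [
    ToyStep("smi2mol"),
    ToyStep("net_charge_element", requires_fitting=True),
]
# Workaround case: standardizer=None, so the flag stays False.
no_standardizer_steps = [
    ToyStep("smi2mol"),
    ToyStep("net_charge_element"),
]

blocks_default = split_into_blocks(default_steps)
blocks_no_std = split_into_blocks(no_standardizer_steps)

print(blocks_default)  # [(False, ['smi2mol']), (True, ['net_charge_element'])]
print(blocks_no_std)   # [(False, ['smi2mol', 'net_charge_element'])]
```

In the toy model the default configuration yields two blocks, so all parsed molecules would have to exist before the second block starts. Without the standardizer there is a single block, and each molecule could be parsed, processed, and discarded in turn.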

The RAM issues can be avoided by explicitly disabling the standardizer:

pipeline = Pipeline(
    [
        ("smi2mol", SmilesToMol()),
        ("net_charge_element", MolToNetCharge(standardizer=None)),
    ]
)

It would be better if the standardization were implemented in a way that does not require holding all molecules in RAM at once.