coleygroup / molpal

active learning for accelerated high-throughput virtual screening
MIT License

[QUESTION]: training dataloader is currently defined with shuffle=False. Is this intentional? #38

Closed coparks2012 closed 1 year ago

coparks2012 commented 2 years ago

Hello,

I have really enjoyed going through the molpal codebase.

While studying the code, I noticed that the default value for shuffle in the MoleculeDataLoader class is False. Currently, the train_dataloader used to train the MPN model in the train function of the MPN class in mpnmodels.py (lines 177-179) is created without overriding that default, so it uses shuffle=False.

train_dataloader = MoleculeDataLoader(
            train_data, self.batch_size, self.ncpu, pin_memory=False
        )

I am wondering whether this is intentional. Iterating over the training data in the same fixed order every epoch could potentially hurt generalization. The code below would create the dataloader with shuffle=True instead:

train_dataloader = MoleculeDataLoader(
            train_data, self.batch_size, self.ncpu, shuffle=True, pin_memory=False
        )
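
For reference, the difference between the two settings can be sketched with a toy index sampler in plain Python. This is a hypothetical illustration, not molpal's or chemprop's actual sampler: with shuffle=False every epoch produces identical batches in dataset order, while shuffle=True re-permutes the indices each epoch.

```python
import random

def batches(n_samples, batch_size, shuffle, seed=None):
    """Toy sampler: yield lists of sample indices, one list per batch.

    Mimics the role of a DataLoader's shuffle flag; not actual chemprop code.
    """
    idx = list(range(n_samples))
    if shuffle:
        # A fresh permutation each epoch; seeded here only for reproducibility
        random.Random(seed).shuffle(idx)
    for i in range(0, n_samples, batch_size):
        yield idx[i:i + batch_size]

# shuffle=False: batches are always [0,1,2], [3,4,5], [6,7], every epoch
fixed = list(batches(8, 3, shuffle=False))

# shuffle=True: same indices overall, but batch composition varies by epoch
mixed = list(batches(8, 3, shuffle=True, seed=0))
```

With a fixed order, the model sees the same gradient sequence every epoch, which is the ordering dependence being asked about here.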
davidegraff commented 2 years ago

Hey @coparks2012 - thanks for poring through the code. You're right that it's odd that we don't shuffle the data and that this might cause training instability depending on the ordering of the dataset. However, I can tell you that the MPNN code was adapted from the legacy chemprop code, and there were quite a few odd design choices in there. When I wrote this, my goal was simply to interface the two codebases with as few changes to chemprop as possible. Of course, that resulted in instances like this, but it seems to work fairly well across a variety of tasks. If we were to change it, we'd have to redo a bunch of the original studies to assess the impact (positive, negative, or neutral) of shuffling on overall optimization performance, and that's just something I've never gotten around to.