chainer / chainer-chemistry

Chainer Chemistry: A Library for Deep Learning in Biology and Chemistry
MIT License
618 stars 129 forks source link

Random seed is useless in get_molnet_dataset function #418

Open Minys233 opened 4 years ago

Minys233 commented 4 years ago

In this function, although there is a seed=777 argument in the signature like this. https://github.com/chainer/chainer-chemistry/blob/56e83dedb5de9dc9eb08ebf292be9ba76a4883ba/chainer_chemistry/datasets/molnet/molnet.py#L24-L28

But it's never passed to any splitter, in the same function, the splitter is called here without seed argument: https://github.com/chainer/chainer-chemistry/blob/56e83dedb5de9dc9eb08ebf292be9ba76a4883ba/chainer_chemistry/datasets/molnet/molnet.py#L104-L130

Then, in the splitter (here the ScaffoldSplitter), the seed argument is still None: https://github.com/chainer/chainer-chemistry/blob/56e83dedb5de9dc9eb08ebf292be9ba76a4883ba/chainer_chemistry/dataset/splitters/scaffold_splitter.py#L62-L65

According to the implementation, the seed=None means it be initialized by reading data from /dev/urandom according to the numpy docs. https://github.com/chainer/chainer-chemistry/blob/56e83dedb5de9dc9eb08ebf292be9ba76a4883ba/chainer_chemistry/dataset/splitters/scaffold_splitter.py#L23-L35

This bug will cause data split inconsistent across different models and different run, even if we explicitly specify the same seed, and the default seed 777 here is useless.

PS: I use Pycharm debug tool to validate above procedure.

corochann commented 4 years ago

Thank you for report. @Minys233

I think you are right. Seems we need to change this line into splitter = split_method_dict[split](seed=seed). https://github.com/chainer/chainer-chemistry/blob/56e83dedb5de9dc9eb08ebf292be9ba76a4883ba/chainer_chemistry/datasets/molnet/molnet.py#L108

JFYI: sorry that we moved to maintenance phase and development is not so active recently. But I can merge the PR for bug fix.