Open Minys233 opened 4 years ago
Thank you for report. @Minys233
I think you are right. Seems we need to change this line into splitter = split_method_dict[split](seed=seed)
.
https://github.com/chainer/chainer-chemistry/blob/56e83dedb5de9dc9eb08ebf292be9ba76a4883ba/chainer_chemistry/datasets/molnet/molnet.py#L108
JFYI: sorry that we moved to maintenance phase and development is not so active recently. But I can merge the PR for bug fix.
In this function, although there is a
seed=777
argument in the signature like this. https://github.com/chainer/chainer-chemistry/blob/56e83dedb5de9dc9eb08ebf292be9ba76a4883ba/chainer_chemistry/datasets/molnet/molnet.py#L24-L28But it's never passed to any splitter, in the same function, the splitter is called here without
seed
argument: https://github.com/chainer/chainer-chemistry/blob/56e83dedb5de9dc9eb08ebf292be9ba76a4883ba/chainer_chemistry/datasets/molnet/molnet.py#L104-L130Then, in the splitter (here the ScaffoldSplitter), the
seed
argument is stillNone
: https://github.com/chainer/chainer-chemistry/blob/56e83dedb5de9dc9eb08ebf292be9ba76a4883ba/chainer_chemistry/dataset/splitters/scaffold_splitter.py#L62-L65According to the implementation, the
seed=None
means it be initialized by reading data from/dev/urandom
according to the numpy docs. https://github.com/chainer/chainer-chemistry/blob/56e83dedb5de9dc9eb08ebf292be9ba76a4883ba/chainer_chemistry/dataset/splitters/scaffold_splitter.py#L23-L35This bug will cause data split inconsistent across different models and different run, even if we explicitly specify the same seed, and the default seed 777 here is useless.
PS: I use Pycharm debug tool to validate above procedure.