awslabs / dgl-lifesci

Python package for graph neural networks in chemistry and biology
Apache License 2.0

num_worker training dependency #196

Open hnbabaei opened 2 years ago

hnbabaei commented 2 years ago

Hi mufeili, I have a couple of questions which I would appreciate your help with.

- Changing the number of workers changes the number of epochs required to converge, which is not expected. Increasing the number of CPUs increases the training time. Any advice on why these happen?

- Could we use the graph.bin file generated previously to start training without loading graphs from a .csv file?

Thanks.

mufeili commented 2 years ago

Hi, which example are you talking about?

hnbabaei commented 2 years ago

Hi, the property_prediction example with csv_data_configuration. I used the regression_train.py code.

mufeili commented 2 years ago

Sorry for the late reply.

Changing the number of workers changes the number of epochs required to converge, which is not expected. Increasing the number of CPUs increases the training time. Any advice on why these happen?

Have you eliminated all sources of randomness? By default, regression_train.py does not do so, e.g., it does not fix the random seed. Without eliminating randomness, we cannot perform a fair comparison.
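For a fair comparison across runs, a minimal sketch of fixing the seeds up front (assuming PyTorch and DGL as the backends; the helper below is not something regression_train.py provides out of the box):

import random
import numpy as np
import torch
import dgl

def set_seed(seed=0):
    # Seed every RNG involved so repeated runs are comparable
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    dgl.seed(seed)
    # Optional: trade some speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Note that with num_workers > 0, each DataLoader worker process has its own RNG state, so runs can still differ slightly unless a worker_init_fn is supplied as well.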

Could we use the graph.bin file generated previously to start training without loading graphs from a .csv file?

Yes, you can set load=True here.
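For reference, a rough sketch of constructing the dataset with caching enabled (the file and column names below are placeholders, and the exact constructor arguments may differ across dgllife versions):

import pandas as pd
from functools import partial
from dgllife.data import MoleculeCSVDataset
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer

df = pd.read_csv('my_data.csv')  # placeholder path
dataset = MoleculeCSVDataset(
    df,
    smiles_to_graph=partial(smiles_to_bigraph, add_self_loop=True),
    node_featurizer=CanonicalAtomFeaturizer(),
    smiles_column='smiles',
    cache_file_path='graph.bin',
    load=True  # reuse the cached graphs in graph.bin instead of re-featurizing
)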

hnbabaei commented 2 years ago

Thanks very much for your response.

I am trying to use my own splitting for the train/test/val sets, which is based on splitting the 0 and 1 labels separately. I have a column that holds the split labels. Is there an easy way to do this? Currently, I have added the following lines to classification_train.py and regression_train.py:

I added a -ttvc (--train-test-val-col) argument that indicates the column holding the train/test/val split labels:

parser.add_argument('-ttvc', '--train-test-val-col', default=None, type=str,
                    help='column for train-test-val split labels. If None, we will use '
                         'the default method in dgllife for splitting. '
                         '(default: None)')

And here is the change I made where the data gets read and split:

if args['train_test_val_col'] is not None:
    train_set = load_dataset(args, df[df[args['train_test_val_col']]=='train'])
    test_set = load_dataset(args, df[df[args['train_test_val_col']]=='test'])
    val_set = load_dataset(args, df[df[args['train_test_val_col']]=='valid'])
else:
    train_set, val_set, test_set = split_dataset(args, dataset)

Thanks

hnbabaei commented 2 years ago

I actually found the SingleTaskStratifiedSplitter class, which I think will do what I need, but I did not see it among the options for the splitting method. I will try to use it. Please let me know if you think this is the correct way to do it.
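For what it's worth, a rough sketch of how SingleTaskStratifiedSplitter might be used (dataset here stands for a MoleculeCSVDataset; check the dgllife.utils documentation for the exact signature):

from dgllife.utils import SingleTaskStratifiedSplitter

# dataset.labels is expected to be a tensor of shape (N, n_tasks)
train_set, val_set, test_set = SingleTaskStratifiedSplitter.train_val_test_split(
    dataset,
    labels=dataset.labels,
    task_id=0,          # which task's labels to stratify on
    frac_train=0.8,
    frac_val=0.1,
    frac_test=0.1,
    random_state=0
)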

mufeili commented 2 years ago

That should work. Feel free to reach out if you encounter any further issues.

hnbabaei commented 2 years ago

Thanks, Mufei. Just wondering if the code has ever been used for large-scale datasets (e.g., 100 million molecules). If so, what would you suggest using or changing within the code to make it scalable and memory efficient? Thanks.

mufeili commented 2 years ago

Thanks, Mufei. Just wondering if the code has ever been used for large-scale datasets (e.g., 100 million molecules). If so, what would you suggest using or changing within the code to make it scalable and memory efficient? Thanks.

I have not tested the code at that scale. Likely you will need to check whether you have enough memory to load the data at once, or alternatively load the data in batches. You will also need more computational resources, e.g., multi-GPU training. The example here might help.
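As one possible direction, a minimal sketch of featurizing molecules on the fly instead of caching every graph in memory (LazyMoleculeDataset is a hypothetical class, not part of dgl-lifesci; the file and column names are placeholders):

import pandas as pd
import torch
from torch.utils.data import Dataset
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer

class LazyMoleculeDataset(Dataset):
    """Build each molecular graph on demand; only SMILES and labels stay in memory."""
    def __init__(self, csv_path, smiles_column='smiles', label_columns=('label',)):
        df = pd.read_csv(csv_path, usecols=[smiles_column, *label_columns])
        self.smiles = df[smiles_column].tolist()
        self.labels = torch.tensor(df[list(label_columns)].values, dtype=torch.float32)
        self.node_featurizer = CanonicalAtomFeaturizer()

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        g = smiles_to_bigraph(self.smiles[idx], node_featurizer=self.node_featurizer)
        return self.smiles[idx], g, self.labels[idx]

Combined with a standard DataLoader and dgl.batch in the collate function, this keeps memory usage roughly proportional to the number of SMILES strings, at the cost of re-featurizing each molecule every epoch.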