EBjerrum / molvecgen

Molecular vectorization and batch generation
MIT License

Why doesn't this data augmentation method work with my dataset? #3

Closed queliyong closed 2 years ago

queliyong commented 2 years ago

Hi, I have a dataset with 27 samples and two columns: 'SMILES' and a molecular property. I split it into a training set (80%) and a test set (20%), then used this method to increase the size of the training set. After training a RandomForestRegressor, the test-set MSE was higher than without data augmentation, and the predicted values for the test set tended to be nearly identical, which I guess means the model is overfitted. I also tried the RNN model presented in the SMILES-enumeration repository, but it did not work either. Are 27 samples too few? Any help would be appreciated.

EBjerrum commented 2 years ago

I don't understand how you can use an RF efficiently directly on SMILES; CNNs or RNNs seem more adequate. However, 27 samples is a VERY small dataset for a deep learning model, even with augmentation. Look for opportunities to use transfer learning if you insist on using an RNN model, or try more classical ML approaches, such as fingerprints (also included in molvecgen) plus a simple model such as multiple linear regression (use regularization, such as lasso or ridge models).
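To illustrate the "fixed-length vector plus regularized linear model" route, here is a minimal sketch, not code from molvecgen. It uses only the Python standard library so it runs anywhere. `toy_fingerprint` is a hypothetical stand-in (hashed character bigrams) for a real fingerprint, the ridge solver is hand-rolled, and the target values are a dummy quantity rather than a measured property. In practice you would probably swap in a proper fingerprint and a library ridge or lasso implementation.

```python
import zlib


def toy_fingerprint(smiles, n_bits=32):
    """Hypothetical stand-in for a real fingerprint: hash character
    bigrams of the SMILES string into a fixed-length 0/1 vector."""
    bits = [0.0] * n_bits
    for i in range(len(smiles) - 1):
        bits[zlib.crc32(smiles[i:i + 2].encode()) % n_bits] = 1.0
    return bits


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ridge_fit(X, y, alpha=1.0):
    """Ridge regression: minimise ||Xc w - yc||^2 + alpha * ||w||^2
    on mean-centred data, so the intercept is not penalised."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations with alpha added on the diagonal.
    A = [[sum(Xc[k][i] * Xc[k][j] for k in range(n)) + (alpha if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(Xc[k][i] * yc[k] for k in range(n)) for i in range(d)]
    w = solve(A, b)
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept


def ridge_predict(model, x):
    """Predict one sample from a (weights, intercept) pair."""
    w, intercept = model
    return intercept + sum(wj * xj for wj, xj in zip(w, x))


smiles = ["CCO", "CCCO", "CCCCO", "CC(=O)O", "c1ccccc1", "CCN"]
# Dummy target (number of 'C'/'c' characters), NOT a real property.
y = [float(s.lower().count("c")) for s in smiles]
X = [toy_fingerprint(s) for s in smiles]
model = ridge_fit(X, y, alpha=1.0)
preds = [ridge_predict(model, x) for x in X]
```

A larger `alpha` shrinks the weights more strongly, which is the knob to tune when there are far more fingerprint bits than molecules.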

EBjerrum commented 2 years ago

This is more of a machine learning issue than a code issue. I'll close.

queliyong commented 2 years ago

I see. Thanks very much.