Closed queliyong closed 2 years ago
I don't understand how you can use an RF efficiently directly on SMILES; CNNs or RNNs seem more adequate. However, 27 samples is a VERY small dataset for a deep learning model, even with augmentation. Look for opportunities to use transfer learning if you insist on using an RNN model, or try more classical ML approaches, such as fingerprints (also included in molvecgen) plus a simple model such as multiple linear regression with regularization (lasso or ridge).
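To make the "fingerprints plus a regularized linear model" suggestion concrete, here is a minimal sketch of ridge regression scored with leave-one-out cross-validation. Several things are my assumptions rather than anything from this thread: the fingerprint matrix `X` is taken as already computed (the random bits below are only a placeholder for real fingerprints), `ridge_fit` and `loo_mse` are hypothetical helper names, and leave-one-out is just one reasonable way to evaluate with so few samples.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weights.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def loo_mse(X, y, alpha):
    """Leave-one-out mean squared error for a given ridge penalty."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # hold out sample i
        w, b = ridge_fit(X[keep], y[keep], alpha)
        errors.append((X[i] @ w + b - y[i]) ** 2)
    return float(np.mean(errors))

# Placeholder data only: 27 "molecules" with 64 random fingerprint bits
# and a synthetic target. Replace X and y with real fingerprints and
# measured property values.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(27, 64)).astype(float)
true_w = np.zeros(64)
true_w[:5] = 1.0
y = X @ true_w + rng.normal(0.0, 0.1, size=27)

# Compare a few penalty strengths and keep the one with the lowest error.
for alpha in (0.01, 1.0, 10.0, 100.0):
    print(alpha, loo_mse(X, y, alpha))
```

With only 27 samples, a single 80/20 split leaves about five test molecules, so the test MSE is very noisy; averaging over held-out samples as above tends to give a steadier picture of which penalty works.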
This is more of a machine learning issue than a code issue, so I'll close it.
I see. Thanks very much.
Hi, I have a dataset with 27 samples and two columns: 'SMILES' and a molecular property. I split it into a training set (80%) and a test set (20%), then used this method to increase the size of the training set. After training a RandomForestRegressor, the test-set MSE was higher than without data augmentation, and the predictions for the test set tended toward the same value, which I guess means the model is overfitting. I also tried the RNN model presented in the SMILES-enumeration repository, but it did not work either. Are 27 samples too few? Any help would be appreciated.
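For anyone reproducing the setup described above, here is a rough sketch of the split-then-augment order: the molecules are split first, and only the training part is enumerated, so no variant of a test molecule ends up in training. The function name `split_then_augment` and the `augment` callable are hypothetical placeholders for whatever enumeration routine is used, and the dummy IDs below are not real SMILES.

```python
import random

def split_then_augment(records, augment, test_fraction=0.2, seed=0):
    """Split on original molecules first, then augment only the training part.

    records: list of (smiles, target) pairs, one per molecule.
    augment: callable returning a list of alternative SMILES for one
             molecule (hypothetical stand-in for an enumeration routine).
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Each enumerated SMILES inherits the target of its parent molecule.
    train_aug = [(s, y) for smi, y in train for s in [smi, *augment(smi)]]
    return train_aug, test

# Dummy stand-ins: 27 molecule IDs and a fake augmenter with two variants each.
records = [(f"mol{i}", float(i)) for i in range(27)]
fake_augment = lambda smi: [f"{smi}_a", f"{smi}_b"]

train_aug, test = split_then_augment(records, fake_augment)
print(len(train_aug), len(test))
```

Note that augmentation adds rows but no new molecules, so the model still sees only the original training targets however many SMILES variants are generated.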