AlexanderKroll / ProSmith

MIT License

The number of data points in ESP dataset #9

Closed Hong-yu-Zhang closed 1 month ago

Hong-yu-Zhang commented 2 months ago

Dear Alex: Thanks for the good work! I am training my own ESI model using the ESP dataset. The paper reports 850,291 data points in the training set. However, I found only 765,639 data points in "ESP\train_val_phylo\train_df.csv" and 50,098 data points in "ESP\train_val\train_df.csv". Did I miss some data preprocessing steps? Also, when training a new model, is it appropriate to concatenate "ESP\train_val_phylo\train_df.csv" and "ESP\train_val\train_df.csv" to make a fair comparison with models like ESP and ProSmith?

AlexanderKroll commented 1 month ago

Hi Hong-yu-Zhang,

Sorry for my late reply! Cool that you are working on a new ESI model! Can you further improve the performance of the model? If you have a preprint, I would be very interested in reading it. Maybe you can send it to me when you have something?

Thanks for pointing out the discrepancy between the reported number of data points and the number of data points in the dataset. At first glance I am not sure why this is the case; I will look into it further. The dataset at "ESP\train_val\train_df.csv" already contains all training data points with experimental and phylogenetic evidence, so you do not need to concatenate the two datasets. Also note that the phylo dataset was only used to train the ProSmith Transformer network, not the gradient boosting models.
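
For anyone who wants to check the counts and the overlap between the two files themselves, here is a minimal sketch using only the Python standard library. It assumes both files are plain CSVs with a header row. The function names and the `key_columns` argument are hypothetical, and the columns that identify an enzyme-substrate pair have to be taken from the actual files, since they are not stated in this thread.

```python
import csv


def load_keys(path, key_columns):
    """Return (row_count, set of key tuples) for a CSV file with a header row."""
    keys = set()
    count = 0
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            count += 1
            # Build a hashable key from the columns that identify a data point.
            keys.add(tuple(row[col] for col in key_columns))
    return count, keys


def compare_splits(path_a, path_b, key_columns):
    """Report row counts and how the key sets of two CSV files overlap."""
    rows_a, keys_a = load_keys(path_a, key_columns)
    rows_b, keys_b = load_keys(path_b, key_columns)
    return {
        "rows_a": rows_a,
        "rows_b": rows_b,
        "shared": len(keys_a & keys_b),    # keys present in both files
        "only_in_a": len(keys_a - keys_b),  # keys missing from file B
        "only_in_b": len(keys_b - keys_a),  # keys missing from file A
    }
```

Calling `compare_splits` on the two `train_df.csv` paths, with `key_columns` set to the real enzyme and substrate column names, shows whether one file is a subset of the other. If `only_in_a` or `only_in_b` is 0, concatenating the files would only add duplicates.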