PoonLab / OpenRDP

An open-source re-implementation of the RDP4 recombination detection program
GNU General Public License v3.0

Model Selection #12

Closed kwade4 closed 3 years ago

kwade4 commented 3 years ago

I trained a Random Forest classifier on the data (202 breakpoints) using an 80-20 train-test split and performed 5-fold cross-validation.

So far, the best-performing RF model has a cross-validation accuracy of about 0.65. The test-set accuracy (0.56) is only slightly better than random guessing, but the training accuracy is very high (approx. 0.98).
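A gap like this between training and held-out accuracy can be checked with a short script. The following is only a minimal sketch, assuming scikit-learn (the comment does not say which library was used). The feature matrix and labels are random placeholders, not the real breakpoint data, and the number of features and trees are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data: 202 rows of random features with random binary labels.
# These are NOT the real breakpoint data; they only make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.random((202, 8))
y = rng.integers(0, 2, size=202)

# 80-20 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation on the training portion only.
cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")

# Refit on the full training portion and compare train vs. test accuracy.
clf.fit(X_train, y_train)
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))

print(f"CV accuracy:    {cv_scores.mean():.2f}")
print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
```

A training accuracy far above both the CV and test accuracy is the usual sign that the forest is memorising the training rows. Limiting tree depth (for example via `max_depth` or `min_samples_leaf`) is one common way to narrow that gap, though whether it helps here is untested.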

It seems like the model is overfitting, so I will look into adding more data and see if the training and testing sets are balanced. I will also

kwade4 commented 3 years ago

I expanded my dataset to include 568 signals. Adding more data alone did not change the model's performance much.

I am looking into creating features such as the number of methods that detected a signal and the standard deviation of the p-values. Adding these features increases the CV accuracy to approx. 0.66. The most notable improvements are in the true positive and false positive rates. With these features, the model has a balanced accuracy of 0.67.
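The two features and the balanced accuracy metric mentioned above could be computed roughly as follows. This is a sketch under assumptions: the per-signal input is taken to be a dict of method name to p-value (with `None` where a method reported nothing), the 0.05 threshold for "detected" is a guess, and the function and key names are hypothetical rather than taken from the OpenRDP code.

```python
from statistics import stdev

ALPHA = 0.05  # assumed significance threshold for "signal detected"


def signal_features(p_values, alpha=ALPHA):
    """Summarise one signal's per-method p-values as two features.

    p_values: dict of method name -> p-value, or None if the method
    reported nothing for this signal (hypothetical input layout).
    """
    observed = [p for p in p_values.values() if p is not None]
    n_detected = sum(p < alpha for p in observed)
    # Sample standard deviation needs at least two values.
    p_sd = stdev(observed) if len(observed) > 1 else 0.0
    return {"n_methods_detected": n_detected, "p_value_sd": p_sd}


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(t == 1 for t in y_true)
    n_neg = sum(t == 0 for t in y_true)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in y_true")
    return (tp / n_pos + tn / n_neg) / 2


# Made-up example values, for illustration only.
example = {"RDP": 0.001, "GENECONV": 0.2, "MaxChi": None, "Chimaera": 0.03}
features = signal_features(example)
score = balanced_accuracy([1, 1, 1, 0], [1, 1, 0, 0])
```

Balanced accuracy is the more informative number when the two classes are not equally frequent, because a model that favours the majority class is penalised through the other class's rate.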

kwade4 commented 3 years ago

Add ROC for RDP5.

kwade4 commented 3 years ago

Adding the ROC curve for RDP5 gives an AUROC of 0.5. This makes sense because the dataset is itself the output of RDP5 (p-values for recombination breakpoint locations), so RDP5 is effectively a "model" that always predicts true on this dataset.
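The 0.5 value follows directly from how AUROC is defined: it is the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. A classifier that gives every example the same score ties on every pair. The sketch below illustrates this with a plain pairwise implementation and made-up labels; it is not the code used to produce the result above.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels, for illustration only.
labels = [1, 0, 1, 1, 0, 0, 1, 0]

# A "model" that always predicts true gives every example the same score.
always_true = [1.0] * len(labels)
constant_auc = auroc(labels, always_true)

# For contrast, a scorer that ranks every positive above every negative.
perfect_auc = auroc(labels, [float(y) for y in labels])

print(constant_auc, perfect_auc)
```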