bittremieux / ANN-SoLo

Spectral library searching using approximate nearest neighbor techniques.
Apache License 2.0
42 stars · 19 forks

Rescore SSMs using semi-supervised learning #19

Closed issararab closed 2 years ago

issararab commented 2 years ago

Rescore SSMs using semi-supervised learning with mokapot.

Contributions:

bittremieux commented 2 years ago

@wfondrie We're making progress with the mokapot integration. 🙂 What is your experience with the need to perform hyperparameter tuning of classifiers (incl. model choice) inside mokapot? Our initial tests indicated a significant performance difference between a linear SVM and a random forest (RF).

wfondrie commented 2 years ago

It depends on the particular search engine and features, but I also tend to see performance improvements with other types of tree-based models (XGBoost). That's not always the case though - in particular, the linear SVM works pretty well for a standard DB search with Tide.

As for tuning hyperparameters, sklearn's GridSearchCV and RandomizedSearchCV should be fully compatible for use as the estimator in mokapot.Model(). As for choosing between multiple types of models, you'll have to write something more custom. As long as it acts like sklearn's GridSearchCV, it should work though!
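As a sketch of what that looks like in practice: a `GridSearchCV` is built around the base classifier and handed to `mokapot.Model()` unchanged. The grid values and synthetic data below are illustrative, not from the thread; the mokapot calls are shown only as comments since they need real PSM input.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Hyperparameter search over the SVM regularization strength; any
# GridSearchCV-like object can be passed directly to mokapot.Model().
grid = GridSearchCV(
    LinearSVC(max_iter=10_000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
)

# Synthetic stand-in for SSM features (the real input comes from ANN-SoLo).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
grid.fit(X, y)
print(grid.best_params_)

# With mokapot installed, the tuned estimator would be used as:
#   model = mokapot.Model(grid)
#   results, models = mokapot.brew(psms, model)
```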

bittremieux commented 2 years ago

Ok, that matches our evaluations so far. I was mainly wondering whether there are specific best practices or not.

I think we'll stick to a random forest rather than XGBoost, because RF is less sensitive to hyperparameters. We'll have the option to switch between RF (default) and a Percolator-like SVM. RF seemed to do especially better for the open search results, which intuitively makes sense because a non-linear classifier might be able to better capture the different groups.
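The RF-default-with-SVM-option setup could be as simple as a small factory function; `make_rescoring_model` and its hyperparameter values below are a hypothetical sketch, not ANN-SoLo's actual code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC


def make_rescoring_model(model_type="rf"):
    """Return a classifier for SSM rescoring (hypothetical helper).

    "rf" (default) is less sensitive to hyperparameters and its non-linear
    decision boundary suits the heterogeneous groups in open-search results;
    "svm" mimics Percolator's linear SVM.
    """
    if model_type == "rf":
        return RandomForestClassifier(n_estimators=100, random_state=1)
    elif model_type == "svm":
        return LinearSVC(C=1.0, max_iter=10_000)
    raise ValueError(f"Unknown model type: {model_type}")
```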

Do you maybe also have some suggestions in terms of features? We have a decent set already, but additional information-rich features are always better, of course.


wfondrie commented 2 years ago

Do you maybe also have some suggestions in terms of features?

Not really other than the usual suspects---make sure that charge is included (preferably one-hot encoded).
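One-hot encoding the charge keeps the classifier from treating it as an ordinal quantity and lets each charge state get its own weight. A minimal sketch with sklearn's `OneHotEncoder` and made-up charge values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Precursor charges for a handful of SSMs (synthetic example values).
charges = np.array([[2], [3], [2], [4]])

# One column per observed charge state; unseen charges at prediction time
# map to all zeros rather than raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
charge_features = encoder.fit_transform(charges).toarray()
print(charge_features.shape)  # (4, 3): three distinct charge states
```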

The real key is just to make sure we don't include features that compromise the integrity of the FDR estimates. These are typically features that describe peptide properties in the database search setting, like amino-acid or k-mer frequencies, but I'm not sure what the analogous features would be for library searching.

I would recommend testing your final feature set empirically with an entrapment experiment to verify that we're not getting liberal FDR estimates due to such a feature (or, more sneakily, a combination of features). I've never done this for a library search, but my inclination would be to append another, larger spectral library to the target library and use that as the entrapment database.
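For the entrapment check, one simple estimator assumes false matches distribute proportionally to library size, so each accepted entrapment hit implies additional false hits hiding among the target-library matches. The helper below is a hedged sketch of that back-of-the-envelope calculation (the function name and example numbers are made up, not from this thread):

```python
def entrapment_fdp(n_entrapment_hits, n_accepted, target_size, entrapment_size):
    """Estimate the false discovery proportion from an entrapment search.

    Each entrapment hit implies ~target_size/entrapment_size further false
    hits that landed in the target library by chance.
    """
    if n_accepted == 0:
        return 0.0
    est_false = n_entrapment_hits * (1 + target_size / entrapment_size)
    return min(est_false / n_accepted, 1.0)


# E.g. 12 entrapment hits among 1,000 accepted SSMs, with an entrapment
# library twice the size of the target library:
print(entrapment_fdp(12, 1000, target_size=10_000, entrapment_size=20_000))
```

If this estimate comes out well above the nominal FDR threshold, some feature (or combination) is likely leaking peptide-level information into the score.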