ccdmb / predector

Effector prediction pipeline based on protein properties.
Apache License 2.0

Implement effector classification, clustering, and ranking strategy #22

Closed: darcyabjones closed 4 years ago

darcyabjones commented 4 years ago

This is required for the version 1 release. Progress is being tracked in the project classifier and ranking....

darcyabjones commented 4 years ago

Essentially we need a way of deciding what looks like an effector and what doesn't.

There are 3 methods that we've discussed to do this:

  1. An ML meta-classifier that takes the results of the analyses and reports a "probability" [0-1] of effector-ness.
  2. A ranking method, where we manually assign weights to results from analyses based on how important we think they are and use the sum as the effector score. Essentially, this would be a logistic regression classifier without the logistic bit and with manual coefficients.
  3. Hierarchical clustering of the results. This has the benefit of identifying groups of proteins with common features, and does not rely on us defining what an effector looks like a priori.
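
To make the ranking method concrete, here is a minimal sketch of a manually weighted score: a linear sum of analysis results with hand-picked coefficients and no logistic transform. The feature names, weights, and protein values are hypothetical and chosen only for illustration; they are not the weights used by the pipeline.

```python
# Hypothetical feature names and manual weights (illustration only).
WEIGHTS = {
    "signal_peptide": 2.0,
    "effector_predictor": 3.0,
    "transmembrane": -2.0,
}


def weighted_score(features, weights=WEIGHTS):
    """Return the weighted sum of the features (linear score, no logistic step)."""
    # Features missing from a protein's results contribute 0 to the score.
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


# Made-up, already normalised analysis results for three proteins.
proteins = {
    "protA": {"signal_peptide": 1.0, "effector_predictor": 0.9, "transmembrane": 0.0},
    "protB": {"signal_peptide": 1.0, "effector_predictor": 0.2, "transmembrane": 1.0},
    "protC": {"signal_peptide": 0.0, "effector_predictor": 0.5, "transmembrane": 0.0},
}

scores = {name: weighted_score(feats) for name, feats in proteins.items()}

# Rank proteins from most to least effector-like.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the score is a plain sum, the inputs need to be on comparable scales before weighting, which is why user-supplied data would have to be normalised first.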

Note that with the ML classifier, we wouldn't be able to include user-supplied data (e.g. positive selection; unless we then include that analysis in the pipeline). For the manual weights method, users would have to supply the weights and normalise their own data.

I think what we'll go for is a combination of all three methods.

  1. Calculate ML and weighted scores for each protein.
  2. Cluster the protein results to identify groups at some cutoff threshold to be determined.
  3. Identify clusters of effectors with a high average or median effector score.
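
Step 3 above could look roughly like the sketch below. It assumes the cluster labels have already been produced by some hierarchical clustering step cut at a threshold, and that each protein already has an effector score; the protein names, labels, and scores are made up, and `rank_clusters` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median


def rank_clusters(cluster_of, score_of):
    """Group protein scores by cluster and sort clusters by median score, highest first."""
    grouped = defaultdict(list)
    for protein, cluster in cluster_of.items():
        grouped[cluster].append(score_of[protein])
    medians = {cluster: median(values) for cluster, values in grouped.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)


# Hypothetical cluster assignments (e.g. from cutting a dendrogram) and scores.
cluster_of = {"protA": 1, "protB": 2, "protC": 1, "protD": 2, "protE": 2}
score_of = {"protA": 4.5, "protB": 0.5, "protC": 3.5, "protD": 1.0, "protE": 2.0}

cluster_ranking = rank_clusters(cluster_of, score_of)
```

The median is used here rather than the mean so that a single very high-scoring protein does not lift an otherwise unremarkable cluster; either choice fits the description above.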

darcyabjones commented 4 years ago

For now we're only progressing with the ranking method, which James and I are currently finalising.

The ML method that I'd like to implement is a learning-to-rank solution, specifically the LambdaMART implementation in xgboost. Boosted trees have a few nice properties that work for us here, in particular ease of interpretation and the ability to weight samples, which we can use to overcome the class imbalance.
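
As an illustration of the sample-weighting idea, the sketch below derives per-sample weights that are inversely proportional to class frequency, so that the rare effector class carries the same total weight as the abundant non-effector class. This "balanced" heuristic is an assumption for the example, not necessarily the weighting used here; `balanced_sample_weights` is a hypothetical helper and the labels are made up.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


# Made-up training labels: 1 known effector, 9 non-effectors.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
weights = balanced_sample_weights(labels)

# Total weight per class, to check that both classes contribute equally.
effector_weight = sum(w for w, y in zip(weights, labels) if y == 1)
non_effector_weight = sum(w for w, y in zip(weights, labels) if y == 0)
```

Weights like these would then be supplied to the tree booster at training time; how they are passed depends on the library and objective, so that part is left out.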

darcyabjones commented 4 years ago

This has now been done. We use a learning-to-rank method. It's much more reliable than the manual scores, particularly the one without homology.