PNNL-CompBio / Snekmer

Pipeline to apply encoded Kmer analysis to protein sequences
BSD 3-Clause "New" or "Revised" License

Negative control fail #39

Closed by biodataganache 2 years ago

biodataganache commented 2 years ago

Based on the odd results observed in #38, I tried running a negative control: families were assembled by grouping random proteins together, and then Snekmer was run on them. This negative control came back positive (that is, it failed), and Snekmer was able to predict these fake families with high AUCs (see attached results: one.csv, three.csv, two.csv).

I'm adding the fake family files I used for the test, and the results files for those fake families, to the repo and will point to them here. It would be good for someone else to run this, and also to build a new set of fake families for testing, to confirm this is really happening before we rip everything apart.

biodataganache commented 2 years ago

Example negative control files for snekmer input are here: https://github.com/biodataganache/Snekmer/tree/main/resources/negative_control

christinehc commented 2 years ago

This is concerning, and I'll take a closer look. Some initial thoughts: (1) sample size may be an issue (hence the very large performance difference between splits); (2) if the background_5k file was included in the analysis, the issue may lie there (i.e., the background may not be incorporated properly).

wichne commented 2 years ago

I made two input sets: one a true family of RecA proteins, the other a random collection of proteins of about the same length (350-370 residues). I made sets with 140 and 1400 members and used the 100 random genomes as background.

Looking at the score/ subdir, in the negative control file there are no kmers with a score > 0.043, and most have scores < 0.008. It seems like a model shouldn't even be built, because none of the features (kmers) are sufficiently predictive. How is this handled in the program? It was also unclear where the background sequence data set is incorporated into the feature analysis.

As for the ROC and PR curves, those also looked unbelievably robust for the random family. Could this be a product of only looking at two families? Is the inclusion of too many features resulting in low specificity? Maybe we should search additional random proteins against the random model and see if they score well?

The 1400-member set is still running; I will look at it tomorrow.

wichne commented 2 years ago

One note: currently in the config.yaml, the default min_rep_thresh is '1', meaning all observed kmers are preserved. For the example I was working with, if I set that value to '0.67' (which the program interprets as the fraction of the training set that must contain a kmer), the program now errors out because the feature set is empty. That is the expected and desired behavior (although maybe the program shouldn't die; it could just skip building that family). We should consider either having a more restrictive default value or forcing the user to set one.
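For concreteness, here is a minimal sketch (not Snekmer's actual code; the function name and the count-matrix layout are assumptions) of how a fraction-valued min_rep_thresh filter could behave, including the "skip the family instead of dying" option:

```python
# Hypothetical sketch, not Snekmer's implementation: apply a fraction-valued
# min_rep_thresh to a (n_training_sequences, n_kmers) kmer count matrix.
import numpy as np

def apply_min_rep_thresh(kmer_counts, min_rep_thresh):
    """Keep kmer columns present in at least `min_rep_thresh` (a fraction in
    (0, 1)) of the training sequences; return the filtered matrix and mask."""
    presence_frac = (kmer_counts > 0).mean(axis=0)  # fraction of sequences containing each kmer
    mask = presence_frac >= min_rep_thresh
    if not mask.any():
        # Current behavior: the run errors out. An alternative would be to
        # return None here and simply skip building a model for this family.
        raise ValueError("Empty feature set: no kmer passes min_rep_thresh.")
    return kmer_counts[:, mask], mask
```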

biodataganache commented 2 years ago

I think I know what the issue is. The basis set choice is made on the entire input family, and the CV is done after that. This means the positive (in-family) examples are more likely to contain at least some of these kmers while the other examples may contain none, which makes it easy to pick out the family examples since their probability scores will be non-zero. This will be more of a problem with smaller families, and it is actually accentuated by randomly choosing family members (I think). I'm still working through this and will have updates soon. The solution is to do a cross-validation where the basis set is chosen from the training split each time. It's probably more complicated to set up in sklearn than how it is now (which is really slick).
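For intuition, here is a standalone toy example (plain sklearn, not Snekmer code) showing how choosing features on the full dataset before cross-validating can produce high AUCs even when the labels are completely random, which is essentially what the negative control is exposing:

```python
# Toy demonstration of feature-selection leakage, unrelated to Snekmer internals.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 5000)).astype(float)  # random "kmer presence" matrix
y = rng.integers(0, 2, size=60)                         # random "family" labels

# Leaky: pick the 50 most label-associated columns using ALL sequences, then CV.
X_leaky = SelectKBest(f_classif, k=50).fit_transform(X, y)
auc_leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y,
                            cv=5, scoring="roc_auc").mean()

# Correct: do the selection inside each training fold via a Pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=50), LogisticRegression(max_iter=1000))
auc_clean = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

# The leaky estimate is typically well above 0.5; the fold-wise estimate stays near 0.5.
print(f"leaky AUC = {auc_leaky:.2f}, fold-wise AUC = {auc_clean:.2f}")
```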

wichne commented 2 years ago

Wait, are you saying the training set included the test set? If so, yes, we need to fix that.

biodataganache commented 2 years ago

Yes, but in a non-obvious way. Since we choose the basis set ahead of time, it's not that the training and test sets literally overlap; rather, the features we're choosing mix information between the splits, I think. I still have to do some testing to make sure.

biodataganache commented 2 years ago

Here's pseudocode describing how the cross-validation should be done. For each of the folds:

  1. Split the proteins into train and test sets
  2. Determine the basis set for the train split
  3. Construct feature vectors with that basis set
  4. Calculate kmer probabilities for the train split
  5. Score the train AND test sets using those probabilities
  6. Fit the regression on the train split
  7. Apply the fitted regression parameters to the test split
  8. Evaluate performance on the test split (AUC, etc.)
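A rough, self-contained sketch of that loop is below. The kmer extraction, probability, and scoring helpers are deliberately simplified stand-ins for the corresponding Snekmer steps, not the real implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

K = 3  # illustrative kmer length

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_basis(train_seqs, k=K):
    """Basis set = union of kmers observed in the TRAIN split only."""
    basis = sorted(set().union(*(kmers(s, k) for s in train_seqs)))
    return {km: i for i, km in enumerate(basis)}

def vectorize(seqs, basis, k=K):
    """Binary presence/absence vectors over the given basis set."""
    X = np.zeros((len(seqs), len(basis)))
    for row, s in enumerate(seqs):
        for km in kmers(s, k):
            if km in basis:
                X[row, basis[km]] = 1.0
    return X

def cross_validate_family(sequences, labels, n_folds=5):
    labels = np.asarray(labels)
    aucs = []
    for tr, te in StratifiedKFold(n_folds, shuffle=True, random_state=0).split(sequences, labels):
        basis = build_basis([sequences[i] for i in tr])      # basis from train split only
        X_tr = vectorize([sequences[i] for i in tr], basis)
        X_te = vectorize([sequences[i] for i in te], basis)

        # Per-kmer "probability" = in-family frequency in the train split only.
        p = X_tr[labels[tr] == 1].mean(axis=0)
        s_tr = (X_tr * p).sum(axis=1, keepdims=True)         # score train AND test sets
        s_te = (X_te * p).sum(axis=1, keepdims=True)

        clf = LogisticRegression(max_iter=1000).fit(s_tr, labels[tr])
        aucs.append(roc_auc_score(labels[te], clf.predict_proba(s_te)[:, 1]))
    return float(np.mean(aucs))
```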

Unfortunately this will be somewhat complicated given how Snekmer is organized, since it crosses rules. It also leaves the question of how the final 'model' is determined. It probably makes sense for that to be the basis set, probabilities, and regression fit on the entire family, without regard to the cross-validation splits.

christinehc commented 2 years ago

> I think I know what the issue is. The basis set choice is made on the entire input family, and the CV is done after that. This means the positive (in-family) examples are more likely to contain at least some of these kmers while the other examples may contain none, which makes it easy to pick out the family examples since their probability scores will be non-zero. This will be more of a problem with smaller families, and it is actually accentuated by randomly choosing family members (I think). I'm still working through this and will have updates soon. The solution is to do a cross-validation where the basis set is chosen from the training split each time. It's probably more complicated to set up in sklearn than how it is now (which is really slick).

We did discuss this at a past update meeting, but at the time we didn't appreciate how strongly the basis set selection could impact the CV results. Perhaps it makes sense to create a basis set using a pseudo-CV approach:

  1. Randomly split the sequences into k folds
  2. Generate kmer feature vectors over the entire kmer space (determined by k and the alphabet) for each individual fold
  3. Combine the per-fold vectors additively to build the training data for each fold combination used in CV
  4. Generate separate probability scores for each fold combination

However, this suffers the major downside of relying on the entirety of kmer space, which may be quite large depending on the k and alphabet selected.

Another idea is to start from the total kmer space for all sequences and keep only the kmers that appear >0 times across all sequences. Then, generate kmer vectors individually per fold (as described above) and combine the vectors additively as well.
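A rough sketch of that bookkeeping (illustrative only; the names and the matrix layout are made up for this example):

```python
# Illustrative only: a shared kmer index built from kmers observed anywhere in
# the input (>0 occurrences), with per-fold count matrices over that shared
# index so that training data for any fold combination can be stacked additively.
import numpy as np

def observed_kmer_index(all_seqs, k):
    """Column index over every kmer that appears at least once in the input."""
    observed = {s[i:i + k] for s in all_seqs for i in range(len(s) - k + 1)}
    return {km: i for i, km in enumerate(sorted(observed))}

def fold_count_matrix(fold_seqs, kmer_index, k):
    """(n_sequences_in_fold, n_kmers) count matrix over the shared index."""
    X = np.zeros((len(fold_seqs), len(kmer_index)))
    for r, s in enumerate(fold_seqs):
        for i in range(len(s) - k + 1):
            X[r, kmer_index[s[i:i + k]]] += 1
    return X

def combine_folds(fold_matrices, train_fold_ids):
    """Training matrix for a fold combination = vertical stack of those folds'
    matrices, each computed once and reused across CV combinations."""
    return np.vstack([fold_matrices[i] for i in train_fold_ids])
```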

We can discuss more at today's meeting.

biodataganache commented 2 years ago

Based on the discussion on 1/3/2022, we decided on the following course of action: @christinehc will update the code so that, within each cross-validation fold, the kmers are filtered to include only those present in the training split, and the probabilities are calculated from the training split only, before evaluation on the test split. This will be minimally disruptive to the paper.
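For reference, a minimal sketch of the per-fold filtering described above (a simplification for illustration, not the actual change):

```python
# Within each CV fold: restrict the kmer basis to kmers observed in the
# training split, and derive the kmer probabilities from the training split
# alone before scoring the held-out test split.
import numpy as np

def filter_fold(X_train, X_test, y_train):
    """X_train/X_test: (n_sequences, n_kmers) count matrices over the full basis;
    y_train: binary in-family labels for the training sequences."""
    y_train = np.asarray(y_train)
    seen_in_train = X_train.sum(axis=0) > 0                  # kmers present in the training split
    X_train, X_test = X_train[:, seen_in_train], X_test[:, seen_in_train]

    kmer_prob = (X_train[y_train == 1] > 0).mean(axis=0)     # estimated from training only
    return X_train, X_test, kmer_prob
```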

christinehc commented 2 years ago

Cross-validation changes in b5966cf, implemented as discussed above, have fixed the issue. New results:

one.csv, three.csv, two.csv