Sum02dean / STRINGSCORE

Identity analysis #42

Open Sum02dean opened 2 years ago

Sum02dean commented 2 years ago

We would like to check whether the model is simply learning the identity of the proteins rather than properties of the pair. This could be the case if a given protein "A" produces the same feature vector regardless of which protein it is paired with.

If the outcome variable for those pairs is always of the same class, or heavily skewed towards one class, the model may simply default to that class during inference whenever protein "A" is present in the pair.


Grab all pairs in which "A" occurs <-- group A

Grab all pairs which are unique <-- group B

Perform an ANOVA test.

Null hypothesis: There is no difference between the variances of the two groups.

Alternative hypothesis: There is a difference between the variances of the two groups.
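A minimal sketch of the grouping and the test, using only the standard library. It assumes each pair has been reduced to a single scalar feature, and it reads "unique" pairs as the pairs that do not contain "A"; both are assumptions, not something stated above. The names `split_groups`, `one_way_anova_f` and the toy data are hypothetical. Note that a one-way ANOVA compares group means through their variance decomposition; if the variances themselves are the quantity of interest, a test such as Levene's (`scipy.stats.levene`) may be the better fit.

```python
from statistics import mean

def split_groups(pairs, target):
    """Group A: feature values of pairs containing `target`.
    Group B: feature values of all other pairs (assumed reading of 'unique')."""
    group_a = [v for p, q, v in pairs if target in (p, q)]
    group_b = [v for p, q, v in pairs if target not in (p, q)]
    return group_a, group_b

def one_way_anova_f(*groups):
    """F statistic of a one-way ANOVA: between-group mean square
    divided by within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = 0.0
    for g in groups:
        m = mean(g)
        ss_within += sum((v - m) ** 2 for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical toy data: (protein_1, protein_2, scalar feature summary)
pairs = [
    ("A", "B", 1.0), ("A", "C", 2.0), ("A", "D", 3.0),
    ("B", "C", 4.0), ("B", "D", 5.0), ("C", "D", 6.0),
]
group_a, group_b = split_groups(pairs, "A")
f_stat = one_way_anova_f(group_a, group_b)
```

To get a p-value as well, `scipy.stats.f_oneway(group_a, group_b)` computes the same F statistic and returns it together with the p-value.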

Sum02dean commented 2 years ago

It may not be possible to run an extensive test like this. The idea is to check whether a given protein, across all the pairs it appears in, gives a characteristic fingerprint in the channel data. I would compute this for all proteins. It could tell us whether there is a signal the algorithm can learn that lets it default its predictions based on “identity”.
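One rough way to put a number on such a fingerprint is to compare how much a feature varies across the pairs of one protein with how much it varies across all pairs. The sketch below assumes the channel data has been reduced to one scalar per pair, which is a simplification; `fingerprint_ratio` and the toy values are hypothetical. A ratio near 0 would mean the protein yields almost the same feature whatever its partner is.

```python
from collections import defaultdict
from statistics import pvariance

def fingerprint_ratio(pairs):
    """Per protein: variance of the feature over its own pairs,
    divided by the variance over all pairs.
    Assumes the overall variance is non-zero."""
    overall = pvariance([v for _, _, v in pairs])
    by_protein = defaultdict(list)
    for p, q, v in pairs:
        # Each pair contributes its value to both member proteins.
        by_protein[p].append(v)
        by_protein[q].append(v)
    return {prot: pvariance(vals) / overall for prot, vals in by_protein.items()}

# Hypothetical toy data: protein "A" gives the same value with every partner.
pairs = [
    ("A", "B", 1.0), ("A", "C", 1.0), ("A", "D", 1.0),
    ("B", "C", 4.0), ("B", "D", 7.0), ("C", "D", 10.0),
]
ratios = fingerprint_ratio(pairs)
```

Sorting `ratios` in ascending order would list the proteins most likely to carry an identity signal first.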

At the least, I can check the distribution of class labels in “group A” and quantify the skew of the outcome. E.g. for all pairs containing a given protein, how often is the class label positive vs. negative?
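The label check could look something like this. It is a sketch that assumes binary labels encoded as 1 (positive) and 0 (negative); `positive_fraction_per_protein` and the example pairs are hypothetical.

```python
from collections import Counter, defaultdict

def positive_fraction_per_protein(labelled_pairs):
    """For every protein, the fraction of its pairs with a positive label (1)."""
    counts = defaultdict(Counter)
    for p, q, label in labelled_pairs:
        # Count the pair's label once for each of its two proteins.
        counts[p][label] += 1
        counts[q][label] += 1
    return {prot: c[1] / sum(c.values()) for prot, c in counts.items()}

# Hypothetical toy data: (protein_1, protein_2, class label)
labelled_pairs = [
    ("A", "B", 1), ("A", "C", 1), ("A", "D", 1),
    ("B", "C", 0), ("B", "D", 1), ("C", "D", 0),
]
fractions = positive_fraction_per_protein(labelled_pairs)
```

Proteins whose fraction sits close to 0 or 1 are the ones for which defaulting to a single class would be rewarded.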

If we find a correlation among the group-A pairs that the ANOVA test shows to be statistically different from group B, together with a high predictive accuracy for the pairs that contain the group-A identity protein, it would increase suspicion of the identity problem.
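The accuracy side of this comparison can be sketched as follows, assuming the model's predictions are already available as labels next to the ground truth. `accuracy_by_group` and the records are hypothetical placeholders, not output from the actual model.

```python
def accuracy_by_group(records, target):
    """Accuracy on pairs that contain `target` versus all other pairs.
    records: (protein_1, protein_2, true_label, predicted_label)."""
    hits = {"with_target": [], "without_target": []}
    for p, q, y_true, y_pred in records:
        key = "with_target" if target in (p, q) else "without_target"
        hits[key].append(y_true == y_pred)
    # Skip a group that ended up empty to avoid dividing by zero.
    return {k: sum(v) / len(v) for k, v in hits.items() if v}

# Hypothetical toy data
records = [
    ("A", "B", 1, 1), ("A", "C", 1, 1), ("A", "D", 1, 1),
    ("B", "C", 0, 1), ("B", "D", 1, 1), ("C", "D", 0, 0),
]
acc = accuracy_by_group(records, "A")
```

A much higher accuracy in `with_target` than in `without_target`, combined with a skewed label distribution for the same protein, would point in the direction described above.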