This should be pretty straightforward: test which images are multivariate outliers. I have a working example that computes a Mahalanobis distance for each image; we just need to decide how to threshold it.
import numpy as np


def mahalanobis_distance_outliers(data):
    '''Calculate the Mahalanobis distance of each datapoint from the sample mean

    Args:
        data: (np.array/pd.DataFrame) observations by features
    Returns:
        distance: (np.array) one distance per observation
    '''
    data = np.array(data)
    data_mean = np.mean(data, axis=0)
    data_demeaned = data - data_mean
    # Inverse of the feature-by-feature covariance matrix
    VI = np.linalg.inv(np.cov(data_demeaned.T))
    d = []
    for x in range(data.shape[0]):
        # Deviation of observation x from the mean
        delta = data_demeaned[x, :]
        d.append(np.sqrt(np.dot(np.dot(delta, VI), delta)))
    return np.array(d)
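
One possible way to threshold, offered as a sketch rather than a decision: if the features are roughly multivariate normal, the squared Mahalanobis distance approximately follows a chi-square distribution with as many degrees of freedom as there are features, so a cutoff can be taken from a chi-square quantile. The helper name `flag_mahalanobis_outliers`, the default `alpha=0.001`, and the synthetic data are all assumptions for illustration, not something we have agreed on.

```python
import numpy as np
from scipy.stats import chi2


def flag_mahalanobis_outliers(distances, n_features, alpha=0.001):
    """Flag distances above a chi-square based cutoff.

    Hypothetical helper. Assumes the features are roughly multivariate
    normal; alpha is an assumed default, not an agreed value.
    """
    distances = np.asarray(distances, dtype=float)
    # Squared distances are compared to a chi-square quantile, so the
    # cutoff on the (unsquared) distance is its square root.
    cutoff = np.sqrt(chi2.ppf(1 - alpha, df=n_features))
    return distances > cutoff, cutoff


# Synthetic example: 200 observations, 3 features, one planted outlier
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
data[0] = [10.0, 10.0, 10.0]

# Same distance computation as the function above, in vectorized form
delta = data - data.mean(axis=0)
VI = np.linalg.inv(np.cov(delta.T))
distances = np.sqrt(np.einsum('ij,jk,ik->i', delta, VI, delta))

is_outlier, cutoff = flag_mahalanobis_outliers(distances, n_features=data.shape[1])
print(cutoff, np.flatnonzero(is_outlier))
```

Two caveats with this approach: the mean and covariance are estimated from the same data, so strong outliers inflate the covariance and can partly mask themselves, and the chi-square approximation gets worse when the number of images is small relative to the number of features.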