BERENZ commented 3 years ago

Summary

I would like to verify the performance of Faiss for Predictive Mean Matching / Imputation (see https://stefvanbuuren.name/fimd/sec-pmm.html).

Basicaly, I am opperating on the 1-D vectors that are used for indexing and query (i.e. imputation of missing data). However, I noticed that a slight change in the function that generates missing data leads to very different results if I use faiss with FlatL2 index without voronoi partitioning (so brute force). There are two functions that generate probability of non-missing data from logistic regression defined in the codes at the end of this issue (and in notebook on colab) and compied below (column rho1 and rho2). The difference is that in the one function I have - hh.income_scaled and in the second + hh.income_scaled.

hh["rho1"] = np.exp(1 - hh.income_scaled - hh.d36_scaled - (hh.locality==1) - (hh.noper == 1)) / (1 + np.exp(1 - hh.income_scaled - hh.d36_scaled - (hh.locality==1) - (hh.noper == 1)))
hh["rho2"] = np.exp(1 + hh.income_scaled - hh.d36_scaled - (hh.locality==1) - (hh.noper == 1)) / (1 + np.exp(1 + hh.income_scaled - hh.d36_scaled - (hh.locality==1) - (hh.noper == 1)))

Then, I sample data from Bernoulli distribution based on rho1 and rho2, estimate the target quantity (i.e. mean expenditure) without and with imputation. Then I calculate three performance measures: bias is the diffeerence between estimated and true value, se is standard deviation of my estimates and mse = bias^2 + var of my estimates based on complete cases and 3 different algorithms. Please se these results below

algorithm	bias_rho1	se_rho1	mse_rho1	bias_rho2	se_rho2	mse_rho2
Naive	-291.2495	4.8007	84849.2911	147.1324	4.8267	21671.2532
K-D tree	-1.7060	23.4887	554.6310	29.6011	6.5790	919.5091
Faiss Flat L2	-2.5597	23.8529	575.5144	19.6783	6.7325	432.5618
Faiss Flat L2 voronoi	-1.7653	23.5718	558.7451	29.6241	6.5713	920.7664

Do you know what is the reason for this behaviour for Faiss Flat L2? One time estimator is biased but for the second function is the significantly better. Especially that K-D tree does exact search so how it may be better? Is it connected with using 1-D vectors? Other algorithms such as K-D tree, Annoy, Flann, N2 works well and as expected.

Do you have any recommendation regarding 1-D data?

Platform

Google Colab

Running on:

[ ] CPU
[x ] GPU

Interface:

[ ] C++
[ x] Python

Reproduction instructions

Codes to reproduce this issue are in the JupyterNotebook that should be run with GPU acceleration (e.g. on colab).

https://colab.research.google.com/drive/1lhANDL68y4O4M50Zwg-5r8dp_Dq5MQ4A?usp=sharing

To run this example you should load the data that is attached to this issue. households.xlsx

The full code is below

!apt install libomp-dev
!pip install faiss-gpu

## standard modules
import pandas as pd
import numpy as np

## linear regression
from sklearn.linear_model import LinearRegression

## ann modules
import faiss
from scipy.spatial import cKDTree

## model matrix
from patsy import dmatrices

def kdtree_impute(y_pred, y_pred_miss, y, eps = 0):
  tree = cKDTree(y_pred, leafsize = 100, balanced_tree=True)
  dists, indx = tree.query(y_pred_miss, k = 1, eps = eps)
  res = (np.sum(y) + np.sum(y[indx])) / (len(y_pred) + len(y_pred_miss))
  return res

def faiss_impute(y_pred, y_pred_miss, y, gpu = True, voronoi = False):
  index_flat = faiss.IndexFlatL2(1)

  if voronoi:
    index_flat = faiss.IndexIVFFlat(index_flat, 1, 1000)

  if gpu:
    gpu_faiss = faiss.StandardGpuResources() 
    index_flat = faiss.index_cpu_to_gpu(gpu_faiss, 0, index_flat)

  if voronoi:
    index_flat.train(y_pred)

  index_flat.add(y_pred)
  dists, indx = index_flat.search(y_pred_miss, k = 1) 
  res = (np.sum(y) + np.sum(y[indx.flatten()])) / (len(y_pred) + len(y_pred_miss))
  return res

hh = pd.read_excel("households.xlsx")
hh = hh[hh.income > 0]
hh["income_scaled"] = ((hh.income) - np.mean(hh.income))/np.std(hh.income)
hh["expenditure_scaled"] = ((hh.expenditure) - np.mean(hh.expenditure))/np.std(hh.expenditure)
hh["d36_scaled"] = ((hh.d36) - np.mean(hh.d36))/np.std(hh.d36)
hh["bio"] = hh["bio"].astype("category")
hh["d61"] = hh["d61"].astype("category")
hh["locality"] = hh["class"].astype("category")
hh["noper"] = np.round(hh["noper"])
hh["noper"] = np.where(hh["noper"] >= 5, 5, hh["noper"])
hh["noper"] = hh["noper"].astype("category")
hh["rho1"] = np.exp(1 - hh.income_scaled - hh.d36_scaled - (hh.locality==1) - (hh.noper == 1)) / (1 + np.exp(1 - hh.income_scaled - hh.d36_scaled - (hh.locality==1) - (hh.noper == 1)))
#hh["rho1"] = np.exp(1 - hh.income_scaled) / (1 + np.exp(1 - hh.income_scaled))
hh["rho2"] = np.exp(1 + hh.income_scaled - hh.d36_scaled - (hh.locality==1) - (hh.noper == 1)) / (1 + np.exp(1 + hh.income_scaled - hh.d36_scaled - (hh.locality==1) - (hh.noper == 1)))
#hh["rho2"] = np.exp(1 + hh.income_scaled) / (1 + np.exp(1 + hh.income_scaled))
y, X = dmatrices('expenditure ~ income_scaled + d36_scaled + locality + noper', data=hh, return_type='dataframe')
#y, X = dmatrices('expenditure ~ income_scaled', data=hh, return_type='dataframe')
X.head()

R = 500

sim_ann_rho1 = np.zeros(shape = (R, 4))
sim_ann_rho2 = np.zeros(shape = (R, 4))

for r in range(R):

  if (r % 100 == 0):
    print(r)

  np.random.seed(r)
  response_flag = np.random.binomial(n=1, p = hh.rho1, size = len(hh.rho1))
  y_obs = np.array(y[response_flag == 1]).astype('float32')
  X_obs = X[response_flag == 1].astype('float32')
  X_miss = X[response_flag != 1].astype('float32')
  m1_reg_y1 = LinearRegression().fit(X_obs, y_obs)
  m1_resp_y1_predict = m1_reg_y1.predict(X_obs).reshape(-1,1).astype('float32')
  m1_noresp_y1_predict = m1_reg_y1.predict(X_miss).reshape(-1,1).astype('float32')

  sim_ann_rho1[r, 0] = np.mean(y_obs)
  sim_ann_rho1[r, 1] = kdtree_impute(m1_resp_y1_predict, m1_noresp_y1_predict, y_obs, eps = 0)
  sim_ann_rho1[r, 2] = faiss_impute(m1_resp_y1_predict, m1_noresp_y1_predict, y_obs)
  sim_ann_rho1[r, 3] = faiss_impute(m1_resp_y1_predict, m1_noresp_y1_predict, y_obs,voronoi = True)

  np.random.seed(r)
  response_flag = np.random.binomial(n=1, p = hh.rho2, size = len(hh.rho2))
  y_obs = np.array(y[response_flag == 1]).astype('float32')
  X_obs = X[response_flag == 1].astype('float32')
  X_miss = X[response_flag != 1].astype('float32')
  m1_reg_y1 = LinearRegression().fit(X_obs, y_obs)
  m1_resp_y1_predict = m1_reg_y1.predict(X_obs).reshape(-1,1).astype('float32')
  m1_noresp_y1_predict = m1_reg_y1.predict(X_miss).reshape(-1,1).astype('float32')

  sim_ann_rho2[r, 0] = np.mean(y_obs)
  sim_ann_rho2[r, 1] = kdtree_impute(m1_resp_y1_predict, m1_noresp_y1_predict, y_obs, eps = 0)
  sim_ann_rho2[r, 2] = faiss_impute(m1_resp_y1_predict, m1_noresp_y1_predict, y_obs)
  sim_ann_rho2[r, 3] = faiss_impute(m1_resp_y1_predict, m1_noresp_y1_predict, y_obs,voronoi = True)

results_real = pd.DataFrame(
    {
     "bias_rho1" : np.array(np.mean(sim_ann_rho1, axis = 0) - np.mean(hh.expenditure)),
     "se_rho1"   : np.std(sim_ann_rho1, axis = 0),
     "mse_rho1"  : (np.mean(sim_ann_rho1, axis = 0) - np.mean(hh.expenditure))**2 + np.var(sim_ann_rho1, axis = 0),
     "bias_rho2" : np.array(np.mean(sim_ann_rho2, axis = 0) - np.mean(hh.expenditure)),
     "se_rho2"   : np.std(sim_ann_rho2, axis = 0),
     "mse_rho2"  : (np.mean(sim_ann_rho2, axis = 0) - np.mean(hh.expenditure))**2 + np.var(sim_ann_rho2, axis = 0),

    }, 
    index   = ["Naive", "K-D tree", "Faiss", "Faiss v"], 
    columns = ['bias_rho1', 'se_rho1', 'mse_rho1', 'bias_rho2', 'se_rho2', 'mse_rho2']
    )

np.round(results_real, 4)

wickedfoo commented 3 years ago

Faiss is definitely not designed for low-dimensional (dims < 20, say) data, especially not scalar 1D data. The overhead from the way that Faiss performs its computations would be extreme on scalar data (i.e., it could probably compute neighbors for 32-dim vectors in the same time that it would take to compute scalar distances).

The distances returned are also squared distances, could that be part of your issue?

BERENZ commented 3 years ago

But actually, what do you mean by "not designed for low-dimensional (dims < 20, say) data, especially not scalar 1D data" in, say, mathematical terms? In the meantime, I noticed that the distance function is defined differently (https://gist.github.com/mdouze/efc94c57e2302469287b9d1a2501d277). I would like to understand the reason why it behaves in such a way.

The distance metric should not be as problematic as it does not change between rho1 and rho2 so I would expect that it biased in the same direction.

Edit: The above statement about distance is actually confirmed by using voronoi where with the results are similar.

wickedfoo commented 3 years ago

"Not designed for" means "not optimized for", but the mathematics should be the same as the other exact distance implementations (it tries to compute the same thing), but the order of operations in floating point may differ which could cause differences (fp is not associative, etc), and squared distances are reported for L2 instead of sqrt(sum_i (x_i - y_i)^2).

What is your naive implementation? IndexFlat is basically a naive though optimized implementation though with a different order of floating point operations (it uses optimized GEMM kernels to perform the dirty work). I would expect the two to be fairly similar.

facebookresearch / faiss

Strange behaviour of FAISS FlatL2 Index with 1-D data #1709

Summary

Platform

Reproduction instructions