adriennekline / psmpy

propensity score matching in python

Running time KNN #11

Open stelviopas opened 1 year ago

stelviopas commented 1 year ago

Hi, thank you very much for the helpful package.

I'd like to know about the implementation of the KNN. A data frame with 10k samples takes approximately 4 minutes, whereas scikit-learn's NearestNeighbors() took me less than a minute. I also have not found any information about the packages used in the paper itself.

Would it also be possible to choose a custom k other than 12 in the knn_matched_12n method?

Thanks in advance! Best, Ana

adriennekline commented 1 year ago

Hi Ana,

The KNN in my program uses

from sklearn.neighbors import NearestNeighbors

The KNN search itself runs very fast. What is slow is looping through all the matches, logging what has already been matched, and re-indexing appropriately.
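To illustrate the split described above, here is a minimal sketch (not psmpy's actual code) of 1:1 matching without replacement. The function name `greedy_match_without_replacement` and the parameter `k` are hypothetical. It assumes matching on a single score column. The `NearestNeighbors` query is issued once for all treated units, and the Python loop that follows is the bookkeeping part that tends to dominate the running time.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def greedy_match_without_replacement(treated_scores, control_scores, k=10):
    """Hypothetical helper: 1:1 matching without replacement on one score."""
    treated = np.asarray(treated_scores, dtype=float).reshape(-1, 1)
    control = np.asarray(control_scores, dtype=float).reshape(-1, 1)
    k = min(k, len(control))

    # One batched query for all treated units: this step is fast.
    nn = NearestNeighbors(n_neighbors=k).fit(control)
    _, candidates = nn.kneighbors(treated)  # neighbors sorted by distance

    used = set()   # control indices that are already taken
    matches = {}   # treated index -> control index

    # Bookkeeping loop: this is where the time goes in pure Python.
    for t_idx, cand in enumerate(candidates):
        for c_idx in cand:
            c_idx = int(c_idx)
            if c_idx not in used:
                used.add(c_idx)
                matches[t_idx] = c_idx
                break
    # Treated units whose k candidates are all taken stay unmatched.
    return matches
```

Using a set for the already-matched controls keeps each membership check cheap; repeatedly filtering or re-indexing a DataFrame inside the loop would be much slower.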

You should be able to pick any number for n in the knn_matched_12n method.

Nadii commented 1 year ago

Hi, thanks for the great package! I run into running-time issues when the sample size grows to about 30k: the KNN takes up to 30 minutes. When I increase the sample sizes to ~25k for treatment vs. ~40k for control, the 1:1 KNN fails to run, or psm.knn_matched(matcher='propensity_logit', replacement=False, caliper=caliper, drop_unmatched=True) takes hours or even crashes. I'm not sure what to modify to resolve the running-time issue. Any hint would be appreciated. Thanks, Nadi
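One possible workaround, sketched under assumptions and not taken from psmpy: when matching on a single scalar such as the propensity logit, greedy 1:1 matching without replacement and with a caliper can be done outside the package on a sorted list of control scores. The function name `match_1d_with_caliper` is hypothetical, and the result depends on the order of the treated units, so it will not necessarily reproduce the output of psm.knn_matched.

```python
from bisect import bisect_left


def match_1d_with_caliper(treated_scores, control_scores, caliper):
    """Hypothetical helper: greedy 1:1 matching without replacement.

    Returns {treated_index: control_index}. Treated units with no unused
    control within the caliper are left out of the result.
    """
    # Sort the controls once and remember their original positions.
    pool = sorted((s, i) for i, s in enumerate(control_scores))
    scores = [s for s, _ in pool]
    ids = [i for _, i in pool]

    matches = {}
    for t_idx, t in enumerate(treated_scores):
        if not scores:
            break  # no controls left
        pos = bisect_left(scores, t)
        # The nearest remaining control sits just left or right of pos.
        best = None
        for j in (pos - 1, pos):
            if 0 <= j < len(scores):
                d = abs(scores[j] - t)
                if best is None or d < best[0]:
                    best = (d, j)
        if best is not None and best[0] <= caliper:
            j = best[1]
            matches[t_idx] = ids[j]
            # Remove the matched control so it cannot be reused.
            del scores[j]
            del ids[j]
    return matches
```

Because the remaining controls stay sorted after each removal, every lookup is a binary search plus at most two comparisons, which should stay manageable for tens of thousands of rows.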