ScionKim / FaissImputer

Missing data imputation using Faiss for enhanced data quality.
MIT License

Benchmark vs sklearn function #2

Open Zethson opened 9 months ago

Zethson commented 9 months ago

Did you ever benchmark this implementation against sklearn's KNN imputation?

ScionKim commented 9 months ago

Thank you for your question about benchmarking FaissImputer against sklearn's KNN imputation. Here are my thoughts:

  1. Operational/Time Costs with Large Datasets: In my experience, FaissImputer saves significant compute and runtime, and the savings are most noticeable on larger datasets. Scalability and efficiency are its key advantage over traditional methods.
  2. Performance on Train/Test Sets: Imputation quality varies with the nature of the data. Since this is the initial release, I know there is room for substantial improvement here, and I'm working on optimizing it.

Additionally, I recognize that user customization is essential. I plan to let users fine-tune more options than just the choice of algorithm, so the imputer can be tailored to specific needs and data characteristics.

This initial release is just a stepping stone, and I look forward to evolving the tool with user feedback and ongoing research.

Zethson commented 9 months ago

Awesome!

  1. Do you have numbers that you can share for the runtime improvements?
  2. The performance shouldn't really be that different, right? I'd expect the KNN graphs to be rather similar, especially for small datasets.
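
For reference, here is a minimal sketch of how numbers for both points could be collected. It hides known values at random, times `fit_transform`, and reports RMSE on the hidden entries. Only sklearn's `KNNImputer` is exercised here. The helper names (`make_correlated_data`, `mask_at_random`, `benchmark_imputer`) and the synthetic data are assumptions made up for illustration, not part of FaissImputer or sklearn.

```python
import time

import numpy as np
from sklearn.impute import KNNImputer


def make_correlated_data(n_rows, n_cols, seed=0):
    """Hypothetical synthetic data: a few latent factors plus noise,
    so that neighboring rows carry information about missing values."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_rows, 3))
    weights = rng.normal(size=(3, n_cols))
    return latent @ weights + 0.1 * rng.normal(size=(n_rows, n_cols))


def mask_at_random(X, missing_rate, rng):
    """Return a copy of X with a random fraction of entries set to NaN,
    together with the boolean mask of hidden entries."""
    mask = rng.random(X.shape) < missing_rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def benchmark_imputer(imputer, X_true, missing_rate=0.1, seed=0):
    """Time imputer.fit_transform and score it on the hidden entries.

    Assumes the imputer follows the scikit-learn fit_transform convention
    and returns an array with the same shape as its input.
    """
    rng = np.random.default_rng(seed)
    X_missing, mask = mask_at_random(X_true, missing_rate, rng)

    start = time.perf_counter()
    X_imputed = imputer.fit_transform(X_missing)
    elapsed = time.perf_counter() - start

    # RMSE only over the entries that were hidden from the imputer.
    rmse = float(np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2)))
    return {"seconds": elapsed, "rmse": rmse}


if __name__ == "__main__":
    for n_rows in (500, 2000):
        X = make_correlated_data(n_rows, n_cols=8, seed=42)
        result = benchmark_imputer(KNNImputer(n_neighbors=5), X)
        print(n_rows, result)
```

If FaissImputer exposes the same `fit_transform` interface (an assumption I have not checked against the repo), it could be passed to `benchmark_imputer` in place of `KNNImputer`, with the same seed so both see identical missing entries.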