Desbordante / desbordante-core

Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also supports running data cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.
GNU Affero General Public License v3.0

Refactor TANE-based algorithms #378

Closed iliya-b closed 1 month ago

iliya-b commented 7 months ago

Generalize Tane and PFDTane, add additional tests.

To check whether the refactoring caused any performance loss, the following experiments were performed. The discovery task was run as cli.py --task=afd --algo=tane --error=0.05 --table=... with both the new and the original TANE implementations. The following heavy datasets were used: EpicMeds.csv, adult.csv, EpicVitals.csv.
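A timing harness for this kind of comparison could look like the sketch below. It is an assumption about how such measurements might be collected, not the script that produced the numbers in this PR. `time_command` and `tane_command` are hypothetical helpers, and the table path is left for the caller to supply.

```python
import subprocess
import sys
import time


def time_command(cmd, iterations=10):
    """Run cmd `iterations` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def tane_command(table_path):
    """Build the CLI invocation quoted above; the table path comes from the caller."""
    return [
        sys.executable, "cli.py",
        "--task=afd", "--algo=tane", "--error=0.05",
        f"--table={table_path}",
    ]
```

Running `time_command(tane_command(path))` once per build (old and new) would yield two lists of ten durations that can then be summarized.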

The following list shows the measured running times of the old and new implementations (mean with a 95% confidence interval, over 10 iterations):

  1. EpicMeds.csv (old) 59.715925465099986 +- 0.1869874511220996
  2. EpicMeds.csv (new) 59.5840122977 +- 0.06763601341304505
  3. adult.csv (old) 24.654166058699996 +- 0.06323832294394492
  4. adult.csv (new) 24.76226707977778 +- 0.09297212157319155
  5. EpicVitals.csv (old) 10.6707755998 +- 0.11612311140862534
  6. EpicVitals.csv (new) 10.7569084586 +- 0.0103879548810794
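One way to read these numbers is to check whether the old and new intervals overlap for each dataset. The sketch below assumes the value after "+-" is the half-width of a Student's t interval around the mean; the PR does not say how the intervals were computed, so `mean_and_half_width` is an illustrative assumption, and all function names are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student's t critical value for 9 degrees of freedom (10 samples).
T_CRIT_95_DF9 = 2.262


def mean_and_half_width(samples, t_crit=T_CRIT_95_DF9):
    """Return (mean, half-width) of a t-based confidence interval (assumed method)."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, t_crit * sem


def intervals_overlap(mean_a, hw_a, mean_b, hw_b):
    """True if [mean_a - hw_a, mean_a + hw_a] intersects [mean_b - hw_b, mean_b + hw_b]."""
    return abs(mean_a - mean_b) <= hw_a + hw_b


# Values copied from the list above: dataset -> ((old mean, old +-), (new mean, new +-)).
reported = {
    "EpicMeds.csv": ((59.715925465099986, 0.1869874511220996),
                     (59.5840122977, 0.06763601341304505)),
    "adult.csv": ((24.654166058699996, 0.06323832294394492),
                  (24.76226707977778, 0.09297212157319155)),
    "EpicVitals.csv": ((10.6707755998, 0.11612311140862534),
                       (10.7569084586, 0.0103879548810794)),
}

overlaps = {name: intervals_overlap(*old, *new) for name, (old, new) in reported.items()}
```

For the reported values, the old and new intervals overlap on all three datasets. Overlap alone does not prove the running times are equal, but it is consistent with the refactoring causing no measurable slowdown.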
iliya-b commented 6 months ago

@vs9h I've fixed the architectural issues with the Tane and PFDTane algorithms, as you suggested in PR #300.

iliya-b commented 1 month ago

@vs9h I've fixed the issues with this PR. You mentioned another PR, #396, but that PR is still a draft; it introduces a few performance enhancements to the algorithm and does not affect the architecture. The current PR blocks some other PRs, which is why I've kept only the changes related to this PR (the refactoring) for now. What do you think?

vs9h commented 1 month ago

Also, split the commits into at least two (tests in a separate commit).

iliya-b commented 1 month ago

@vs9h I've fixed these issues.