BigDaMa / abstraction-layer

Apache License 2.0

Feature request #1

Closed FelixNeutatz closed 5 years ago

FelixNeutatz commented 6 years ago

Hi,

I really like the idea of this project, and I also have a couple of ideas for how to extend this library.

While working on my thesis, I realized that calculating the F1-score / precision / recall given the ground truth is not always trivial, and it would be great if you could add this function to the library.

So, the idea is that instead of only providing the library with the dirty dataset, we also provide it with the clean ground truth. If ground truth is available, the library would return all metrics that have been requested, such as F1-score / precision / recall ...
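A minimal sketch of what such an evaluation could look like for error detection. It assumes the dirty and clean datasets are aligned tables of equal shape and that a tool reports detected errors as `(row, column)` cell positions. The names `actual_errors` and `detection_metrics` are hypothetical and are not part of this library's API.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # (row index, column index); an assumed representation


def actual_errors(dirty: List[List[str]], clean: List[List[str]]) -> Set[Cell]:
    """Return the cells whose dirty value differs from the ground truth."""
    return {
        (i, j)
        for i, (dirty_row, clean_row) in enumerate(zip(dirty, clean))
        for j, (dirty_value, clean_value) in enumerate(zip(dirty_row, clean_row))
        if dirty_value != clean_value
    }


def detection_metrics(detected: Set[Cell], actual: Set[Cell]) -> Dict[str, float]:
    """Compute precision, recall, and F1 for a set of detected error cells."""
    true_positives = len(detected & actual)
    # Guard against empty sets so that no division by zero occurs.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: two cells are actually wrong, (1, 0) and (2, 1).
dirty = [["Berlin", "10115"], ["Munchen", "80331"], ["Hamburg", "2009x"]]
clean = [["Berlin", "10115"], ["München", "80331"], ["Hamburg", "20095"]]

# Hypothetical tool output: one correct detection and one false alarm.
detected = {(1, 0), (0, 1)}

metrics = detection_metrics(detected, actual_errors(dirty, clean))
print(metrics)
```

Comparing cell positions rather than whole rows keeps the metrics meaningful when a row contains several independent errors.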

Another idea I had is to consider whether we could provide some process for hyperparameter optimization for quantitative methods, such as dBoost.
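One simple form of this is a grid search that scores each parameter value against the ground truth. The sketch below is only an illustration: `detect_outliers` is a toy z-score detector standing in for a quantitative tool, not dBoost and not its API, and `grid_search` and the threshold values are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Set, Tuple


def detect_outliers(values: List[float], threshold: float) -> Set[int]:
    """Toy stand-in detector: flag values more than `threshold`
    population standard deviations away from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()
    return {i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold}


def f1_score(detected: Set[int], actual: Set[int]) -> float:
    """F1 of the detected indices against the ground-truth error indices."""
    true_positives = len(detected & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def grid_search(
    values: List[float], actual: Set[int], thresholds: List[float]
) -> Tuple[float, float]:
    """Try every threshold and return (best threshold, its F1).
    Ties are resolved in favor of the earliest threshold in the list."""
    scored = [(f1_score(detect_outliers(values, t), actual), t) for t in thresholds]
    best_f1, best_threshold = max(scored, key=lambda pair: pair[0])
    return best_threshold, best_f1


# Toy column with one known erroneous value at index 5.
values = [10.0, 11.0, 9.0, 10.0, 12.0, 95.0, 10.0, 11.0]
actual = {5}

best_threshold, best_f1 = grid_search(values, actual, [0.5, 1.0, 2.0, 3.0])
print(best_threshold, best_f1)
```

Tuning against the same ground truth that is later used for evaluation overstates performance, so a real setup would probably tune on a held-out sample.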

It would also be great if there were a short tutorial on how to call the library's Python API for each of the supported tools. For example, I do not fully understand how the OpenRefine tool is implemented.

Again, thanks for this great project.

Best regards, Felix

m-mahdavi commented 6 years ago

Hi Felix,

Thanks for your good comments.

  1. I see your point about adding other useful services, such as evaluating a data cleaning job by precision, recall, and F1. Eventually, we will probably develop a comprehensive framework that provides different data cleaning services. This module is just the beginning.
  2. Regarding hyperparameter optimization: again, as a final goal, our framework should be able to recommend the best tools and parameters for each dataset. You can consider that a kind of hyperparameter optimization.
  3. You are right about providing more tutorials for running the different tools. The documentation should be more complete.
  4. Thank you for sending your sample code for calling OpenRefine. I did not know that OpenRefine has a Python interface. We will use that.

Again, thank you for your feedback.

Kind regards, Mohammad

FelixNeutatz commented 6 years ago

Sounds great :)

I had another idea: it would also be great to provide optimal parameter configurations for well-known datasets, such as Hospital, Flights, ... for each supported tool.

E.g. the corresponding functional dependencies for Nadeef for Hospital, the best performing parameters for each dBoost algorithm ...

m-mahdavi commented 5 years ago

As we discussed, the calculation of effectiveness measures has been added.