SteffanoP / cbdgen

An Evolutionary Scalable Framework for Synthetic Data Generation based in Data Complexity.
MIT License

[Feature] Use `pymfe` for Complexity Data Extraction #42

Closed SteffanoP closed 2 years ago

SteffanoP commented 2 years ago

This PR implements a new complexity data extraction step based on pymfe. It is used to extract complexity measures during evaluation, as well as to establish global reference values (whenever we're generating synthetic data based on a problem or a real data set).

Fixes #40, Fixes #31, Fixes #30.

It also opens new possibilities for a few issues, such as #6 and #36.

Problem

Extracting complexity data is not a simple task, and performance has been noted as one of the struggles of the extraction (as reported in #40). ECoL does a good job: it extracts the vast majority of complexity measures and is one of the state-of-the-art tools for complexity data extraction, but it is not Python native. A more recent option for complexity data extraction is pymfe. By using this package we're able to extract complexity data without an R-to-Python interface, and performance improvements were observed.

Objectives

SteffanoP commented 2 years ago

Also, a bug was found related to how pymfe orders its extracted values. Although you can pass a list of the complexity measures that you want to extract, pymfe appears to return the meta-features in its own order. A way to reproduce this is to create an MFE object:

from pymfe.mfe import MFE
mfe = MFE(groups=['complexity'], features=['C2', 'L2', 'N1', 'F2'])

Note the order in which the features were passed:

  1. C2
  2. L2
  3. N1
  4. F2

But when we extract the features, the result comes back in a different order:

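mfe.fit(X, y)  # X, y: features and target of the data set, assumed already loaded
ft = mfe.extract()  # ft[0] holds the feature names, ft[1] the values
print("\n".join("{:50} {:30}".format(x, y) for x, y in zip(ft[0], ft[1])))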
>>> c2                                                                            0.0
>>> f2.mean                                                      0.002206548501187337
>>> l2.mean                                                      0.014285714285714271
>>> n1                                                            0.12380952380952381

In this case, the order has become alphabetical:

  1. C2
  2. F2
  3. L2
  4. N1

This is a bug in our implementation: we use the values in ft[1] by position to calculate the fitness values, so a value can be matched against the wrong complexity measure and the optimization may not work correctly.
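One possible way to avoid depending on the output order is to look the values up by name. The sketch below is only an illustration, not the fix adopted in this PR: `reorder_by_request` is a hypothetical helper, it assumes each feature yields a single summarized value (such as `l2.mean`), and it uses the names and values printed above as stand-ins for the result of `mfe.extract()`.

```python
def reorder_by_request(requested, names, values):
    """Return `values` rearranged to follow the order of `requested`.

    Hypothetical helper: names such as 'l2.mean' are matched on the part
    before the first dot, case-insensitively.
    """
    by_name = {}
    for name, value in zip(names, values):
        base = name.split(".")[0].lower()  # 'l2.mean' -> 'l2'
        if base in by_name:
            # Assumption: one summarized value per feature.
            raise ValueError(f"more than one value for feature {base!r}")
        by_name[base] = value
    return [by_name[feature.lower()] for feature in requested]


requested = ["C2", "L2", "N1", "F2"]
# Stand-ins for ft[0] and ft[1], copied from the output shown above.
names = ["c2", "f2.mean", "l2.mean", "n1"]
values = [0.0, 0.002206548501187337, 0.014285714285714271, 0.12380952380952381]

ordered = reorder_by_request(requested, names, values)
print(ordered)  # values now follow the C2, L2, N1, F2 order
```

With a lookup like this, the fitness calculation could index the values in the same order as the requested list instead of relying on the order pymfe returns.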

SteffanoP commented 2 years ago

Talking about:

Test performance and quality assurance for this implementation;

I have not been able to test performance and quality assurance, because I lost my main setup, which I used to test cbdgen. Since cbdgen is still a project in development, I intend to keep the implementation, with the warning that performance and quality could not be measured properly, although the few results that I got were promising.

SteffanoP commented 2 years ago

Documentation will be addressed in a future update. Progress is tracked in #38.