Neuraxio / Neuraxle

The world's cleanest AutoML library ✨ - Do hyperparameter tuning with the right pipeline abstractions to write clean deep learning production pipelines. Let your pipeline steps have hyperparameter spaces. Design steps in your pipeline like components. Compatible with Scikit-Learn, TensorFlow, and most other libraries, frameworks and MLOps environments.
https://www.neuraxle.org/
Apache License 2.0
608 stars 62 forks source link

Bug: 25% of execution time is consumed by parsing hyperparameter spaces #432

Closed guillaume-chevalier closed 3 years ago

guillaume-chevalier commented 3 years ago

Describe the bug 25% of the execution time is consumed in the RecursiveDict, especially when initializing steps.

To Reproduce Create a ColumnTransformer containing a few steps.

Expected behavior The recursive dict by nature is supposed to be an optimized data structure, but the current implementation is flawed in several ways. Completely rewriting the RecursiveDict's __getitem__ method, as well as the flattening and unflattening methods, is a must.

The Recursive Dict should be rarely flattened, but for now, it is flattened everywhere in an OCD fashion inside and outside the base step. I suggest that we never flatten it from inside the base step. I also suggest that we only flatten it at the very end of an apply method... but thinking twice, it should not even be flattened there.

Also, the __getitem__ and get should only dig in itself rather than looping, and assuming that self is nested rather than flat. E.g.:

def __getitem__(self, key):
    if key is None: 
        return self
    lkey, _, rkey = key.partition(self.separator)
    return OrderedDict.__getitem__(self, lkey)[rkey]

Also, when re-creating a RecursiveDict from an existing recursive dict or other data structure, there should be more type checking to avoid re-doing the full recursion uselessly by assuming that if the root RecursiveDict is fine, that its children are as well.

Moreover, the recursive dict can be returned without being flattened to a primitive dict, because if someone try to getitem with a key that has separators, the getitem method as described above will simply dig through self as per the return OrderedDict.__getitem__(self, lkey)[rkey] above.

Fixing that will give back to Neuraxle all its speed. This RecursiveDict class is very pure and simple: it should be rather easy to be implemented correctly so that it reaches its intended goal of being optimized. Right now this class reparses itself millions of times within the construction of a regular ML pipeline, which is hard to even picture.