Neuraxio / Neuraxle

The world's cleanest AutoML library ✨ - Do hyperparameter tuning with the right pipeline abstractions to write clean deep learning production pipelines. Let your pipeline steps have hyperparameter spaces. Design steps in your pipeline like components. Compatible with Scikit-Learn, TensorFlow, and most other libraries, frameworks and MLOps environments.
https://www.neuraxle.org/
Apache License 2.0
608 stars 62 forks source link

Recursive dict compress feature #502

Closed Rohith295 closed 2 years ago

Rohith295 commented 3 years ago

What it is

My pull request does: This change will try to compress the flat list of hyperparams, into a more structured representation. Please refer the following issue #486 for more information. I haven't added any tests for now, i wanted to make sure this implementation is reviewed.

How it works

I coded it this way: Current implementation will try to read the all_hps = pipeline.get_hyperparams() output and try to compress this particular output into more structured and readable format.

Example usage

Here is how you can use this new code as a end user:

Note: 
Please make dimensions and types clear to the reader.
E.g.: in the event fictious data is processed in this code example.
>>>all_hps = pipeline.get_hyperparams()\
>>>all_hps_shortened = all_hps.compress()
>>>print(type(all_hps_shortened))
...<class 'neuraxle.hyperparams.space.CompressedHyperparameterSamples'>
>>>pprint(all_hps_shortened)
...[
    {

        "step_name": "step1",
        "hyperparams": {'copy': True, 'iterated_power': 'auto'},
        "ancestor_steps": ["root"]
    },
    {

        "step_name": "step2",
        "hyperparams": {'copy': True, 'iterated_power': 'auto'},
        "ancestor_steps": ["root"]
    }

]
cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
Rohith295 commented 3 years ago

@guillaume-chevalier I have followed all the steps as mentioned for configuring my email address as part of signing commits with email address. It still fails.

guillaume-chevalier commented 3 years ago

Hi @Rohith295, I see that your new commits seems to be properly configured. The CLA-BOT seems to still bug because that the first commit remained unidentified and was not recommited. I know for sure however that this contribution is yours as per our discussion this weekend, and you also have signed the CLA and I can confirm that it is you as well, so the PR could still be merged. Let's ignore the CLA-BOT for now. I'll proceed to the code review. Thank you !

Rohith295 commented 3 years ago

Here is your review !

An usage example with a test would be very good. 

Thank you!

Yes, I will be adding as part of the unit tests.

pull-checklist[bot] commented 3 years ago

New contributor? Ensure you do this.

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
Rohith295 commented 3 years ago

Unit tests are in progress

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
Rohith295 commented 3 years ago

@guillaume-chevalier updated PR with tests aswell.

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
Rohith295 commented 3 years ago

Here are some review comments, mainly regarding the unit test and software design.

Also I know that @vincent-antaki didn't have the time to finish his review friday. He will come back next Monday. I think he told me quickly that the present algorithm didn't cover every cases and was flawed in the sense that it was good at computing the compressed hyperparameters for the given unit test, but that your algorithm seemed not complete and would fail for more complicated wildcard patterns (e.g.: when there would be more than two stars * characters in one compression pass).

Finally, could you make sure that the docstrings are accurate ? Could you take a real print in your console to be sure that the documentation is on point with what the code does ?

Finally, we also realize that it's complicated to add the requirement in the requirements.txt file to be compatible with both Python 3.6, 3.7 and 3.8. Could you remove the usage of the dataclass and use a simple data structure like a dict or a list?

Thank you for your contribution ! That is a nice piece of work. We're getting close to have this complicated algorithm sorted out.

@guillaume-chevalier as per the logic we only get <=2 * characters how ever complicated the pipeline may be, the current implementation is based out of our discussion in the past which is like phase1 selection and phase2 pruning. So we always end up with <=2 '*' characters in the string representation. I have tried all the possible combinations that i can think of. If possible can you please provide me with atleast one example as @vincent-antaki mentioned since it is failing at some point, it helps me understand the expectations in more better way so that i can change the implementation accordingly.

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
Rohith295 commented 3 years ago

@vincent-antaki and @guillaume-chevalier, can you please review my previous comment?

vincent-antaki commented 3 years ago

Hello Rohith!

Here are 7 test cases concerning the expected behaviour of such function. Please see test case 6 for a case with more than 2 stars.

test_case_0 = {'flat_hps': ['Pipeline__ABCD__test__copy', 
                            'Pipeline__ABCD__test1__copy', 
                            'Pipeline__parent1__test__copy', 
                            'Pipeline__parent2__test__copy', 
                            'Pipeline__parent1__hps', 
                            'Pipeline__parent1__abcd__test', 
                            'Pipeline__abcd__test', 
                            'Sklearn__done'],
              'compressed': ["*ABCD__test__copy",
                            "*test1__copy",
                            "*parent1*copy",
                            "*parent2*copy",
                            '*hps',
                            '*parent1*test',
                            'Pipeline__abcd__test',
                            "*done"]
                            }

test_case_1 = {'flat_hps': ['Pipeline__ABCD__copy', 'Pipeline2__ABCD__copy'],
             'compressed': ['Pipeline*copy', 'Pipeline2*copy']}

test_case_2 = {'flat_hps': ['Pipeline__abc__copy', 'Pipeline__abc__test'],
              'compressed': ['*copy', '*test']}

test_case_3 = {'flat_hps': ['Pipeline__copy', 'Pipeline2__copy'],
             'compressed': ['Pipeline__copy', 'Pipeline2__copy']}

test_case_4 = {'flat_hps': ['Pipeline__abc__def__copy', 'Pipeline__def__def__copy'],
              'compressed': ['*abc*copy', '*def*copy']}

test_case_5 = {'flat_hps': ['Pipeline__qwe__abc__copy',
                            'Pipeline__qwe__def__copy', 
                            'Pipeline2__abc__def__copy'],
              'compressed': ['*qwe__abc__copy', 
                             '*qwe__def__copy',
                             '*abc__def__copy']}

test_case_6 = {'flat_hps': ['Pipeline__submodel1__qwe__abc__123__copy',
                            'Pipeline__submodel1__qwe__def__123__copy', 
                            'Pipeline__submodel2__qwe__abc__def__123__copy'],
              'compressed': ['*submodel1*abc*copy', 
                             '*submodel1*def*copy',
                             '*abc__def*copy']}

Your solution comes short of the expected behaviour in some cases :

test case 1:
    expected return: ['Pipeline*copy', 'Pipeline2*copy']
    return:  ['Pipeline__ABCD__copy', 'Pipeline2__ABCD__copy']  
    problem: could be further compressed

test case 4:
    expected return: ['*abc*copy', '*def*copy']
    return: ['*abc*copy', '*def__def__copy']
    problem: could be further compressed

test case 5:
    expected return: ['*qwe__abc__copy', '*qwe__def__copy', '*abc__def__copy']
    return: ['*abc__copy', '*qwe__def__copy', '*abc__def__copy']
    This is one is actually fine. *abc__copy is not ambiguous while *abc*copy would be.

test case 6:
    expected return: ['*submodel1*abc*copy', '*submodel1*def*copy', '*abc__def*copy']
    return: ['*abc__123__copy', '*qwe__def__123__copy', '*abc__def__123__copy']
    problem : A shorter compression exists.

I believe that the algorithm might be a bit of a challenge. The way I see it, you probably have to populate a trie with all possible subsets of all hp paths which includes the last part, and then find, for each hp, the shortest one that is unique. You most likely cannot do this with the current 2 phases separation.

I think at this point a simple heuristic similar to what you propose might be good enough for our needs. In any case, the most important constraint is that the proposed solution has to be totally unambiguous w.r.t to which compressed hp name maps to which non-compressed hp name (which your solutions does). If you fix the test case 1 then I'm happy with the result.

Cheers, Vincent

Rohith295 commented 3 years ago

Hello Rohith!

Here are 7 test cases concerning the expected behaviour of such function. Please see test case 6 for a case with more than 2 stars.

test_case_0 = {'flat_hps': ['Pipeline__ABCD__test__copy', 
                            'Pipeline__ABCD__test1__copy', 
                            'Pipeline__parent1__test__copy', 
                            'Pipeline__parent2__test__copy', 
                            'Pipeline__parent1__hps', 
                            'Pipeline__parent1__abcd__test', 
                            'Pipeline__abcd__test', 
                            'Sklearn__done'],
              'compressed': ["*ABCD__test__copy",
                            "*test1__copy",
                            "*parent1*copy",
                            "*parent2*copy",
                            '*hps',
                            '*parent1*test',
                            'Pipeline__abcd__test',
                            "*done"]
                            }

test_case_1 = {'flat_hps': ['Pipeline__ABCD__copy', 'Pipeline2__ABCD__copy'],
             'compressed': ['Pipeline*copy', 'Pipeline2*copy']}

test_case_2 = {'flat_hps': ['Pipeline__abc__copy', 'Pipeline__abc__test'],
              'compressed': ['*copy', '*test']}

test_case_3 = {'flat_hps': ['Pipeline__copy', 'Pipeline2__copy'],
             'compressed': ['Pipeline__copy', 'Pipeline2__copy']}

test_case_4 = {'flat_hps': ['Pipeline__abc__def__copy', 'Pipeline__def__def__copy'],
              'compressed': ['*abc*copy', '*def*copy']}

test_case_5 = {'flat_hps': ['Pipeline__qwe__abc__copy',
                            'Pipeline__qwe__def__copy', 
                            'Pipeline2__abc__def__copy'],
              'compressed': ['*qwe__abc__copy', 
                             '*qwe__def__copy',
                             '*abc__def__copy']}

test_case_6 = {'flat_hps': ['Pipeline__submodel1__qwe__abc__123__copy',
                            'Pipeline__submodel1__qwe__def__123__copy', 
                            'Pipeline__submodel2__qwe__abc__def__123__copy'],
              'compressed': ['*submodel1*abc*copy', 
                             '*submodel1*def*copy',
                             '*abc__def*copy']}

Your solution comes short of the expected behaviour in some cases :

test case 1:
    expected return: ['Pipeline*copy', 'Pipeline2*copy']
    return:  ['Pipeline__ABCD__copy', 'Pipeline2__ABCD__copy']  
    problem: could be further compressed

test case 4:
    expected return: ['*abc*copy', '*def*copy']
    return: ['*abc*copy', '*def__def__copy']
    problem: could be further compressed

test case 5:
    expected return: ['*qwe__abc__copy', '*qwe__def__copy', '*abc__def__copy']
    return: ['*abc__copy', '*qwe__def__copy', '*abc__def__copy']
    This is one is actually fine. *abc__copy is not ambiguous while *abc*copy would be.

test case 6:
    expected return: ['*submodel1*abc*copy', '*submodel1*def*copy', '*abc__def*copy']
    return: ['*abc__123__copy', '*qwe__def__123__copy', '*abc__def__123__copy']
    problem : A shorter compression exists.

I believe that the algorithm might be a bit of a challenge. The way I see it, you probably have to populate a trie with all possible subsets of all an hp path which includes the last part, and then find, for each hp, the shortest one that is unique. You most likely cannot do this with the current 2 phases separation.

I think at this point a simple heuristics similar to what you propose might be good enough for our needs. In any case, the most important constraint is that proposed solution has to be totally unambiguous w.r.t to which compressed hp name maps to which non-compressed hp name (which your solutions does). If you fix the test case 1 then I'm happy with the result.

Cheers, Vincent

Thank you so much for the detailed explanation and the complex test cases @vincent-antaki . I will try to see how i can fix the issue accordingly.

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
Rohith295 commented 3 years ago

@vincent-antaki Hey i have changed algorithm a bit, to satisfy test case 0 and test case 1, as mentioned. Please verify and let me know if this is okay. Apologies for the delay, i was bit occupied by some other work all these days.