Rohith295 commented 3 years ago

What it is

My pull request does: This change will try to compress the flat list of hyperparams, into a more structured representation. Please refer the following issue #486 for more information. I haven't added any tests for now, i wanted to make sure this implementation is reviewed.

How it works

I coded it this way: Current implementation will try to read the all_hps = pipeline.get_hyperparams() output and try to compress this particular output into more structured and readable format.

Example usage

Here is how you can use this new code as a end user:

Note: 
Please make dimensions and types clear to the reader.
E.g.: in the event fictious data is processed in this code example.
>>>all_hps = pipeline.get_hyperparams()\
>>>all_hps_shortened = all_hps.compress()
>>>print(type(all_hps_shortened))
...<class 'neuraxle.hyperparams.space.CompressedHyperparameterSamples'>
>>>pprint(all_hps_shortened)
...[
    {

        "step_name": "step1",
        "hyperparams": {'copy': True, 'iterated_power': 'auto'},
        "ancestor_steps": ["root"]
    },
    {

        "step_name": "step2",
        "hyperparams": {'copy': True, 'iterated_power': 'auto'},
        "ancestor_steps": ["root"]
    }

]

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Rohith295 commented 3 years ago

@guillaume-chevalier I have followed all the steps as mentioned for configuring my email address as part of signing commits with email address. It still fails.

guillaume-chevalier commented 3 years ago

Hi @Rohith295, I see that your new commits seems to be properly configured. The CLA-BOT seems to still bug because that the first commit remained unidentified and was not recommited. I know for sure however that this contribution is yours as per our discussion this weekend, and you also have signed the CLA and I can confirm that it is you as well, so the PR could still be merged. Let's ignore the CLA-BOT for now. I'll proceed to the code review. Thank you !

Rohith295 commented 3 years ago

Here is your review !

An usage example with a test would be very good.

Thank you!

Yes, I will be adding as part of the unit tests.

pull-checklist[bot] commented 3 years ago

New contributor? Ensure you do this.

[ ] Your local Git username is set to your GitHub username, and your local Git email is set to your GitHub email. This is important to avoid breaking the cla-bot and for your contributions to be linked to your profile. If at least 1 contribution is not commited properly using the good credentials, the cla-bot will break until your re-commit it.
[ ] Use the PyCharm IDE with PyTest to test your code. Reformatting your code at every file save is a good idea, using PyCharm's Ctrl+Alt+L shortcut. You may reorganize imports automatically as well, as long as your project root is well configured. Run the tests to see if everything works, and always ensure that all tests run before opening a pull request as well.
[ ] We recommend letting PyCharm manage the virtual environment by creating a new one just for this project, and using PyTest as a test runner in PyCharm. This is not required, but should help in getting you started.
[ ] Please make your pull request(s) editable, such as for us to add you to the list of contributors if you didn't add the entry, for example.
[ ] To contribute, first fork the project, then do your changes, and then open a pull request in the main repository.
[ ] Sign the Contributor License Agreement (CLA) to allow Neuraxio to use and publish your contributions under the Apache 2.0 license, in order for everyone to be able to use your open-source contributions. Follow the instructions of the cla-bot upon opening the pull request.
Things to check at each Pull Request (PR)
[ ] Your local Git username is set to your GitHub username, and your local Git email is set to your GitHub email. This is important to avoid breaking the cla-bot and for your contributions to be linked to your profile. More info: https://github.com/settings/emails
[ ] Argument's dimensions and types are specified for new steps (important), with examples in docstrings when needed.
[ ] Class names and argument / API variables are very clear: there is no possible ambiguity. They also respect the existing code style (avoid duplicating words for the same concept) and are intuitive.
[ ] Classes are documented: their behavior is explained beyond just the title of the class. You may even use the description written in your pull request above to fill some docstrings accurately.
[ ] If a numpy array is used, it is important to remember that these arrays are a special type that must be documented accordingly, and that numpy array should not be abused. This is because Neuraxle is a library that is not only limited to transforming numpy arrays. To this effect, numpy steps should probably be located in the existing numpy python files as much as possible, and not be all over the place. The same applies to Pandas DataFrames.
[ ] Code coverage is above 90% for the added code for the unit tests.
[ ] Respect the Unit Testing status check
[ ] Respect the Codacy status check
[ ] Respect the cla-bot status check (unless the cla-bot is truly broken - please try to debug it first)
[ ] The above description of the pull request in natural language was used to document the new code inside the code's docstrings so as to have complete documentation, with examples.
[ ] Code files that were edited were reformatted automatically using PyCharm's Ctrl+Alt+L shortcut.

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Rohith295 commented 3 years ago

Unit tests are in progress

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Rohith295 commented 3 years ago

@guillaume-chevalier updated PR with tests aswell.

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Rohith295 commented 3 years ago

Here are some review comments, mainly regarding the unit test and software design.

Also I know that @vincent-antaki didn't have the time to finish his review friday. He will come back next Monday. I think he told me quickly that the present algorithm didn't cover every cases and was flawed in the sense that it was good at computing the compressed hyperparameters for the given unit test, but that your algorithm seemed not complete and would fail for more complicated wildcard patterns (e.g.: when there would be more than two stars * characters in one compression pass).

Finally, could you make sure that the docstrings are accurate ? Could you take a real print in your console to be sure that the documentation is on point with what the code does ?

Finally, we also realize that it's complicated to add the requirement in the requirements.txt file to be compatible with both Python 3.6, 3.7 and 3.8. Could you remove the usage of the dataclass and use a simple data structure like a dict or a list?

Thank you for your contribution ! That is a nice piece of work. We're getting close to have this complicated algorithm sorted out.

@guillaume-chevalier as per the logic we only get <=2 * characters how ever complicated the pipeline may be, the current implementation is based out of our discussion in the past which is like phase1 selection and phase2 pruning. So we always end up with <=2 '*' characters in the string representation. I have tried all the possible combinations that i can think of. If possible can you please provide me with atleast one example as @vincent-antaki mentioned since it is failing at some point, it helps me understand the expectations in more better way so that i can change the implementation accordingly.

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Rohith295 commented 3 years ago

@vincent-antaki and @guillaume-chevalier, can you please review my previous comment?

vincent-antaki commented 3 years ago

Hello Rohith!

Here are 7 test cases concerning the expected behaviour of such function. Please see test case 6 for a case with more than 2 stars.

test_case_0 = {'flat_hps': ['Pipeline__ABCD__test__copy', 
                            'Pipeline__ABCD__test1__copy', 
                            'Pipeline__parent1__test__copy', 
                            'Pipeline__parent2__test__copy', 
                            'Pipeline__parent1__hps', 
                            'Pipeline__parent1__abcd__test', 
                            'Pipeline__abcd__test', 
                            'Sklearn__done'],
              'compressed': ["*ABCD__test__copy",
                            "*test1__copy",
                            "*parent1*copy",
                            "*parent2*copy",
                            '*hps',
                            '*parent1*test',
                            'Pipeline__abcd__test',
                            "*done"]
                            }

test_case_1 = {'flat_hps': ['Pipeline__ABCD__copy', 'Pipeline2__ABCD__copy'],
             'compressed': ['Pipeline*copy', 'Pipeline2*copy']}

test_case_2 = {'flat_hps': ['Pipeline__abc__copy', 'Pipeline__abc__test'],
              'compressed': ['*copy', '*test']}

test_case_3 = {'flat_hps': ['Pipeline__copy', 'Pipeline2__copy'],
             'compressed': ['Pipeline__copy', 'Pipeline2__copy']}

test_case_4 = {'flat_hps': ['Pipeline__abc__def__copy', 'Pipeline__def__def__copy'],
              'compressed': ['*abc*copy', '*def*copy']}

test_case_5 = {'flat_hps': ['Pipeline__qwe__abc__copy',
                            'Pipeline__qwe__def__copy', 
                            'Pipeline2__abc__def__copy'],
              'compressed': ['*qwe__abc__copy', 
                             '*qwe__def__copy',
                             '*abc__def__copy']}

test_case_6 = {'flat_hps': ['Pipeline__submodel1__qwe__abc__123__copy',
                            'Pipeline__submodel1__qwe__def__123__copy', 
                            'Pipeline__submodel2__qwe__abc__def__123__copy'],
              'compressed': ['*submodel1*abc*copy', 
                             '*submodel1*def*copy',
                             '*abc__def*copy']}

Your solution comes short of the expected behaviour in some cases :

test case 1:
    expected return: ['Pipeline*copy', 'Pipeline2*copy']
    return:  ['Pipeline__ABCD__copy', 'Pipeline2__ABCD__copy']  
    problem: could be further compressed

test case 4:
    expected return: ['*abc*copy', '*def*copy']
    return: ['*abc*copy', '*def__def__copy']
    problem: could be further compressed

test case 5:
    expected return: ['*qwe__abc__copy', '*qwe__def__copy', '*abc__def__copy']
    return: ['*abc__copy', '*qwe__def__copy', '*abc__def__copy']
    This is one is actually fine. *abc__copy is not ambiguous while *abc*copy would be.

test case 6:
    expected return: ['*submodel1*abc*copy', '*submodel1*def*copy', '*abc__def*copy']
    return: ['*abc__123__copy', '*qwe__def__123__copy', '*abc__def__123__copy']
    problem : A shorter compression exists.

I believe that the algorithm might be a bit of a challenge. The way I see it, you probably have to populate a trie with all possible subsets of all hp paths which includes the last part, and then find, for each hp, the shortest one that is unique. You most likely cannot do this with the current 2 phases separation.

I think at this point a simple heuristic similar to what you propose might be good enough for our needs. In any case, the most important constraint is that the proposed solution has to be totally unambiguous w.r.t to which compressed hp name maps to which non-compressed hp name (which your solutions does). If you fix the test case 1 then I'm happy with the result.

Cheers, Vincent

Rohith295 commented 3 years ago

Hello Rohith!

Here are 7 test cases concerning the expected behaviour of such function. Please see test case 6 for a case with more than 2 stars.

test_case_0 = {'flat_hps': ['Pipeline__ABCD__test__copy', 
                            'Pipeline__ABCD__test1__copy', 
                            'Pipeline__parent1__test__copy', 
                            'Pipeline__parent2__test__copy', 
                            'Pipeline__parent1__hps', 
                            'Pipeline__parent1__abcd__test', 
                            'Pipeline__abcd__test', 
                            'Sklearn__done'],
              'compressed': ["*ABCD__test__copy",
                            "*test1__copy",
                            "*parent1*copy",
                            "*parent2*copy",
                            '*hps',
                            '*parent1*test',
                            'Pipeline__abcd__test',
                            "*done"]
                            }

test_case_1 = {'flat_hps': ['Pipeline__ABCD__copy', 'Pipeline2__ABCD__copy'],
             'compressed': ['Pipeline*copy', 'Pipeline2*copy']}

test_case_2 = {'flat_hps': ['Pipeline__abc__copy', 'Pipeline__abc__test'],
              'compressed': ['*copy', '*test']}

test_case_3 = {'flat_hps': ['Pipeline__copy', 'Pipeline2__copy'],
             'compressed': ['Pipeline__copy', 'Pipeline2__copy']}

test_case_4 = {'flat_hps': ['Pipeline__abc__def__copy', 'Pipeline__def__def__copy'],
              'compressed': ['*abc*copy', '*def*copy']}

test_case_5 = {'flat_hps': ['Pipeline__qwe__abc__copy',
                            'Pipeline__qwe__def__copy', 
                            'Pipeline2__abc__def__copy'],
              'compressed': ['*qwe__abc__copy', 
                             '*qwe__def__copy',
                             '*abc__def__copy']}

test_case_6 = {'flat_hps': ['Pipeline__submodel1__qwe__abc__123__copy',
                            'Pipeline__submodel1__qwe__def__123__copy', 
                            'Pipeline__submodel2__qwe__abc__def__123__copy'],
              'compressed': ['*submodel1*abc*copy', 
                             '*submodel1*def*copy',
                             '*abc__def*copy']}

Your solution comes short of the expected behaviour in some cases :

test case 1:
    expected return: ['Pipeline*copy', 'Pipeline2*copy']
    return:  ['Pipeline__ABCD__copy', 'Pipeline2__ABCD__copy']  
    problem: could be further compressed

test case 4:
    expected return: ['*abc*copy', '*def*copy']
    return: ['*abc*copy', '*def__def__copy']
    problem: could be further compressed

test case 5:
    expected return: ['*qwe__abc__copy', '*qwe__def__copy', '*abc__def__copy']
    return: ['*abc__copy', '*qwe__def__copy', '*abc__def__copy']
    This is one is actually fine. *abc__copy is not ambiguous while *abc*copy would be.

test case 6:
    expected return: ['*submodel1*abc*copy', '*submodel1*def*copy', '*abc__def*copy']
    return: ['*abc__123__copy', '*qwe__def__123__copy', '*abc__def__123__copy']
    problem : A shorter compression exists.

I believe that the algorithm might be a bit of a challenge. The way I see it, you probably have to populate a trie with all possible subsets of all an hp path which includes the last part, and then find, for each hp, the shortest one that is unique. You most likely cannot do this with the current 2 phases separation.

I think at this point a simple heuristics similar to what you propose might be good enough for our needs. In any case, the most important constraint is that proposed solution has to be totally unambiguous w.r.t to which compressed hp name maps to which non-compressed hp name (which your solutions does). If you fix the test case 1 then I'm happy with the result.

Cheers, Vincent

Thank you so much for the detailed explanation and the complex test cases @vincent-antaki . I will try to see how i can fix the issue accordingly.

cla-bot[bot] commented 3 years ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Rohith Rangaraju. This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Rohith295 commented 3 years ago

@vincent-antaki Hey i have changed algorithm a bit, to satisfy test case 0 and test case 1, as mentioned. Please verify and let me know if this is okay. Apologies for the delay, i was bit occupied by some other work all these days.

Neuraxio / Neuraxle

Recursive dict compress feature #502

What it is

How it works

Example usage

New contributor? Ensure you do this.

Things to check at each Pull Request (PR)