MLopez-Ibanez / irace

Iterated Racing for Automatic Algorithm Configuration
https://mlopez-ibanez.github.io/irace/
GNU General Public License v2.0
58 stars 14 forks source link

add readParametersData #46

Closed DE0CH closed 10 months ago

DE0CH commented 1 year ago

This PR adds functionality to parse parameters from an R list (as opposed to a string). This would be useful to build a user friendly native object interface with iracepy where user does't have to format a string themselves.

Summary of changes:

Things that can be improved:

MLopez-Ibanez commented 1 year ago

Hi @DE0CH , it is great that you decided to submit a PR. Could you give me a bit more context? Why would a user want to define the parameters in this way? It seems much more typing and much more room for errors.

I can see the uses of a programmatic interface that added parameters one by one, something like:

params <- ParameterSet$new() params$newInt(name="a", domain=c(1,10), transf="log") params$newReal(name="b", domain=c(0.0,1.0), transf="log")

Actually, the above is very similar to what paradox provides (see the examples here: https://paradox.mlr-org.com/). Having an interface similar to paradox would allow us to use their code (and it will interfacing irace with mlr3).

I'm not saying that I'm against what you propose but I'm trying to understand what is the use-case.

Cheers!

DE0CH commented 1 year ago

Hi Manuel, thanks for the reply. I thought of adding this when I saw the the way the parameters are specified in example_dual_annealing.py in iracepy.

I agree that it is much easier and error-free for humans to just write the string than using this interface, but I don't intend this interface to be used by end-users directly. It's mostly used by other wrappers that have much better interface and error checking. For example, adding compatiblity with the paradox package and compatiblility for ConfigSpace configurations (which I am currently working on for iracepy). These libraries will have to pass data into irace somehow and I think the best way is through this newly added readParametersData. It is much easier and secure for library authors to use because they don't have to worry about formatting a string and escaping special characters. As such, the function assumes the input is correct as checked by user-facing interfaces and will result in undefined behaviour if the input is invalid.

As I see it, the readParameters function accomplishes two things. First, it parses the string into native data structure for easier manipulation and secondly, it resolves the dependencies and compute properties for irace to use later such as the nbFixed, depends and hierarchy fields in the parameters returned. The readParameterData functoin would only do the second part and let whatever user facing library do the first part.

One use case I am considering is to load SMAC3 configurations from the surrogate modles in aclib2 for meta-tuning irace. If I don't want to manually translate tens of configurations files manually by hand, I need some way to translate the configuration files automaically. I am considering to load in the parameters file with ConfigSpace and then translate it into the format that readParametersData accepts and then get a parameters object that irace can use natively.

In summary, this interface is designed as an low level interface for other user facing interfaces. One of which is iracepy which I am currently working on.

I hope this helps!

MLopez-Ibanez commented 1 year ago

Hi,

I agree with you completely. My only question is about the interface. If you look at the interfaces of paradox and ConfigSpace, they are very similar to what I have described above, where parameters are defined one by one. This would allow current readParameters to use the same interface to create the parameters object instead of having 2 functions in irace to create parameters that know the low-level details. Having such interface would also make much easier to move to paradox in the future.

Best,

Manuel.

DE0CH commented 1 year ago

I am not sure I understand what you mean. I see that getting the interface right is quite important, but do you mean instead of having list of five lists, each corresponding to a property, I should have a list of structs / objects where each object correspond to one line in the configuraiton? i.e. it is the transpose / zip of the data structure I currently have. To be honest, I prefer having serveral vectors of primitive data types instead of a list of objects. I find it easier to manipulate, uses less memory and is slightly faster. Also, although I haven't looked at how r2py works, but I think it would be way easier to pass through the interface five lists of strings, numbers and expressions than a list of objects.

As to why there are two functions, I mostly just don't want to touch the readParameter function lest I break something. Ideally, I think readParameter should parse the text file into the data format that readParameterData accepts and let it handle the rest, but I don't think I should do that now since it's quite a large refactor and it's easy to mess up.

Anyways, I am happy to change the interface to whatever you prefer. So please just let me know the specifications and I will make changes accordingly. Actually now that I think about, it would much prefer to keep this interface simple and primitive to maximise its comptablity with all sorts of binding with other languages and formats. As for compatablity with paradox, I think another function that coverts a paradox to this primitive format and back should be better.

What do you think?

MLopez-Ibanez commented 1 year ago

I am not sure I understand what you mean. I see that getting the interface right is quite important, but do you mean instead of having list of five lists, each corresponding to a property, I should have a list of structs / objects where each object correspond to one line in the configuraiton? To be honest, I prefer having serveral vectors of primitive data types instead of a list of objects. I find it easier to manipulate, uses less memory and is slightly faster.

Not necessarily, currently the internals of the parameters "object" could stay as they currently are, but the way to create it should be less error-prone. It may look easy to create a list of five lists when you have less than 10 parameters but when you have 200, it becomes harder to track what element of the sublists correspond to the other elements of the sublists. Also, your example already has quite a bit of redundancy and default values that the user needs to track. Instead of:

d = list()
d$names <- c(
    'initial_temp',
    'restart_temp_ratio',
    'visit',
    'b',
    'accept',
    'no_local_search'
)

d$switches <- c(
    '',
    '',
    '',
    '',
    '',
    ''
)

d$types <- c(
    'r,log',
    'r,log',
    'r',
    'r',
    'r',
    'o'
)

d$conditions <- list(
    TRUE,
    TRUE,
    TRUE,
    TRUE,
    expression(no_local_search == "ccc"),
    TRUE
)

d$domains <- list(
    c(0.02, 5e4),
    c(1e-5, 1),
    c(1.001, 3),
    expression(visit, visit + 10),
    c(-1e3, -5),
    c('', '1', 'ccc')
)

the user (or the Python code via rpy2) can write:

d = ParametersNew(p_real('initial_temp', 0.02, 5e4, transf="log"),
                  p_real('restart_temp_ratio', 1e-5, 1, transf="log"),
                  p_real('visit', 1.001, 3),
                  p_real('b', quote(visit), quote(visit + 10)),
                  p_real('accept', -1e3, -5, quote(no_local_search == "ccc")),
                  p_ord('no_local_search', '1', 'ccc'))

Also, although I haven't looked at how r2py works, but I think it would be way easier to pass through the interface five lists of strings, numbers and expressions than a list of objects.

Maybe yes, I have some doubts about it, specially passing expressions around, those will need to be passed as strings and parsed within irace (either by readParameterData or by ParametersNew). Calling R functions and passing numbers and strings is easy. Passing R objects through Python is also easy (Python calls an R function that returns something and gives this something to another R functions). But converting from R to Python and viceversa gets tricky.

But this is not what I'm proposing. My proposal is that iracepy (or whatever python code) will call the functions ParametersNew, p_real, p_ord, p_int, etc to create the object directly in R format.

As to why there are two functions, I mostly just don't want to touch the readParameter function lest I break something. Ideally, I think readParameter should parse the text file into the data format that readParameterData accepts and let it handle the rest, but I don't think I should do that now since it's quite a large refactor and it's easy to mess up.

Fair enough, we can move readParameter to use the new interface when it is working.

~Anyways, I am happy to change the interface to whatever you prefer. So please just let me know the specifications and I will make changes accordingly.~ Actually now that I think about, it would much prefer to keep this interface simple and primitive to maximise its comptablity with all sorts of binding with other languages and formats.

If you are happy with the interface that I propose above, then I would prefer something like that. You can even have a function in python that takes a Python list of lists (or whatever you want to write) and calls the R functions to create parameters R object.

Otherwise, another option that would not require touching irace at all and it is even simpler is to implement in python a function that converts from ConfigSpace to irace's textual format and then use the current readParameters as it is. This would be quite useful in any case, so one can dump that file to a .txt and look at it.

I hope the above makes sense!

DE0CH commented 1 year ago

I agree. Your way of creating the data is much more user friendly and less error prone. But my idea is not for an end user to use this function, but it's more of an internal function. The code snipt is just for documenting the input format. In reality, a user should never need to write such code. In the documentation, I should point user to use readParameter and whatever interface in iracepy to create the parameter.

The function receives data from readParameter and iracepy, both of which would contain input validation. The internal representation in readParameter is already quite similar to the current interface (the parameters varaible), so it would not need to do any more work to use the readParaemterData internal function.

But this is not what I'm proposing. My proposal is that iracepy (or whatever python code) will call the functions ParametersNew, p_real, p_ord, p_int, etc to create the object directly in R format.

I would like to counter propose that the python library creates paraemters with python objects and convert to the format that readParametersData accepts. For example:

class IraceParameters:
    ...
    def add_parmeter(self, type, switch, types, domains, conditions):
        ...
    def export(self):
        parameters = convert to the r object compatible with readParametersData
        parameters = irace.readParametersData()

class Irace:
    def run(self):
        irace(scenario, parameters.export())
        ...

If you are happy with the interface that I propose above, then I would prefer something like that.

Is it possible that we have two functions? In addition to the current function, we have another function readParametersObject (or some other clearer name) which takes input in the format you perfer and calls readParametersData internally? It needs to convert the format anyways to use tansform.domain, isFixed anyways.

Otherwise, another option that would not require touching irace at all and it is even simpler is to implement in python a function that converts from ConfigSpace to irace's textual format and then use the current readParameters as it is. This would be quite useful in any case, so one can dump that file to a .txt and look at it.

I don't really like this idea because there are a lot of ways it can go wrong. What if the variable contains a special character such as space, quote or "|"? It's hacky at best and a secruity flaw at worst.

MLopez-Ibanez commented 1 year ago

I agree. Your way of creating the data is much more user friendly and less error prone. But my idea is not for an end user to use this function, but it's more of an internal function. The code snipt is just for documenting the input format. In reality, a user should never need to write such code. In the documentation, I should point user to use readParameter and whatever interface in iracepy to create the parameter.

If it is for internal use, then why need to pass a list of lists? Why not simply pass the list of names, switches, domains, etc? You also don't need to pass "n", since this can be inferred from the length of the list of names.

I would like to counter propose that the python library creates paraemters with python objects and convert to the format that readParametersData accepts. For example:

I really don't see how this is better because the python library has to create a representation that is not the representation used by the R code and the functions of the IraceParams class has to either duplicate the error checking done in the R code or do no error checking at all until .export() is called.

But I don't want to stop you, if you think this is better, I'm happy to accept this for your particular usecase. But first create the Python interface and see how it works with your proposed readParametersData (without the list of lists, just pass the individual lists, and without the 'n' as that is simply the length of the names list) as that may lead you to update this PR.

Also, please use the parameters in the example of readParameters as that exercises many of the features required. Once the Python code is working, I'm happy to accept this function (or however it ends up looking) with a note that it is for internal use only.

Is it possible that we have two functions? In addition to the current function, we have another function readParametersObject (or some other clearer name) which takes input in the format you perfer and calls readParametersData internally? It needs to convert the format anyways to use tansform.domain, isFixed anyways.

Perhaps my proposal was not clear. In my proposal readParametersData will be the one calling parametersNew and the other helper functions, because those functions will be the ones that contain all the logic and error-checking that builds the parameters object as it is now. It doesn't make sense to have a function readParametersObject: It is not useful for Python because it is not easy to convert between Python/R objects and it is not useful for R because you either want a text representation (which we already have) or a programmatic interface (which the proposed parametersNew would provide). But as I said, I'm not going to stop you, I just ask that you get a complete solution working, including the Python part, and tested with a complex example, like the example provided with readParameters and then I'm happy to accept this PR.

I don't really like this idea because there are a lot of ways it can go wrong. What if the variable contains a special character such as space, quote or "|"? It's hacky at best and a secruity flaw at worst.

There is no other way to parse a textual representation of the conditions unless a custom-made parser is written, which would be slower and a waste of time. Yes, in principle the user can write a condition that deletes your hard-drive. But they can also do that in the target-runner or worse, in something that is called by the target-runner and completely out of your view. If you are running a tuning scenario from a source you don't trust then something is very wrong already because you are asking irace to run user-provided code!

But as I said, if you manage to make the Python part work with a realistic example, I'm not opposed to adding this function even if you are the only user of it.

DE0CH commented 1 year ago

Thanks for all the advice. Yes, I think having a complete solution would be much benifitial. I will work on it in the next few days and I will come back to you. I will convert this PR to draft now and re-request your review when I'm ready. Thanks.

If it is for internal use, then why need to pass a list of lists? Why not simply pass the list of names, switches, domains, etc? You also don't need to pass "n", since this can be inferred from the length of the list of names.

That's a good idea. I have not thought of that.

I really don't see how this is better because the python library has to create a representation that is not the representation used by the R code and the functions of the IraceParams class has to either duplicate the error checking done in the R code or do no error checking at all until .export() is called.

I will try to implement error checking and see how it can be properly done.

Also, please use the parameters in the example of readParameters as that exercises many of the features required.

Sure.

In my proposal readParametersData will be the one calling parametersNew and the other helper functions, because those functions will be the ones that contain all the logic and error-checking that builds the parameters object as it is now. It doesn't make sense to have a function readParametersObject: It is not useful for Python because it is not easy to convert between Python/R objects and it is not useful for R because you either want a text representation (which we already have) or a programmatic interface (which the proposed parametersNew would provide).

I'm sorry I don't think I understand what you mean now. Perhaps when I have the python code it will be earsier for us to discuss about this and be on the same page.

There is no other way to parse a textual representation of the conditions unless a custom-made parser is written, which would be slower and a waste of time. Yes, in principle the user can write a condition that deletes your hard-drive. But they can also do that in the target-runner or worse, in something that is called by the target-runner and completely out of your view. If you are running a tuning scenario from a source you don't trust then something is very wrong already because you are asking irace to run user-provided code!

I see!

But as I said, if you manage to make the Python part work with a realistic example, I'm not opposed to adding this function even if you are the only user of it.

I hope I won't be the only user of this. Yes I am currently the only author (other than you) working on iracepy which would benefit from this, but I hope other people would find it useful so this PR would indirectly benefit them.

DE0CH commented 1 year ago

The iracepy interface is pretty much done at auto-optimization/iracepy#5.

Not sure why CI checks fail on macOS. Can you help with that?

Saethox commented 1 year ago

What's the status of this? I am interested in this feature for my own Python wrapper iracepy-tiny.

If possible, I would very much prefer an interface like paradox, as you have mentioned. My wrapper currently translates from a similar structure into irace's text format, which is then parsed again. It works, but it's kind of ugly.

MLopez-Ibanez commented 1 year ago

What's the status of this? I am interested in this feature for my own Python wrapper iracepy-tiny.

If possible, I would very much prefer an interface like paradox, as you have mentioned. My wrapper currently translates from a similar structure into irace's text format, which is then parsed again. It works, but it's kind of ugly.

Hi Jonathan!

There is a branch: https://github.com/MLopez-Ibanez/irace/tree/parametersNew implementing an interface like paradox. In fact, I would like to use paradox directly but there are a few unresolved questions about paradox.

The only reason I have not merged this branch into main is that I haven't had time to test it thoroughly, but the plan is to integrate it before the next irace release, with or without paradox. I would be happy to get feedback on the interface and whether something is missing or could be done better.

(an unrelated question: why fork iracepy? I would happily let you merge your improvements)

MLopez-Ibanez commented 1 year ago

The new functions are at the bottom of https://github.com/MLopez-Ibanez/irace/blob/parametersNew/R/parameters.R

DE0CH commented 1 year ago

Hi,

Thanks for your interest in the feature.

The feature is working and I am just waiting for @MLopez-Ibanez approval. I was planning to use this on iracepy but I kind of just gave up on developing iracepy because there's a lot of problems with passing data through and parallel computing. I just think doing interprocess communication (e.g. though tcp, unix socket, or pipes) works a lot better than trying to bind r to python through embeded r (which is what rpy2 uses).

So I think at least for me, I am open for you to use this interface as is or change it as it fits for iracepy-tiny packages, eventually @MLopez-Ibanez might approve this changes, but he might also just close it.

Sorry I probably don't have time to work on it this month or next (my university begins).

MLopez-Ibanez commented 1 year ago

Rather than the approach in this PR, I'd rather use the interface proposed in: https://github.com/MLopez-Ibanez/irace/blob/parametersNew/R/parameters.R

Saethox commented 1 year ago

Rather than the approach in this PR, I'd rather use the interface proposed in: https://github.com/MLopez-Ibanez/irace/blob/parametersNew/R/parameters.R

That looks exactly like what I want, thank you. I'll try to integrate it, and let you know of the result.

(an unrelated question: why fork iracepy? I would happily let you merge your improvements)

iracepy-tiny was originally only meant as a stepping stone for an irace Rust wrapper, so I'm only planning support for exactly the features that I need. I'm still experimenting with the interface right now, but if it's more stable I would be happy to integrate the changes back into iracepy, if there's demand.