Raise exception instead of auto replace bad characters

alexlusak commented 4 days ago

I have a constraint where the index is something akin to the following:

('TEST', 'Apple-Banana Redo', 1, 1)

when adding this constraint directly through Gurobi, this would return an error due to incorrect characters. However, in gurobipy-pandas, I noticed the "bad" characters are automatically replaced with _, as in the following:

'c_test[TEST,Apple_Banana_Redo,1,1]'

This presents an issue when we go to extract outputs from the model and merge things back on to our original dataframes, since now the index has changed. Either we merge rows out or see NA values when left merging.

I think that this line should be modified, preferably to raise an error instead of just performing the string replace:

return index.map(str).str.replace(r"[\+\-\*\^\:\s]+", "_", regex=True)

I also think it's probably worth it to fix the datetime conditional branch as well - this should really force users to have datetimes represented in "%Y_%m_%dT%H_%M_%S" formatting.

alexlusak commented 4 days ago

I should add that I especially think this should change because my understanding is that this behavior is different than what occurs with gurobi.

simonbowly commented 4 days ago

@alexlusak thanks for bringing this up. The goal was to, by default, avoid extraneous extra characters that would cause Gurobi to write model files that can't be read back in. From memory we found pandas objects (especially datetime) behaved a little differently in their string representations to plain python which was part of the reason for that choice.

I can certainly see the argument for making this completely consistent with gurobipy. That behaviour is to just let the names go through as-is. There is an option for this in gurobipy-pandas, which is to pass index_formatter='disable' to the add_vars / add_constrs methods. This is documented on this page and in the API docs for those functions.

Would that option resolve your issue? I would prefer to leave the defaults as-is unless there are a lot of objections. gurobipy-pandas is meant as a higher-level interface so it seems reasonable to do some extra "help" for the user here.

simonbowly commented 4 days ago

This presents an issue when we go to extract outputs from the model and merge things back on to our original dataframes, since now the index has changed. Either we merge rows out or see NA values when left merging.

By the way, I'm curious to know how the naming impacts indexes when merging data? Does your code require joining variables/constraints based on their names? Or did I misunderstand something here?

alexlusak commented 4 days ago

This issue arose because there is a section of our code where we extract Pi values from a set of constraints, and when we parse those constraints out into a dataframe, the second column in my previous example with -s in the name had changed to _, so when we go to left join back onto our original dataframe which was used to create variables/constraints the indices don't match.

I'm not entirely sure if just setting the index_formatter='disable' would resolve the issue I was raising per se - obviously it would fix the join issue I mentioned, but disregarding the datetimes for a second, I just found it a bit troubling that gppd automatically replaces these charcaters whereas gurobipy will throw an error if it detects a space character for example. I would think if anything gppd would either also check for misformatting, or pass along the misformatted string and let it be caught by gurobipy (not ideal imo).

alexlusak commented 4 days ago

An alternative could be to at least print out some warning message to the user that the replacement occurred?

simonbowly commented 4 days ago

I just found it a bit troubling that gppd automatically replaces these characters whereas gurobipy will throw an error if it detects a space character for example.

My point is that gurobipy does not throw an error if it detects a space character (at least to my knowledge, if you're seeing different behaviour please let me know). The most it will do is print a warning, and only does that if you write the model out to file format that forbids these characters. Actually for LP format what it will do there is replace illegal characters with underscores in the written LP. For MPS format, it will ignore the user-provided names entirely.

>>> x = model.addVar(name="x     y")
>>> c = model.addConstr(x == 1, name="<>?:{}|_+")
>>> model.write("model.lp")
Warning: variable name "x     y" has a space
Warning: constraint name "<>?:{}|_+" has a colon
Warning: to let Gurobi read it back, use rlp format
>>> model.write("model.mps")
Warning: variable name "x     y" has a space
Warning: constraint name "<>?:{}|_+" has a colon
Warning: default variable names used to write mps file
Warning: default constraint names used to write mps file

I can understand the inconsistency is annoying and I'll consider changing the defaults here. Just to be clear though, if we make a change it would be to just not do any character replacement (and not to error out). Note though that we may not be able to make the naming fully consistent, since gurobipy generates names typically from an iterable of python objects, while gurobipy-pandas is using a pandas index.

This issue arose because there is a section of our code where we extract Pi values from a set of constraints, and when we parse those constraints out into a dataframe, the second column in my previous example with -s in the name had changed to _, so when we go to left join back onto our original dataframe which was used to create variables/constraints the indices don't match.

When writing models either with gurobipy-pandas, or with gurobipy directly, I would always avoid having the logic of the program depend on the generated names of individual variables and constraints. This is really making program logic depend on python's str() behaviour. By design gurobipy-pandas does not change the stem string that you provide to the name argument, and it uses this as-is for the names of new dataframe columns or series it creates. So you should be able to rely on column names behaving predictably. If it does change those, I would definitely consider that to be a bug.

Could you please share a minimal example? I'd like to understand the workflow here, I think I must be missing something, sorry.

Dr-Irv commented 4 days ago

This issue arose because there is a section of our code where we extract Pi values from a set of constraints, and when we parse those constraints out into a dataframe, the second column in my previous example with -s in the name had changed to _, so when we go to left join back onto our original dataframe which was used to create variables/constraints the indices don't match.

IMHO, this is bad practice. When you create your constraints with gurobipy-pandas, you get a Series of constraint objects that have an index corresponding to your constraints, such that the index will have values corresponding to your data. So you can get the pi values for those constraints with the same index.

It is not a good practice to use the names of constraints or variables to do joins back to your original data.

Gurobi / gurobipy-pandas

Raise exception instead of auto replace bad characters #90