emdgroup / baybe

Bayesian Optimization and Design of Experiments
https://emdgroup.github.io/baybe/
Apache License 2.0
269 stars 42 forks source link

Polars product inconsistent with pandas product #424

Open AdrianSosic opened 1 week ago

AdrianSosic commented 1 week ago

The order of the Cartesian product elements is different when created via polars or pandas. That is, depending on BAYBE_DEACTIVATE_POLARS , the following code gives two different outputs:

from baybe.parameters.numerical import NumericalDiscreteParameter
from baybe.searchspace.discrete import SubspaceDiscrete

s = SubspaceDiscrete.from_product(
    [
        NumericalDiscreteParameter("a", [0, 1]),
        NumericalDiscreteParameter("b", [3, 4, 5]),
    ]
)
print(s.exp_rep)

Pandas:

     a    b
0  0.0  3.0
1  0.0  4.0
2  0.0  5.0
3  1.0  3.0
4  1.0  4.0
5  1.0  5.0

Polars:

0  0.0  3.0
1  1.0  3.0
2  0.0  4.0
3  1.0  4.0
4  0.0  5.0
5  1.0  5.0

Not a "bug" in the strict sense, but still annoying and can cause troubles in testing (this is how I found it) or when people rely on the actual ordering and then toggle the flag.

Scienfitz commented 1 week ago

I don't think this should be labelled a bug, the order is inherent to the used method + order of parameters apssed

Unless there is an inconsistent handling of the latter from our side, this might not even be fixable from our end

In the provided example it seems like polars treats the parameters starting from the last while pandas starts from the first. If this is true we might be able to simply exploit this and reverse the order of parameters passed to the polars method

AVHopp commented 1 week ago

Agree, this should be investigated in a bit more detail. But can you elaborate a little bit how and why this caused issues in testing?

AdrianSosic commented 1 week ago

I had a deeper look and noticed that the problem is not because of the Cartesian part (that works just like the pandas counterpart) but due to the streaming=True flag in the collect call, which seems to destroy the order. Now this could be either because:

What do we make out of this?

Scienfitz commented 1 week ago

could there be a reorder step in which the fitlered polars resulting df is ordered according to the parameter-parameter-values order (ie the one Cartesian implies) ? Just .sort_values ? other than that I dont see a fix