G should not contain negative values by construction (it is a scaled Gaussian kernel when using Sinkhorn), so this should not occur. Could you have a look at what is happening? Maybe we get steps in the FW that are larger than 1?
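For context, a tiny illustration (made-up matrices and the usual FW update written out by hand, not POT's actual code) of why a step outside [0, 1], including a negative one, can push entries of G below zero even though G starts nonnegative:

import numpy as np

# Illustration only (variable names assumed): the FW update
# G_new = G + alpha * (Gc - G) is a convex combination of two nonnegative
# plans, so nonnegativity is only guaranteed for alpha in [0, 1].
G = np.array([[0.49, 1e-9], [1e-9, 0.49]])   # current plan, some entries almost 0
Gc = np.array([[0.0, 0.5], [0.5, 0.0]])      # vertex from the linear oracle
for alpha in (0.5, 1.5, -0.01):              # steps outside [0, 1] break nonnegativity
    G_new = G + alpha * (Gc - G)
    print(alpha, bool((G_new < 0).any()))    # 0.5 -> False, 1.5 -> True, -0.01 -> True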
Yeah, that's why I was confused by this. I will try to find out why exactly this happens.
It turns out that scalar_search_armijo suggests a negative alpha in this callback, which leads the function value f(xk0 + alpha10 * pk0) to fall outside of the interpolation range (in our case, this "outside" happens to have negative values, as would be expected because matrix G has a lot of items pushed to almost 0). This is somewhat unexpected behavior from scalar_search_armijo, as it has a default amin=0 parameter. In fact, though, there is no check against that boundary here: it computes the minimizer of a quadratic interpolant, executes the callback phi, and returns if the Armijo condition is met, without checking against amin (that check happens only later in the code).
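To make the missing boundary check concrete, here is a paraphrase of the quadratic interpolation step (an illustration, not the scipy source): when phi'(0) > 0 and the interpolant is convex, its minimizer is negative, and that value gets passed straight to phi.

# Minimizer of the quadratic q(a) = phi0 + derphi0 * a + c * a**2, fitted so
# that q(alpha0) = phi(alpha0); a paraphrase of the interpolation step, not
# the scipy implementation itself.
def quad_interp_minimizer(phi0, derphi0, alpha0, phi_a0):
    c = (phi_a0 - phi0 - derphi0 * alpha0) / alpha0 ** 2
    return -derphi0 / (2.0 * c)

# derphi0 > 0 with a convex interpolant gives a negative step, which is then
# evaluated without being compared against amin.
print(quad_interp_minimizer(phi0=0.0, derphi0=1.0, alpha0=1.0, phi_a0=2.0))  # -0.5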
One possible solution would be to check, on our side, whether the alpha1 suggested by the search is outside of the min/max range (e.g. by returning an inf value from the phi callback). But I'm not sure that this is what is supposed to happen in the algorithm in the first place: the original line_search_armijo works with the projection onto the search direction given by the gradient (i.e. derphi0 = np.dot(gfk, pk)). In POT we have derphi0 = np.sum(pk * gfk), and I'm not sure this value guarantees that the quadratic interpolation doesn't break the min/max condition. WDYT?
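A minimal sketch of the guarded-callback idea (the wrapper and its argument names are hypothetical, not POT's code): out-of-range steps get a +inf cost, so the search never evaluates the objective on a plan with negative entries.

import numpy as np

def make_guarded_phi(cost, xk, pk, amin=0., amax=1.):
    """Wrap the line-search objective so out-of-range steps are rejected."""
    def phi(alpha):
        if alpha < amin or alpha > amax:
            return np.inf                 # reject: interpolated step left [amin, amax]
        # With pk = Gc - G, xk + alpha * pk is a convex combination of two
        # nonnegative plans, so the log terms in the cost stay finite.
        return cost(xk + alpha * pk)
    return phi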
Ouch, this one is subtle. Since scalar_search_armijo does not work as expected (does not respect amin), I would add a quick threshold outside of its call to force a positive or 0 step (sketched below).
Are you saying that in the function sometimes the gradient is not an ascent direction? This is possible since at each iteration we solve a Sinkhorn that might not have converged; in this case, setting a step of 0 (because numerical precision prevents us from getting a descent direction) seems reasonable.
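A sketch of the thresholding suggested above, done outside the scipy call (the helper name is made up; scalar_search_armijo is the scipy routine, whose import path has moved between scipy versions):

from scipy.optimize.linesearch import scalar_search_armijo  # newer scipy: scipy.optimize._linesearch

def clipped_armijo_step(phi, phi0, derphi0, alpha0=0.99, amax=1.):
    """Run the Armijo search, then force the returned step into [0, amax]."""
    alpha, phi1 = scalar_search_armijo(phi, phi0, derphi0, alpha0=alpha0, amin=0.)
    if alpha is None or alpha < 0.:
        # No usable positive step, e.g. the inner Sinkhorn did not fully
        # converge and the direction is not a descent direction: stay put.
        return 0.
    return min(alpha, amax)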
I can surely add an additional check to the callback function. I still don't have an intuitive feeling that this is the right solution, though. I will run more tests.
Are you saying that in the function sometimes the gradient is not an ascent direction?
So, this is from the debugger. I'm just evaluating phi (this one here) over a linspace between 0. and 0.99. In the first case, you can see that the cost goes down, while on the second loop it clearly goes up. Not sure why; Sinkhorn does converge in this case.
np.array([phi(x) for x in np.linspace(0., .99, 20)])
array([ 2.7194557 , 2.29096204, 1.87171902, 1.46029814, 1.0558716 ,
0.65791566, 0.26609441, -0.11979583, -0.4998549 , -0.87409216,
-1.24242958, -1.60469608, -1.96061174, -2.30975788, -2.65152358,
-2.98500709, -3.30881903, -3.62062802, -3.91580393, -4.17963043])
np.array([phi(x) for x in np.linspace(0., .99, 20)])
array([-4.17963043, -4.16778306, -4.14616826, -4.11810441, -4.08496656,
-4.04752903, -4.00628706, -3.96157971, -3.91364772, -3.86266445,
-3.80875372, -3.75200051, -3.69245742, -3.63014829, -3.56506962,
-3.49718998, -3.42644711, -3.35274138, -3.27592209, -3.19574438])
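One quick check tying this back to the interpolation issue (a debugging helper with assumed variable names, not library code): the sign of the derphi0 that POT passes to the search tells you which of the two regimes above you are in; a positive value means pk is not a descent direction, which is exactly when the interpolated step comes out negative.

import numpy as np

def is_descent_direction(gfk, pk):
    """True when the search direction pk decreases the linearized cost."""
    # Same quantity POT passes as derphi0 (see np.sum(pk * gfk) above).
    return bool(np.sum(pk * gfk) < 0)

On the second evaluation above this would presumably come out False, consistent with phi increasing over [0, 0.99].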
Describe the bug
ot.optim.generic_conditional_gradient uses the following cost function for line search (when reg2 is provided):
When matrix G contains negative values, nx.log(G) returns nan with a RuntimeWarning.
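A minimal reproduction of that warning with plain numpy in place of the backend nx (made-up values, and assuming an entropic term of the form sum(G * log(G)), consistent with the nx.log(G) call above):

import numpy as np

# One near-zero entry pushed slightly negative by a bad line-search step is
# enough to turn the whole cost into nan.
G = np.array([[0.5, 0.25], [0.25, -1e-12]])
val = np.sum(G * np.log(G))   # RuntimeWarning: invalid value encountered in log
print(val)                    # nan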
Code sample
This example is used for test_sinkhorn_l1l2_transport_class in test/test_da.py.

Expected behavior
The question now is: are negative values expected to happen while performing line search? If yes, should we evaluate the cost as -inf in this case? Or is it okay for the cost to be nan (if this is the case, we should short-circuit the call to avoid warnings)?

Environment (please complete the following information):