Implement a C^k with k>2 coupling layer with analytic inverse.

EiffL commented 2 years ago

I'm opening this issue to continue the discussion of #2 regarding the development of a smooth coupling layer of high enough order.

Our preliminary results from that thread are the following:

Conventional coupling layers and neural networks used in flows are not appropriate to training by score matching because they are not smooth enough
We can define coupling layers of higher smoothness following for instance https://arxiv.org/pdf/2110.00351.pdf
@Justinezgh has started to prototype analytic bijections needed for such a coupling layer, with analytic inverses.

EiffL commented 2 years ago

Ok, so we had a question here: Can we check that for whatever value of a>0 and b \in (0,1), the end function f is indeed still a bijection in [0,1]

b-remy commented 2 years ago

The demonstration is already in the supplementary information of Köhler et al., they demonstrate that sigma(rho)(x) is strictly monotonously increasing and bounded, hence bijective. Combination with strictly increasing affine function still gives a bijetor. They also show the smoothness.

Maybe I didn't get the question correctly :-) did you mean experimentally?

EiffL commented 2 years ago

yeah no, the question @Justinezgh had was whether this held for any value of a and b. and hadn't really thought about that.

What do you think Justine? do you have a different way of phrasing the question, or do you think it's ok in the end? (havent had time yet to dive into it myself)

Justinezgh commented 2 years ago

So they say that :

In our case omega is [-1/2a + b , 1/2a + b] with a>0 and b in [0,1] and I think that considering that ^^^ [-1/2a + b , 1/2a + b] should at least contain [0,1] to have a diffeomorphism on [0,1]. But with these only constraints : a>0 and b in [0,1], we can choose for instance a = 5 and b = 0.5 ( + c = 0.5) and in this case omega is [0.4 , 0.6] and for rho(x) = exp(-1/x**2) we have this :

https://colab.research.google.com/drive/1kQ-CZpBlqCT6Th03hp5A8MNZ-sc_gA3p?usp=sharing

Justinezgh commented 2 years ago

I think that if we want to use those constraints we have to choose rho in a way that sigma is a bijection on R which is the case for rho(x)=x**3 (I think)

EiffL commented 2 years ago

hummmmmm I see, yeah, so it looks like r^k is working as expected,right? is it that the problem is that the rho function should be within [0,1]

Arf, no sorry... Hum not obvious why r^3 works but not the exp model.....

Justinezgh commented 2 years ago

I think that x**3 is the only one that works.

rho(x) = x**2 a = 5 b = 0.5 c = 0.001

rho(x) = x**5 a = 5 b = 0.5 c = 0.001

Justinezgh commented 2 years ago

So for me we have 2 possibilities :

we choose any rho but we have to change the constraints
we keep the constraints but we have to choose rho in a way that it's a bijection on R

and so this is the reason why I don't understand how a and b can be learned

EiffL commented 2 years ago

@b-remy has another explanation that maybe he can explain here. But essentially, despite what is said in the text, f is a bijection from R to I

So, one way to deal with that would be to add a Scaleshift bijector as input that would rescale I to R before sending it to our bijector

I'm just not sure if in doing so we remove the expressivity of the layer.... To be tested. So Justine can you try adding such a bijector in your coupling layer? So that the input and output of the layer remain in [0,1]

Justinezgh commented 2 years ago

The constraints if we want to have [0,1] in omega

6DC602B8-46FF-4ED5-8170-60BB2DBFA002

Justinezgh commented 2 years ago

I'm not sure about the bijection f from R to [0,1]. To have a bijection on R, sigma has to be a bijection on R too (or c has to be "big") even if [0,1] is in omega :

rho(x) = x**4 a = 0.5, b = 0.2, c = 0.05

and f(x) with x in R can't be in [0,1] because of the c*x

b-remy commented 2 years ago

Yes I agree that there isn't necessarily a bijection from R to [0,1]. There's for sure a bijection f from [-1/2/a +b, 1/2/a +b] to [0,1] as you said above. I also agree that it does not make a bijection f from [0,1] to [0,1], if [0,1] isn't in [-1/2/a +b, 1/2/a +b]. So maybe we need to enforce the constraint you derived above...

Otherwise maybe a silly question: can we have several ramp bijectors on multiple intervals [-1/2/a +b, 1/2/a +b] with continuous contraints so that we make sure [0,1] is within the union?

Justinezgh commented 2 years ago

The constraints if we want to have [0,1] in omega

So to keep it simple we can say that b is in [0.1 , 0.9] and a is in ]0 , 0.5] (because 1/(2(1-b)) with b = 0.1 is 0.56) idk :/ The thing is that if we use a conditional activation function I think that we will have the same pb as with relu

EiffL commented 2 years ago

hummmmmm its really not clear to me, what you suggest looks like it should work, but I'm surprised it's not mentioned at all in the paper. Can we try to find their source code somewhere and check their implementation?

Justinezgh commented 2 years ago

Benjamin you mean for g ? I'm not sure I understand

Justinezgh commented 2 years ago

I can't find their code :/

b-remy commented 2 years ago

Benjamin you mean for g ? I'm not sure I understand

I meant f, it is the same for g since g is a bijective function only from [-1/2/a+b, 1/2/a+b] to [0,1], actually I did not add anything useful to the discussion ^^ but now I get your point that we dont have necessarily a bijection [0,1] to [0,1]. I dont understand how they do it either...

To understand what's going on outside [0,1], you can easily show that if you want sigma to be bijective for all k (odd or even), you need to take x in [0,1]. All R works for odd k, but only x in [0,1] for even k (by considering the derivative of sigma > 0).

Justinezgh commented 2 years ago

Axel helped me to find this : https://github.com/noegroup/bgflow/pull/28

aboucaud commented 2 years ago

So what should we look for in this repo ? The coupling flow is there https://github.com/noegroup/bgflow/blob/main/bgflow/nn/flow/coupling.py#L133

EiffL commented 2 years ago

https://github.com/noegroup/bgflow/blob/191465c29325d07d371c3bcc2e380716acf4666d/bgflow/nn/flow/transformer/compact/compact.py#L195

EiffL commented 2 years ago

In fact.... this looks like the piece of code we are interested in: https://github.com/noegroup/bgflow/blob/52f9e32a8d3b432ce559fa857ee7dc17997e9b96/bgflow/nn/flow/transformer/sigmoid.py#L62 I would say that here this is the definition of this sigmoid coupling layer, the code I have in the post above that where they define the ramp functions.

After some reverse engineering, I think I got it! Here is the translation table: a = sigma, b = mu, c = min_density And here is the coupling function:

def affine_sigmoid_transform(x, a, b, c, ramp=rho):

  b = (b - 1./(2*a))

  # Here we apply a rescaling by a and b first, before rho
  diff = x - b
  zs = jnp.stack([diff, -b*jnp.ones_like(x), 1 - b*jnp.ones_like(x)],axis=0)
  zs = zs * a

  def sigma(x):
    return rho(x) / (rho(x) + rho(1 - x))

  y, y0, y1 = sigma(zs)

  # This is the equivalent of f
  y = (y - y0)/ (y1 - y0)

  y = y*(1-c) + c *x

  return y

Here is the result for an exponential function and the function defined above \o/

EiffL commented 2 years ago

hum actually not completely solved, it doesnt work for a > 1

EiffL commented 2 years ago

need to check carefully how this function restricts the values of a,b,c that come from the NN: https://github.com/noegroup/bgflow/blob/52f9e32a8d3b432ce559fa857ee7dc17997e9b96/bgflow/nn/flow/transformer/sigmoid.py#L78

Justinezgh commented 2 years ago

need to check carefully how this function restricts the values of a,b,c that come from the NN: https://github.com/noegroup/bgflow/blob/52f9e32a8d3b432ce559fa857ee7dc17997e9b96/bgflow/nn/flow/transformer/sigmoid.py#L78

ok I think that I'll need to think about it a bit more ^^' but I did that : https://colab.research.google.com/drive/1mRGwrsJl0ldW9wul3I1DsdfYjC1IjcOO?usp=sharing

EiffL commented 2 years ago

Lol, yep, I think that's the black magic we had been looking for :-D Could you try to modify the code in https://github.com/astrodeepnet/sbi_experiments/pull/10 accordingly? Maybe only the notebook needs to be changed... I'm not sure.

EiffL commented 2 years ago

Ok this is great @Justinezgh if you want to add your analytic inverse code, which would be nice for comparison, you can open a Pull Request for this, I'll be happy to review it ;-)

Justinezgh commented 2 years ago

In fact.... this looks like the piece of code we are interested in: https://github.com/noegroup/bgflow/blob/52f9e32a8d3b432ce559fa857ee7dc17997e9b96/bgflow/nn/flow/transformer/sigmoid.py#L62 I would say that here this is the definition of this sigmoid coupling layer, the code I have in the post above that where they define the ramp functions.

After some reverse engineering, I think I got it! Here is the translation table: a = sigma, b = mu, c = min_density And here is the coupling function:
def affine_sigmoid_transform(x, a, b, c, ramp=rho):

  b = (b - 1./(2*a))

  # Here we apply a rescaling by a and b first, before rho
  diff = x - b
  zs = jnp.stack([diff, -b*jnp.ones_like(x), 1 - b*jnp.ones_like(x)],axis=0)
  zs = zs * a

  def sigma(x):
    return rho(x) / (rho(x) + rho(1 - x))

  y, y0, y1 = sigma(zs)

  # This is the equivalent of f
  y = (y - y0)/ (y1 - y0)

  y = y*(1-c) + c *x

  return y
Here is the result for an exponential function and the function defined above \o/

But this function is exactly the same that the one describe in the paper right ? And I'm not sure that those constraints :

    log_a_bound=4
    min_density_lower_bound=1e-4

    log_a   = jnp.exp(jax.nn.tanh(hk.Linear(output_units)(net)) * log_a_bound)
    b   = jax.nn.sigmoid(hk.Linear(output_units)(net))
    c   = min_density_lower_bound + jax.nn.sigmoid(hk.Linear(output_units)(net)) * (1 - min_density_lower_bound)

solved the pb. Just to be sure, I tried the nf (notebooks/score_matching/NF_implicit.ipynb) with rho(x) = x^6 and rho(x) = exp(-1/(x^2) (f is not a bijection on [0,1] for these rho :

) and the nf crashed. I think that it works for rho(x) = x^5 because as Benjamin mentioned f is a bijection if k is odd. (Or maybe I'm just super confused ^^')

EiffL commented 2 years ago

So I don't know if you have tried it as well, but here is a notebook to try the different bijectors from bgflow for comparison: https://colab.research.google.com/drive/1ar7QkUSPxMfMnM4P8SC4RpOp2MK2fgO3?usp=sharing

Justinezgh commented 2 years ago

This is the reason why I have always 'nan' in my loss : And I don't think that I can have a better inverse with sympy

Justinezgh commented 2 years ago

So I don't know if you have tried it as well, but here is a notebook to try the different bijectors from bgflow for comparison: https://colab.research.google.com/drive/1ar7QkUSPxMfMnM4P8SC4RpOp2MK2fgO3?usp=sharing

I will look at it, thanks :)

EiffL commented 2 years ago

Hummmm I see.... Well I think maybe there is something wrong with our bijector not being exactly what it should be, which could explain why your inverse fails outside a given range.

But more generally, now that we have an implicit inverse that seems to work maybe we don't need to spend too much time on the analytic inverse, which we know is going to be limited to some special cases.

Whatever the way to achieve it, our goal is to have a normalizing flow that we can train from its score.

Justinezgh commented 2 years ago

yup, I tried with strong constraints on a and the results are not good at all since we lose a lot of expressivity. Even if I increase the number of coupling layers

EiffL commented 2 years ago

I think.... we can close this issue now, right?

astrodeepnet / sbi_experiments

Implement a C^k with k>2 coupling layer with analytic inverse. #6