facebookresearch / PyTorch-BigGraph

Generate embeddings from large-scale graph-structured data.
https://torchbiggraph.readthedocs.io/

Idea: Double sided operators and sequence of operators #52

Open simonepri opened 5 years ago

simonepri commented 5 years ago

Have you explored the idea of providing the ability to specify more than one operation for each relation?

Something like this:

'operators': [{
  'type': 'linear',
  'side': 'both', # Apply this operator on both lhs and rhs
}, {
  'type': 'translation',
  'side': 'left',  # Apply this operator on lhs only
}]

To give some context, my final goal is to model TransD. In TransD the embeddings of the entities need to be projected into another embedding space before having the translation operator applied.
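For reference, these are the TransD projections and score from Ji et al. (2015), as I understand them; the projection matrices depend on both the relation's projection vector r_p and the entity's projection vector h_p (or t_p):

M_rh = r_p h_p^T + I,    M_rt = r_p t_p^T + I
h_perp = M_rh h,         t_perp = M_rt t
f_r(h, t) = -|| h_perp + r - t_perp ||_2^2

Only after this projection is the usual translation by r applied.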



I've forked the repo and done part of the changes needed to achieve this, but I'd like to discuss whether there's interest in having something like this in BigGraph.

lw commented 5 years ago

I wasn't familiar with TransD until now, so all I know is from the equations you posted.

In its simplest form, a "sequence" of operators can just be implemented as a custom operator that internally applies all the sub-operators in turn. However, even in that case, and even with custom comparators, I still think it's hard to implement the TransD model "naturally". The primary reason is that if you want to do the calculation in the comparator you don't really have access to the relation parameters, while if you want to do it inside the operator you can only do it on one side.
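To make the first point concrete, a composite operator could look roughly like this (the ChainOperator name is hypothetical; it assumes the AbstractOperator interface as it lived in torchbiggraph.model at the time, where forward() just maps an embedding tensor to an embedding tensor; module paths may differ in other versions):

import torch
from torch import nn

# Hypothetical sketch, not part of PBG.
from torchbiggraph.model import AbstractOperator, FloatTensorType


class ChainOperator(AbstractOperator):
    """Applies a fixed list of sub-operators one after the other."""

    def __init__(self, dim: int, sub_operators):
        super().__init__(dim)
        # Register the sub-operators so their parameters are trained as well.
        self.sub_operators = nn.ModuleList(sub_operators)

    def forward(self, embeddings: FloatTensorType) -> FloatTensorType:
        for op in self.sub_operators:
            embeddings = op(embeddings)
        return embeddings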

So, if we want to be able to support this, some "deeper" invariant/assumption of PBG needs to be challenged or removed. I tried to put some of these principles down in writing; these are the ones I came up with:

I'm not sure which one(s) we need to break in order to support the TransD model. Also, I don't know how deeply rooted each of these is, and whether the effort to eradicate them is actually worth it. It appears to me that your original proposal wants to break the first two principles.

On the other hand, I consider it ok to encode separate "structured" data in the embedding tensors, and have some components interpret it in particular ways. For example, our complex diagonal operator interprets the first D/2 dimensions as the real coordinates and the other ones as the imaginary coordinates. So one approach here could be to also encode h and h_p in the same vector and have the operators and/or the comparators break them apart and use them in different ways.
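As a toy illustration of that convention (a simplified sketch, not the actual PBG implementation), the complex-diagonal-style split looks roughly like this:

import torch

# Simplified sketch of the "structured data packed in one tensor" convention:
# the first D/2 entries are treated as real parts, the last D/2 as imaginary
# parts, and the operator multiplies each complex coordinate by a per-dimension
# complex parameter (real + i*imag).
def complex_diagonal(embeddings: torch.Tensor, real: torch.Tensor, imag: torch.Tensor) -> torch.Tensor:
    dim = embeddings.shape[-1]
    a, b = embeddings[..., :dim // 2], embeddings[..., dim // 2:]
    # (a + ib) * (real + i*imag), re-packed into the same [real ; imag] layout.
    return torch.cat([a * real - b * imag, a * imag + b * real], dim=-1)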

simonepri commented 5 years ago

It appears to me that your original proposal wants to break the first two principles.

Mostly the second, I would say. (At least in TransD, the entity and relation spaces can have the same dimensions.) The original thought was that having the possibility to define multiple operators would have made the framework more flexible, but I understand that this can break many assumptions that were made in the implementation. (Indeed, I had problems implementing it.)

I'm not sure which one(s) we need to break in order to support the TransD model. Also, I don't know how deeply rooted each of these is, and whether the effort to eradicate them is actually worth it.

What about introducing a new pre-operator that is symmetric only? (Any suggestions for a better name are welcome.) That way, none of the principles are touched, and we just need to apply the symmetric pre-operator with adjust_embs before applying the "real" asymmetric operator. This approach has the advantage of not introducing any breaking change in the config structure.
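Roughly, the relation config could then look like this (the pre_operator key and the projection name are hypothetical, just to illustrate the idea; the other keys are the usual relation config fields):

'relations': [{
  'name': 'rel_name',
  'lhs': 'entity_type',
  'rhs': 'entity_type',
  'operator': 'translation',      # the usual asymmetric operator, applied to one side only
  'pre_operator': 'projection',   # hypothetical key: symmetric, applied to both sides via adjust_embs
}]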

Here you can find a draft of the changes required. (At the moment, if dynamic_relations is true, it throws Shape doesn't match: (1000) != (1, 109108) on this line.)

On the other hand, I consider it ok to encode separate "structured" data in the embedding tensors, and have some components interpret it in particular ways. For example, our complex diagonal operator interprets the first D/2 dimensions as the real coordinates and the other ones as the imaginary coordinates. So one approach here could be to also encode h and h_p in the same vector and have the operators and/or the comparators break them apart and use them in different ways.

That's indeed exactly what I've done to implement TransD's projection operator:

import torch
from torch import nn

# Imports below assume the module layout of PBG at the time of this thread
# (these symbols lived in torchbiggraph.model); paths may differ in other versions.
from torchbiggraph.model import (
    AbstractDynamicOperator,
    AbstractOperator,
    FloatTensorType,
    LongTensorType,
    match_shape,
)


class ProjectionOperator(AbstractOperator):

    def __init__(self, dim: int):
        super().__init__(dim)
        if dim % 2 != 0:
            raise ValueError("Need even dimension as 1st half is the real "
                             "embedding and 2nd half is projection vector")
        # Relation-side projection vector (r_p) for this operator instance.
        self.rproj = nn.Parameter(torch.zeros((self.dim // 2,)))

    def forward(self, embeddings: FloatTensorType) -> FloatTensorType:
        match_shape(embeddings, ..., self.dim)
        # Each embedding packs [e ; e_p]: the entity vector and its projection vector.
        emb = embeddings[..., :self.dim // 2]
        eproj = embeddings[..., self.dim // 2:]
        # Projected entity (e_p . r_p) * e + e, written into the 2nd half of an
        # otherwise-zero tensor so the output keeps the full dimension.
        proj = torch.zeros_like(embeddings)
        proj[..., self.dim // 2:] = torch.sum(eproj * self.rproj, dim=-1, keepdim=True) * emb + emb
        return proj


class ProjectionDynamicOperator(AbstractDynamicOperator):

    def __init__(self, dim: int, num_operations: int):
        super().__init__(dim, num_operations)
        if dim % 2 != 0:
            raise ValueError("Need even dimension as 1st half is the real "
                             "embedding and 2nd half is projection vector")
        # One projection vector (r_p) per relation type.
        self.rprojs = nn.Parameter(torch.zeros((self.num_operations, self.dim // 2)))

    def forward(
        self,
        embeddings: FloatTensorType,
        operator_idxs: LongTensorType,
    ) -> FloatTensorType:
        match_shape(embeddings, ..., self.dim)
        match_shape(operator_idxs, *embeddings.size()[:-1])
        emb = embeddings[..., :self.dim // 2]
        eproj = embeddings[..., self.dim // 2:]
        # Pick the projection vector of each edge's relation type.
        rproj = self.rprojs[operator_idxs]
        proj = torch.zeros_like(embeddings)
        proj[..., self.dim // 2:] = torch.sum(eproj * rproj, dim=-1, keepdim=True) * emb + emb
        return proj

I wasn't familiar with TransD until now, so all I know is from the equations you posted.

If you are interested, have a look at this article, where you can find a list of many models, most of which require a symmetric operator that depends on the relation (TransH, TransR, CTransR) or on both the relation and the entity (TransD).
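For instance, as far as I recall from the literature, the relation-dependent projections of TransH and TransR are:

TransH: h_perp = h - (w_r^T h) w_r,  with ||w_r||_2 = 1
TransR: h_r = M_r h,  with a per-relation matrix M_r

and in both cases the same projection is applied to t before the translation by r, so a symmetric, relation-dependent pre-operator would cover them.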

lw commented 5 years ago

I thought about this for a while and I see one fundamental issue with introducing operators on the left-hand side of edges (even if they are just "pre-operators"), namely that they would significantly degrade the experience with dynamic relations. With dynamic relations enabled, a single batch contains edges of multiple relation types. Within each chunk of a batch (i.e., the "sub-batching" we use for negative sampling), each entity on each side is combined with every other entity on the other side to produce a negative edge. If done naively, this means that to each entity of a batch we may need to apply as many operators as there are entities in the chunk. In order to, instead, only apply one operator to each entity, we use an odd negative sampling mechanism for dynamic relations, where we introduce fictitious "reversed" relation types and, for each of the forward and backward relation types, we only sample negatives on the side on which no operators are applied. More details can be found in the docs and in the code:

https://github.com/facebookresearch/PyTorch-BigGraph/blob/9bd729d87d6370cced6f6ff56ecff7ebc335f5ab/torchbiggraph/model.py#L908-L914

https://github.com/facebookresearch/PyTorch-BigGraph/blob/9bd729d87d6370cced6f6ff56ecff7ebc335f5ab/torchbiggraph/model.py#L939-L953

In short, supporting the dynamic relations mode as a first-class citizen is important to us, and I'm worried that any way of solving this issue would end up performing worse there than "standard" batching does. If you want to keep going with this I'd have to see a proof-of-concept showing either that the performance degradation is absent or that the gains in model accuracy are significant enough to warrant it.
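To put rough numbers on the cost argument, here is an illustrative (non-PBG) sketch of the operator-application counts per chunk, assuming B positives each compared against N in-chunk negatives:

# Illustrative only, not PBG code.
def naive_operator_applications(chunk_size: int, num_negatives: int) -> int:
    # Operator applied on the side where negatives are sampled: each positive's
    # relation operator must be applied to every one of its negatives.
    return chunk_size * num_negatives

def reversed_relation_applications(chunk_size: int) -> int:
    # Operator applied only on the side that stays positive (the reversed
    # relation types handle the other direction): one application per edge.
    return chunk_size

print(naive_operator_applications(1000, 1000))   # 1000000
print(reversed_relation_applications(1000))      # 1000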

simonepri commented 5 years ago

If you want to keep going with this I'd have to see a proof-of-concept

I'm trying to do so, but prepare_negatives is causing me some trouble, since I'm not fully aware of all the choices made for the various sampling strategies. I would really appreciate your help on this.

From what I understood reading the code, prepare_negatives is used to:

1) Sample negatives from the side where the operator is NOT applied
2) Prepare the negatives for the comparator
3) Create a mask to ignore some scores

Why is the function adjust_embs not applied to the sampled embeddings when the sampling strategy is UNIFORM? To me, it looks like when UNIFORM is used the negative embeddings are not "prepared" using the prepare function of the comparator. https://github.com/facebookresearch/PyTorch-BigGraph/blob/bd3b22c3cd612db29d04a427ee4bd9bca019b0df/torchbiggraph/model.py#L806-L809

Can you please explain what this try/except block is used for, and why the positive embeddings are concatenated with the sampled negatives? https://github.com/facebookresearch/PyTorch-BigGraph/blob/bd3b22c3cd612db29d04a427ee4bd9bca019b0df/torchbiggraph/model.py#L809-L824

lw commented 5 years ago

As you realized, the lack of "preparation" for the uniform negatives was a bug (#70).

In general, prepare_negatives is used to determine which entities/embeddings will be the negatives on both sides (we call the method twice, once per side). Leaving aside the "all negatives" mode, which is its own beast, there are two ways of producing negatives: "batch", which reuses the positives of the current batch as negatives, and "uniform", which samples them randomly. You can use either of them, neither, or both. When we use both, we concatenate the positives (which are also the batch negatives) with the uniform negatives to obtain the set of all negatives of both types. We also produce a mask, which we later use to ignore some scores, because some "negative" scores may in fact come from positive edges (since we reuse positive entities as negatives) and thus shouldn't be counted.

The reason all this code is so convoluted (and why it's hard to "extract" it into a separate negative sampler class) is that it needs to call the adjust_embs method, but only on the "fresh" embeddings, i.e. the uniformly-sampled ones, because the batch negatives have already been adjusted (since they are positives). This creates a lot of interactions that are hard to untangle.
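A heavily simplified sketch of the shape of that logic (not the real prepare_negatives code; it ignores chunking, dynamic relations and the actual tensor layouts, and adjust_fn stands in for what PBG does via adjust_embs / the comparator's prepare step):

import torch

def build_negatives(adjusted_pos_embs, fresh_uniform_embs, adjust_fn):
    # The batch positives have already been adjusted upstream, so the
    # adjustment is only applied to the freshly sampled embeddings.
    fresh = adjust_fn(fresh_uniform_embs)
    negatives = torch.cat([adjusted_pos_embs, fresh], dim=0)

    num_pos = adjusted_pos_embs.shape[0]
    num_neg = negatives.shape[0]
    # mask[i, j] is False where "negative" j is positive i itself (a positive
    # reused as its own negative), so that score is later ignored.
    mask = torch.ones(num_pos, num_neg, dtype=torch.bool)
    mask[torch.arange(num_pos), torch.arange(num_pos)] = False
    return negatives, mask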

Hope this helps in guiding you; if I forgot to answer something, let me know.

Curious-Geek commented 4 years ago

Let's say I have an edge like this: (x1, r, y1). Is it possible to fix the embedding for x1 and generate embeddings only for y1?

lw commented 4 years ago

@Curious-Geek As yours is a separate question, could you open a new issue for it?