facebookresearch / StarSpace

Learning embeddings for classification, retrieval and ranking.
MIT License

Add weight importance to file format #99

Closed pommedeterresautee closed 6 years ago

pommedeterresautee commented 6 years ago

In many cases, it makes sense to say that one observation is more important than another. Right now, there is no obvious way to provide this information. Would it be possible to add such a feature? It would look like an optional keyword added to each sample (for instance __weight__:2.5). The effect would be to increase the loss (and the correction) for that sample.

What do you think?

jaseweston commented 6 years ago

Yes, it is possible to add, but I am not sure we intend to support this right now (option overload) unless many people request it. Note that you could also do this right now by replicating examples in the training file in proportion to their weight (it's a bit of a pain and wastes memory, but it works).
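The replication workaround could be sketched like this, assuming a hypothetical helper that expands weighted lines into a plain training file (StarSpace itself just reads the resulting file). Fractional weights are only approximated, since they get rounded to a copy count:

```python
def replicate_by_weight(lines_with_weights):
    """Expand (line, weight) pairs into plain training-file lines.

    Each line is duplicated round(weight) times, so a fractional
    weight like 2.5 is only approximated by an integer copy count.
    """
    out = []
    for line, weight in lines_with_weights:
        out.extend([line] * max(0, round(weight)))
    return out

examples = [
    ("word_a word_b __label__x", 3),  # 3x more important
    ("word_c word_d __label__y", 1),
]
print("\n".join(replicate_by_weight(examples)))  # 4 lines total
```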

pommedeterresautee commented 6 years ago

I am not sure the proposed solution would be equivalent (multiplying a loss by 2 is not the same as seeing the same observation 2 times, because by the second time a first correction has already been applied).

Probably few users would request this feature if you don't explain its utility concretely in the examples (as you did with train modes). I've seen it on XGBoost (I am from the DMLC team): few users open requests about advanced features in general, maybe because most users on GitHub just play with tools but don't really use them (for now).

The usefulness of such a feature would be easy to demonstrate. For instance, content recommendation may take the price into account (or just the log of the price). Sentence similarity applied to query suggestion (I have tried it on our logs and it works much better than the classic graph approach) would take the interaction into account: a download of a content should weigh more than just opening a document, or maybe the cumulative dwell time on documents following a query would make a nice weight. For classification, the use of weights would be obvious with hierarchical classes (common in extreme classification), or simply because not all classes are created equal, etc.

In some way, with weights, you would give a direction to the similarity metric, for instance in trainMode = 3 (imagine in the extreme case one sample has a weight of 0 and the other a weight of 1). That would make trainMode = 4 obsolete (in particular if you drop all cases where the weight is <= 0, so there is no impact on performance): trainMode = 4 is just trainMode = 3 with one side having a weight of 0. And it would also be more general, as it's not just 2 labels but collections of words/labels...

I have another suggestion regarding option overload: you may want to remove normalizeText. It is dangerous to use, as during embedding prediction I have no idea what I should do to reproduce it (just lower-casing?), so I always disable it. Anyway, I have the feeling this kind of thing has to be done during dataset creation.

In query suggestion it would be a game changer to be able to bias the model toward requests which lead to clicks or downloads...

Without this feature, StarSpace would still be a great program to rapidly get an idea of what one can expect from a simple strategy, but it would require switching to MXNet/Torch... to be used in many real-life scenarios (this is true in my case for query suggestion, and I suppose in many other content-recommendation use cases).

ledw commented 6 years ago

@pommedeterresautee Thanks for the suggestion. We'll set the default setting of normalizeText to false (in most cases it's just lower casing, see details in src/utils/normalize.cpp).

We'll add the weighted example case in our list of features to implement.

pommedeterresautee commented 6 years ago

@ledw I was thinking that, to improve results of sentence similarity, it may help to provide some specific negative examples myself (instead of random ones), and realised that a negative weight may do the trick... Maybe you want to add that use case in some way to your examples when weighting is implemented.

ledw commented 6 years ago

@pommedeterresautee support for weighted examples added in https://github.com/facebookresearch/StarSpace/commit/57c6212e89be5789487fbb96e860f5193ecad7fc.

Weighted examples need to be in the following format: __weight__:xxx <tab> ...word_1 <tab> ... <tab> label_1 Note that it needs to have the prefix __weight__ followed by ':' and the weight value. The weight block needs to appear first in the line that contains one example.
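As a sketch, generating one weighted example per line in the format described above could look like this (the helper name and the tab-separated layout of words and labels are illustrative, following the description in this thread):

```python
def format_weighted_example(weight, words, labels):
    """Build one StarSpace training line: the __weight__ block first,
    then the tab-separated content of the example."""
    parts = [f"__weight__:{weight}"] + words + labels
    return "\t".join(parts)

line = format_weighted_example(2.5, ["word_1", "word_2"], ["__label__pos"])
print(line)  # __weight__:2.5	word_1	word_2	__label__pos
```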

pommedeterresautee commented 6 years ago

Thanks a lot for this feature!! I have started to read the source code but I am still not sure I understand how it works, in particular whether I may have several weights on one line when there are several examples on the line. Imagine I want to learn sentence similarities where sentences are queries. Some queries in a session are more important than others from the same session, so I want to make them more "attractive" than the others by increasing the loss attached to them; in some way they will attract the other queries to them. From my understanding I can have only one weight per line, so there is still no way for me to attach a weight per query. Am I right?

To illustrate my question: can I do __weight__:0.1 query_1_word_1 query_1_word_2 query_1_word_3 <tab> __weight__:0.9 query_2_word_1 query_2_word_2 <tab> __weight__:0.5 query_3_word_1 query_3_word_2 query_3_word_3 ?

My question is based on your remark:

The weight block needs to appear first in the line that contains one example.

The main point is to make the learning asymmetric. The way I understand the implemented feature, it makes the loss more or less important depending on the weight attached to a line, but on one line all examples are equal.

ledw commented 6 years ago

@pommedeterresautee You're welcome! Your understanding is correct: the weight is added at the example level, not the doc level. The reason is that with different trainModes, it becomes complicated to add weights at the doc level. I would suggest, in your use case, exploding your examples into multiple examples, each carrying a different weight. If you're unclear about what I mean by 'explode', please share your script and sample data with me, and I can see if we can reduce it to cases that can be handled by having weights at the example level.
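The "explode" suggestion could be sketched as follows for the query-session use case discussed above: one session holding several weighted queries is turned into many ordered two-query examples, each carrying on the whole line the weight of its target query. Function and variable names are illustrative only:

```python
from itertools import permutations

def explode_session(queries):
    """queries: list of (query_text, weight) from one session.

    Emit one training line per ordered pair (src, dst), weighted by
    the destination query's weight, so A -> B and B -> A differ."""
    lines = []
    for (src, _), (dst, w_dst) in permutations(queries, 2):
        lines.append(f"__weight__:{w_dst}\t{src}\t{dst}")
    return lines

session = [("q one", 0.1), ("q two", 0.9), ("q three", 0.5)]
for line in explode_session(session):  # 3 * 2 = 6 lines
    print(line)
```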

pommedeterresautee commented 6 years ago

I was thinking of this solution too, but reproducing this way of working means that if a session has for instance 40 queries, I need to generate 40 * 39 = 1560 ordered pairs of queries (A -> B != B -> A because of the different weights). This may lead to a really, really big file and require a change in the way train files are created (so far everything is in RAM on a big server with lots of RAM). I have not tried yet, but it's even possible it will require sub-sampling the dataset because of the hard-drive limit. When examples are separated by a tab character, in which train mode is there a difficulty in putting a weight per block of text? In particular, with the labelDoc format, it would mean that the weight could be put anywhere, wouldn't it?
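The blow-up described above is the ordered-pair count n * (n - 1), and capping it could be sketched with a hypothetical sub-sampling helper like this:

```python
import random

def sample_pairs(n_queries, budget, seed=0):
    """All ordered pairs of n queries is n * (n - 1); keep at most
    `budget` of them, sampled reproducibly, to bound the file size."""
    pairs = [(i, j) for i in range(n_queries)
             for j in range(n_queries) if i != j]
    if len(pairs) <= budget:
        return pairs
    return random.Random(seed).sample(pairs, budget)

assert 40 * 39 == 1560          # the session size mentioned above
print(len(sample_pairs(40, 100)))  # 100: capped to the budget
```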

It seems that putting the weight per example makes sense for almost all trainModes:

pommedeterresautee commented 6 years ago

@jaseweston @ledw hi, is there another way to achieve the same results without generating a huge file?

jaseweston commented 6 years ago

@ledw -- can we reconsider this?

ledw commented 6 years ago

@pommedeterresautee Thanks for the suggestions. Apologies for the delay in replying. For integrating different weights for sentences in the different trainModes, I have two questions:

  1. For instance, in trainMode 1, if we use a sentence weight plus a word weight for each word, then in a combined document representation, would each word have the weight sum(sentence_weight * word_weight_in_sentence)?
  2. In negative sampling: currently we do negative sampling by taking a random sentence from an example (i.e. trainMode 1, 3). In those cases, should the negatives use the sentence weight or not?

pommedeterresautee commented 6 years ago

If I understand you correctly, you are wondering what to do when you have both a global weight for the example and a weight for a word (in trainMode 3 for instance)? (In my understanding, sentence weight = the weight as done today, related to the whole line; word weight = a weight applied to only one label in particular, when labelDoc is on.)

If my understanding is correct, IMO you shouldn't have both weights at the same time. Otherwise it would be complex for the user and not very useful; plus, how would we recognize whether it's a sentence weight or the weight of the first sequence/word?

In general:

In particular, regarding your propositions:

  1. trainMode 1: you can't have both weights at the same time.
  2. Negative sampling: when there is a global weight (labelDoc off), the global weight is used; otherwise it's the weight of the picked negative sample which is used.

Does it make sense?
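The rule proposed above for negative sampling could be summarized in a tiny sketch (the function and parameter names are hypothetical, not StarSpace's internals): with labelDoc off the example's global weight applies to the sampled negative, and with labelDoc on the negative's own weight applies.

```python
def negative_weight(label_doc_on, global_weight, sampled_negative_weight):
    """Weight applied to a sampled negative under the proposed rule."""
    return sampled_negative_weight if label_doc_on else global_weight

print(negative_weight(False, 2.0, 0.5))  # 2.0: global weight wins
print(negative_weight(True, 2.0, 0.5))   # 0.5: per-sample weight wins
```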

ledw commented 6 years ago

@pommedeterresautee thanks for the explanation. I think that makes sense. I'll change the labelDoc case so we can handle the sentence weight.

ncammarata commented 6 years ago

I just wanted to say that I had the same request, and was grateful to see this thoughtful discussion and execution.

pommedeterresautee commented 6 years ago

@ncammarata may I ask what your use case is? It may give everybody ideas on how to leverage StarSpace representations.

ncammarata commented 6 years ago

@pommedeterresautee I'm interested in using it for a network scenario. I want to be able to construct a graph of the system I'm working with, which in my case is connected operating systems with shared files, and arbitrarily make predictions within it. So I can say that person X looked at file Y for Z minutes, and ask what other files I should prioritize for their viewing.

Except I can basically choose any part of my graph to make these kinds of predictions. I haven't actually done any work with StarSpace on this yet, but that's my desired outcome, and I was happy to see weighted connections looking possible.

Does that make sense?

ledw commented 6 years ago

Closing the issue as weight importance is added.