trainMode = 0: labelDoc format

zeyuanchen23 commented 5 years ago

Hi!

May I ask a question about trainMode 0? The documentation says the labels can be bags of features, so I think we can use the format below and set "-fileFormat labelDoc":

word_1 word_2 ... word_k <tab> label_1_word_1 label_1_word_2 ... <tab> label_r_word_1

In this case, each row will be a query sentence followed by a set of related sentences / documents. But in testing, why do we still need to provide a basedoc? Would you please give more instructions on the format of the basedoc in this case (e.g. what each row should look like)? Thanks!
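For reference, the row layout proposed in this question can be sketched in Python. This is only an illustration of the tab-separated format as written above, not code from StarSpace; the helper name `format_labeldoc_row` and the sample words are hypothetical.

```python
def format_labeldoc_row(query_words, label_sentences):
    """Build one row: query words, then each label sentence, separated by tabs.

    Words inside a sentence are joined with single spaces, matching the
    layout sketched in the question. The function name is hypothetical.
    """
    fields = [" ".join(query_words)]
    fields.extend(" ".join(words) for words in label_sentences)
    return "\t".join(fields)


# Hypothetical example: one query sentence with two related label sentences.
row = format_labeldoc_row(
    ["how", "are", "you"],
    [["i", "am", "fine"], ["doing", "well"]],
)
print(repr(row))
```

Words must not contain tabs themselves, since the tab is what separates the query from each label sentence.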

ledw commented 5 years ago

Hi @zeyuanchen23: sorry for the delay in responding. At test time, the model predicts labels by comparing the input against a set of label candidates. When the file format is labelDoc, the labels are bags of words, so you need to provide a basedoc: a collection of labels to compare against. The format of the basedoc: each line is one label candidate, which consists of label_1_word_1 label_1_word_2 ...
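A minimal sketch of how such a basedoc could be assembled, assuming the candidates are simply the distinct label sentences that appear after the first tab in labelDoc rows. Which labels belong in the candidate set is a choice for the user; the helper name `collect_label_candidates` and the sample rows are hypothetical.

```python
def collect_label_candidates(rows):
    """Return the distinct label sentences from labelDoc rows, in first-seen order.

    Each row is assumed to be: query <tab> label_1 <tab> ... <tab> label_r.
    The first field (the query) is skipped; every other field is one candidate.
    """
    seen = set()
    candidates = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        for label in fields[1:]:
            label = label.strip()
            if label and label not in seen:
                seen.add(label)
                candidates.append(label)
    return candidates


# Hypothetical rows in the format discussed above.
rows = [
    "how are you\ti am fine\tdoing well\n",
    "what is up\tnot much\ti am fine\n",
]
basedoc_lines = collect_label_candidates(rows)

# One label candidate per line, as described in the reply above.
basedoc_text = "\n".join(basedoc_lines) + "\n"
print(basedoc_text, end="")
```

Writing `basedoc_text` to a file would give one candidate per line; deduplicating is a design choice made here so that each candidate is ranked only once.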

zeyuanchen23 commented 5 years ago

Hi @ledw

Thanks for the clarification! I have a follow-up question about testing.

Suppose both the input and labels are sentences. Based on your comments, it seems that the model compares the input (i.e. a sentence that consists of several words) with the ground-truth label (i.e. another sentence) and all other candidate labels in the basedoc, and then ranks them according to their similarity to the input.

However, I found sometimes the model outputs sentences that are neither from the candidate set, nor from the ground-truth labels. For instance, one output could be a sentence that is similar to a candidate in basedoc, but with a few missing words. Is it because the model doesn't treat each sentence (i.e. a label) as a whole? Or is it because of some normalization issues? Btw, I turned off the normalizeText and set minCount to 1, but still have this issue.

Thanks for your help!

ledw commented 5 years ago

@zeyuanchen23 you're welcome. The output sentences should only come from the candidate set or the ground-truth labels. Could you give me some examples to look at? This should not happen if normalizeText is off and minCount is 1.
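One way to gather such examples is to compare the predicted sentences against the allowed set. The sketch below assumes the predictions have already been extracted into a list of strings (how to do that depends on the output being inspected and is not shown); the function name `find_unexpected_predictions` and the sample data are hypothetical.

```python
def normalize_spaces(sentence):
    """Collapse runs of whitespace so that spacing differences are ignored."""
    return " ".join(sentence.split())


def find_unexpected_predictions(predictions, candidates):
    """Return predictions that match no candidate sentence exactly.

    `candidates` should hold both the basedoc lines and the ground-truth
    labels, since either may legitimately appear in the output.
    """
    allowed = {normalize_spaces(c) for c in candidates}
    return [p for p in predictions if normalize_spaces(p) not in allowed]


# Hypothetical data: the second prediction is a candidate with a word missing.
candidates = ["i am fine", "doing well", "not much"]
predictions = ["i am  fine", "i fine", "not much"]
unexpected = find_unexpected_predictions(predictions, candidates)
print(unexpected)
```

Any sentence reported here, together with the candidate it most resembles, would be a concrete case to post in the thread.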

ledw commented 5 years ago

Closing this since there have been no recent updates. @zeyuanchen23 feel free to re-open.

facebookresearch / StarSpace
