datduong / NLPMethods2CompareGOterms

Natural language processing methods to compare 2 Gene Ontology terms

Understanding of w2vGO model #2

Open TaoDFang opened 6 years ago

TaoDFang commented 6 years ago

Hi guys,

Can I ask you something about the Word2vec model implementation? Here is the algorithm you introduced in the paper:

[image: algorithm from the paper]

And below is the key code implementing this algorithm:

import numpy as np

score = w2v2GoTerms ( pair[0], pair[1], annotationBP, hausdorffDistMod1to2Wted, model )

def w2v2GoTerms (goId1,goId2,goAnnot,func2measure,model):
    ## goId1, goId2: real GO ids without the "GO:" prefix
    ## func2measure: a function to compare 2 sentences, for example sim2Sentences(s1,s2,model)
    return func2measure(goAnnot[goId1],goAnnot[goId2],model)

def hausdorffDistMod1to2Wted (v1,v2,model):
    ## compare each word of sentence v1 to the whole sentence v2
    v1tov2 = np.array ( map(lambda x: findWordBestMatchWted(v2,x,model)[1:3], v1) )
    ## each row of v1tov2 is [best similarity score, weight of the pair]
    return np.average(v1tov2[:,0],weights=v1tov2[:,1])

def findWordBestMatchWted(v1,w1,model):
    if w1 in v1:
        return [w1,1.0,infoContentOfWord(w1,model)**2]
    ## w1 does not appear in the sentence, so find its closest match
    wordArray = np.repeat(w1,len(v1)).tolist()
    ret = np.array ( map(model.similarity,wordArray,v1) ) ## plain cosine similarity, not weighted
    infoW1 = np.array ( map (lambda x: infoContentOfWord(x,model=model), wordArray) )
    infoV1 = np.array ( map (lambda x: infoContentOfWord(x,model=model), v1) )
    pairWt = infoV1 * infoW1 ## weight of the best-matching pair
    maxSim = np.max(ret)
    maxIndx = np.where(ret==maxSim)[0][0]
    maxWord = v1[maxIndx]
    maxInfo = pairWt[maxIndx]
    return [maxWord,maxSim,maxInfo]

def infoContentOfWord (w1,model):
    # return -1*np.log ( model.vocab[w1].count*1.0 / model.corpus_count )
    a = np.log(model.vocab[w1].count)
    b = np.log(model.corpus_count)
    return 1.0 - a/b

However, I am confused about the code. First, I don't understand why the function "infoContentOfWord" could return "-1*np.log( model.vocab[w1].count*1.0 / model.corpus_count )" (the commented-out line); that formula does not seem right to me. Second, Equation (1) has two terms in angle brackets, but the function "w2v2GoTerms" seems to compute only one of them. I checked the old (2017) version, and as far as I can see, both terms are computed there.

Would you please explain this a bit? I am really interested in your algorithm.

datduong commented 6 years ago

Hi Tao, I will get back to you shortly today or tomorrow. Thanks for your patience.

TaoDFang commented 6 years ago

Hi,

No worries. Thanks a lot for your reply and help!

datduong commented 6 years ago

Hi Tao, let me explain in a bit more detail how I compare the 2 sentences.

First, I treat each sentence as a set of words (ignoring the ordering of the words). Second, the problem then reduces to comparing 2 sets. I use the Hausdorff distance to compare the sets; however, I need to account for the fact that some words are very common (like "the", "a", "this"). I use the information content of each word to up-weight or down-weight its contribution to the sentence.

To compute the information content of a word, I use the formula in this paper: Sentence Similarity Based on Semantic Nets and Corpus Statistics by Yuhua Li et al. (Equation 12).
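For reference, my reading of Li et al.'s Equation 12 is I(w) = 1 - log(n + 1) / log(N + 1), where n is the corpus frequency of word w and N is the total number of words in the corpus. A minimal, self-contained sketch (note the +1 smoothing, which my infoContentOfWord above omits, using the raw vocabulary count and corpus size instead):

import numpy as np

# Information content per Li et al. (2006), Eq. 12: frequent words carry
# little information, rare words carry a lot. n = corpus frequency of the
# word, N = total number of words in the corpus. The +1 smoothing follows
# my reading of the paper; infoContentOfWord above drops it.
def info_content(n, N):
    return 1.0 - np.log(n + 1.0) / np.log(N + 1.0)

print(info_content(5e6, 1e8))  # a very common word: ~0.16
print(info_content(50, 1e8))   # a rare word:        ~0.79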

Next, hausdorffDistMod1to2Wted(Z,V) compares each word in sentence Z to sentence V, calling findWordBestMatchWted once per word of Z. For example, if sentence Z has 5 words and sentence V has 3 words, then hausdorffDistMod1to2Wted(Z,V) returns a weighted average of a vector of length 5. This is the first line of my Equation 1.

If you do hausdorffDistMod1to2Wted(V,Z), then the output is a weighted average of a vector of length 3. This is the second line of my Equation 1.

In other words, hausdorffDistMod1to2Wted(Z,V) returns the directed distance of set Z to set V, and this distance(Z,V) is in general not the same as distance(V,Z).

Finally, the function hausdorffDistModWted calls both hausdorffDistMod1to2Wted(Z,V) and hausdorffDistMod1to2Wted(V,Z) and combines the two directed scores.
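For concreteness, here is a minimal sketch of that symmetric combination. The plain average below is an assumption for illustration; the actual hausdorffDistModWted may combine the two directed scores differently (for example, weighting each direction by sentence length):

# Minimal sketch of the symmetric score. The plain average is an assumption;
# the real hausdorffDistModWted may weight the two directions differently.
def hausdorffDistModWtedSketch(Z, V, model):
    z_to_v = hausdorffDistMod1to2Wted(Z, V, model)  # directed score of Z to V
    v_to_z = hausdorffDistMod1to2Wted(V, Z, model)  # directed score of V to Z
    return 0.5 * (z_to_v + v_to_z)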

TaoDFang commented 6 years ago

Hi datduong,

Thanks a lot for your time and kind explanation.

If I understand you correctly, then the method you use to calculate the information content is different from the one you mentioned in the paper.

And I see that the function hausdorffDistModWted calls both hausdorffDistMod1to2Wted(Z,V) and hausdorffDistMod1to2Wted(V,Z). However, it seems that you never call hausdorffDistModWted; instead you call hausdorffDistMod1to2Wted just once.

As for the function findWordBestMatchWted, I notice you calculate a product term (pairWt = infoV1 * infoW1) as the weight. However, in the paper it seems you use the information content of a single word directly. I am still a little puzzled about this.

Looking forward to your reply.


[image attachment]


datduong commented 6 years ago

You're correct. Now I see why this is wrong. The main file w2vCompareGO.py calls w2v2GoTerms only once. For example, line 53 should instead say:

score = w2v2GoTerms ( pair[0], pair[1], annotationBP, hausdorffDistMod1to2Wted, model )
scoreBtoA = w2v2GoTerms ( pair[1], pair[0], annotationBP, hausdorffDistMod1to2Wted, model ) ## reverse the pairing
score = some-function-to-combine (score, scoreBtoA)

I will change the code. For the result section of the paper I did not use this script; I had another script that directly reads in the genes and their GO annotations. It was easier to work directly at the gene level instead of the GO-term level, because at the gene level I was able to parallelize over many gene pairs. Also, each gene pair has only a small score matrix for its GO terms, so I did not need to compute one very large score matrix over all GO terms.

I use the function w2v2GoSetsArr in the file func2getSimOf2GoTermsV2.py. There you will see that the inputs are sets of GO terms (not 2 single GO terms). I will post more code that works directly at the gene level very soon.
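As a rough illustration of the gene-level idea (not the exact logic of w2v2GoSetsArr), one common way to turn the small per-gene-pair score matrix into a single gene-level score is a best-match average over both directions:

import numpy as np

# Hypothetical sketch: score two sets of GO terms (the annotations of two
# genes) from the pairwise term-term scores, using a best-match average.
# w2v2GoSetsArr may combine the score matrix differently.
def geneLevelScoreSketch(goSet1, goSet2, goAnnot, model):
    # small score matrix: one row per GO term of gene 1, one column per GO term of gene 2
    M = np.array([[w2v2GoTerms(g1, g2, goAnnot, hausdorffDistMod1to2Wted, model)
                   for g2 in goSet2] for g1 in goSet1])
    # best match of every row and every column, then average the two directions
    return 0.5 * (M.max(axis=1).mean() + M.max(axis=0).mean())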

Thanks for catching this problem.

TaoDFang commented 6 years ago

Thanks so much for your reply. Now I completely understand your code except for one function; it is just the last question I raised in my previous comment. May I ask why you use the product of the information contents of the two words as the weight, instead of the information content of a single word?

datduong commented 6 years ago

Hi Tao, again, you're correct. The equation using only content(z) for word z is wrong; it should use the product content(z) * content(w). I experimented with different ways to weigh the words and found that content(z) * content(w) gives the best results. This strategy came from the observation that when I compare "the blue car" versus "this cat is sleeping", the weights should satisfy weight(the, this) < weight(the, cat).
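A toy illustration with made-up information-content values (stop words low, content words high) shows why the product behaves as desired:

# Hypothetical information-content values: stop words low, content words high.
ic = {"the": 0.05, "this": 0.05, "cat": 0.80}

weight_the_this = ic["the"] * ic["this"]  # 0.0025 -- two stop words, tiny weight
weight_the_cat  = ic["the"] * ic["cat"]   # 0.04   -- stop word + content word, larger
assert weight_the_this < weight_the_cat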

Thanks for catching this equation. I will fix the pdf.

TaoDFang commented 6 years ago

Hi Datduong, thanks a lot for your patience and kind reply. It helps me a lot!

datduong commented 6 years ago

No problem. Feel free to contact me if you need anything else. Thanks for trying the code.
