TaoDFang opened this issue 6 years ago
Hi Tao, I will get back to you shortly today or tomorrow. Thanks for your patience.
Hi,
No worries. Thanks a lot for your reply and help!
Hi Tao, let me explain in a bit more detail how I compare the 2 sentences.
First, I treat each sentence as a set of words (ignoring the ordering of the words). Second, the problem then reduces to comparing 2 sets. I use the Hausdorff distance to compare the sets; however, I need to account for the fact that some words are very common (like "the", "a", "this"). I use the information content of each word to upweight or downweight its contribution to the sentence.
To compute the information content of a word, I use the formula in this paper: Sentence Similarity Based on Semantic Nets and Corpus Statistics by Yuhua Li et al. (Equation 12).
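Equation 12 of Li et al. (2006) defines the information content of a word from its corpus frequency. Here is a minimal sketch of that formula (the function name `info_content` and the example counts are mine, for illustration; they are not from the repo):

```python
import math

def info_content(word_count, corpus_size):
    """Information content per Li et al. (2006), Eq. 12:
    I(w) = 1 - log(n + 1) / log(N + 1),
    where n is the word's corpus frequency and N is the total number
    of words in the corpus. Common words ("the", "a") score near 0;
    rare words score near 1."""
    return 1.0 - math.log(word_count + 1) / math.log(corpus_size + 1)

# A frequent word contributes less weight than a rare one:
common = info_content(1_000_000, 10_000_000)  # e.g., "the"
rare = info_content(5, 10_000_000)            # e.g., a rare term
```

This is what lets the distance computation below downweight stop words without a hand-maintained stop-word list.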
Next, the function `findWordBestMatchWted` compares each word in sentence Z to sentence V. For example, if sentence Z has 5 words and sentence V has 3 words, then the output of `findWordBestMatchWted(Z,V)` is an average of a vector of length 5. `findWordBestMatchWted(Z,V)` is the first line of my Equation 1. If you call `findWordBestMatchWted(V,Z)`, then the output is an average of a vector of length 3. `findWordBestMatchWted(V,Z)` is the second line of my Equation 1.
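The described behavior can be sketched as follows. This is an illustrative reconstruction, not the repo's `findWordBestMatchWted`; `sim` (word similarity) and `info` (information content) are hypothetical placeholder arguments:

```python
def find_word_best_match_weighted(Z, V, sim, info):
    """For each word z in sentence Z, find its best-matching word in V
    by similarity, weight that matched score by the product of the two
    words' information contents, and return the weighted average.
    This corresponds to one line of Equation 1; swapping (Z, V)
    gives the other line."""
    num = den = 0.0
    for z in Z:
        v_best = max(V, key=lambda v: sim(z, v))  # best match of z in V
        w = info(z) * info(v_best)                # pair weight
        num += w * sim(z, v_best)
        den += w
    return num / den if den else 0.0

# toy demo: exact-match similarity, uniform information content
toy_sim = lambda a, b: 1.0 if a == b else 0.0
toy_info = lambda w: 1.0
demo = find_word_best_match_weighted(["the", "cat"], ["cat", "dog"],
                                     toy_sim, toy_info)  # -> 0.5
```

Note that the output vector has one entry per word of the first argument, which is why `(Z,V)` and `(V,Z)` average vectors of different lengths.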
The `hausdorffDistMod1to2Wted` function calls `findWordBestMatchWted`. `hausdorffDistMod1to2Wted(Z,V)` returns the distance of set Z to set V. distance(Z,V) is not the same as distance(V,Z), by definition of the Hausdorff distance. Finally, the function `hausdorffDistModWted` calls both `hausdorffDistMod1to2Wted(Z,V)` and `hausdorffDistMod1to2Wted(V,Z)`.
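To illustrate why distance(Z,V) differs from distance(V,Z), here is the classical (unweighted) Hausdorff distance, as a simplified stand-in for the weighted, averaged variant in the repo. The function names and the toy numeric distance are mine:

```python
def hausdorff_1to2(Z, V, dist):
    """Directed distance from set Z to set V: for each z, take the
    distance to its nearest element of V, then take the worst case.
    This direction is not symmetric."""
    return max(min(dist(z, v) for v in V) for z in Z)

def hausdorff(Z, V, dist):
    """Symmetric Hausdorff distance: combine both directions."""
    return max(hausdorff_1to2(Z, V, dist), hausdorff_1to2(V, Z, dist))

# toy demo on numbers, with absolute difference as the distance
d = lambda a, b: abs(a - b)
forward = hausdorff_1to2([1, 2], [2, 5], d)   # -> 1
backward = hausdorff_1to2([2, 5], [1, 2], d)  # -> 3
```

The asymmetry (1 vs 3 in the demo) is exactly why `hausdorffDistModWted` must call `hausdorffDistMod1to2Wted` in both directions.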
Hi datduong,
Thanks a lot for your time and kind explanation.
If I understand you correctly, then the method you used to calculate the information content is different from the one you mentioned in the paper.
I also see that the function `hausdorffDistModWted` calls both `hausdorffDistMod1to2Wted(Z,V)` and `hausdorffDistMod1to2Wted(V,Z)`. However, it seems that you never call `hausdorffDistModWted`; instead you call `hausdorffDistMod1to2Wted` just once.
As for the function `findWordBestMatchWted`, I notice you calculate a product term (`pairWt = infoV1 * infoW1`) as the weight. However, in the paper, it seems you used the information content of a single word directly. I am still a little puzzled about this.
Looking forward to your reply.
You're correct. Now I see why this is wrong. The main file `w2vCompareGO.py` calls `w2v2GoTerms` only once. For example, line 53 should read:

```python
score = w2v2GoTerms ( pair[0], pair[1], annotationBP, hausdorffDistMod1to2Wted, model )
scoreBtoA = w2v2GoTerms ( pair[1], pair[0], annotationBP, hausdorffDistMod1to2Wted, model ) ## to reverse the pairing
score = some_function_to_combine ( score, scoreBtoA )
```
I will change the code. In the results section, I didn't use this script. I had another script that directly reads in the genes and the GO annotations. It was easier to work directly at the gene level instead of the GO-term level, because at the gene level I was able to parallelize over many gene pairs. Also, each gene pair has only a very small score matrix (e.g., 2x2) for its GO terms. This way, I did not need to compute one very large score matrix over all the GO terms.
I use the function `w2v2GoSetsArr` in the file `func2getSimOf2GoTermsV2.py`. There you will see that the inputs are sets of GO terms (not 2 single GO terms). I will post another script that works directly at the gene level very soon.
Thanks for catching this problem.
Thanks so much for your reply. Now I fully understand your code except for one function: the last question I raised in my previous comment. May I ask why you use the product of the information contents of two words as the weight, instead of the information content of a single word?
Hi Tao, again, you're correct. The equation using only content(z) for a word z is wrong. It should use the product content(z) * content(w). I experimented with different ways to weigh the words, and I found that the product content(z) * content(w) gives the best results. This strategy came from the observation that when I compare "the blue car" versus "this cat is sleeping", I should choose the weights so that weight(the, this) < weight(the, cat).
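The weight(the, this) < weight(the, cat) behavior falls out of the product weighting, since "the" and "this" are both frequent while "cat" is rarer. A quick check using Eq. 12 of Li et al. for the per-word information content (the corpus counts here are invented, purely for illustration):

```python
import math

def info_content(count, corpus_size):
    # Li et al. (2006), Eq. 12: I(w) = 1 - log(n + 1) / log(N + 1)
    return 1.0 - math.log(count + 1) / math.log(corpus_size + 1)

# hypothetical corpus counts, chosen only to illustrate the ordering
N = 10_000_000
counts = {"the": 600_000, "this": 300_000, "cat": 800}

def pair_weight(w1, w2):
    # product of the two words' information contents, as described above
    return info_content(counts[w1], N) * info_content(counts[w2], N)
```

With any counts where stop words dominate, `pair_weight("the", "this")` comes out smaller than `pair_weight("the", "cat")`, so a stop-word-to-stop-word match contributes little to the sentence score.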
Thanks for catching this equation. I will fix the pdf.
Hi Datduong, thanks a lot for your patience and kind reply. It helps me a lot!
No problem. Feel free to contact me if you need anything else. Thanks for trying the code.
Hi guys,
Can I ask you something about the Word2vec model implementation? Here is the algorithm you introduced in the paper:
And below is the key code that implements this algorithm:
However, I am confused about the code. First, I don't understand why the function `infoContentOfWord` returns `-1*np.log( model.vocab[w1].count*1.0 / model.corpus_count )`; it did not seem right to me. Second, in Equation (1), there are two terms in angle brackets, but in the function `w2v2GoTerms` it seems only one term is calculated. I checked the old version (2017), and as far as I can see, both of the two terms are calculated there.
Would you please explain this a bit? I am really interested in your algorithm.
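For reference, the expression in `infoContentOfWord` computes the negative log of a word's relative corpus frequency (a surprisal-style information content, different from Li et al.'s Eq. 12). A runnable sketch with a mock gensim-style model (the mock objects and counts are mine, for illustration; a real word2vec model would supply `vocab` and `corpus_count`):

```python
import numpy as np
from types import SimpleNamespace

# mock gensim-style model, for illustration only
model = SimpleNamespace(
    vocab={
        "the": SimpleNamespace(count=600_000),
        "cat": SimpleNamespace(count=800),
    },
    corpus_count=10_000_000,
)

def info_content_of_word(model, w1):
    # -log of the word's relative frequency: rarer words score higher
    return -1 * np.log(model.vocab[w1].count * 1.0 / model.corpus_count)
```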