Closed matthias-samwald closed 6 years ago
thanks for the example! indeed the order must be by decreasing similarity. can you briefly check if this is because of the latest small change maybe, which might have modified the sorting? (you could checkout the code before the change)
It seems like nnSent functionality was added quite recently itself. Which commit should I use (i.e., which commit might have introduced the problem)?
@guptaprkhr might be more able to detect the error as I'm not familiar with this part of the code. But let's try. The function called to display the NNs for a query sentence is FastText::findNNSent
, in fasttext.cc:
void FastText::findNNSent(const Matrix& sentenceVectors,
const Vector& queryVec,
int32_t k,
const std::set<std::string>& banSet,
int64_t numSent,
const std::vector<std::string>& sentences) {
real queryNorm = queryVec.norm();
if (std::abs(queryNorm) < 1e-8) {
queryNorm = 1;
}
std::priority_queue<std::pair<real, std::string>> heap;
Vector vec(args_->dim);
for (int32_t i = 0; i < numSent; i++) { //Push things in the heap
std::string sentence = std::to_string(i) + " " + sentences[i];
real dp = sentenceVectors.dotRow(queryVec, i);
heap.push(std::make_pair(dp / queryNorm, sentence));
}
int32_t i = 0;
while (i < k && heap.size() > 0) { //pop elements from the heap and display
if (!std::isnan(heap.top().first)) {
std::cout << heap.top().first << " " << heap.top().second << std::endl;
i++;
}
heap.pop();
}
}
If the std::priority_queue
is properly implemented (safe assumption :P) then I don't see how the order can be mixed up. I'll investigate more and report here later tonight or tomorrow.
can you try at the point a week ago?
On Thu, Sep 7, 2017, 4:04 PM Matthias Samwald notifications@github.com wrote:
It seems like nnSent functionality was added quite recently itself. Which commit should I use (i.e., which commit might have introduced the problem)?
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/epfml/sent2vec/issues/10#issuecomment-327809465, or mute the thread https://github.com/notifications/unsubscribe-auth/AEaGR6itIcj9Nm-_ruutLqiA1KXMsk4gks5sf_fTgaJpZM4PPrQK .
Hi Matthias, I tried the latest commit and to my surprise, the results I am getting are ordered properly. There must be some other local issue. In any case, try the latest commit and tell me if it works.
Problem still persists with the latest commit. Again, the ordering seems almost-okay for some queries, but very messy for most other queries (without any apparent pattern). I tried with both my own model and the sent2vec wiki bigrams model available for download. (EDIT: this is based on the most recently committed code)
Query sentence?
neuropathy can happen with or without the presence of diabetes mellitus .
0.826181 984594 autoimunity can develop spontaneously or can be induced experimentaly by immunization with autoantigens or antigensd croos - reactive with autoantigens .
0.829706 723894 a spaceflight can happen with or without humans on board .
0.826626 12765 aspect can combine with present or past tense .
0.824768 387428 normal hormone functions can cause or change the appearance of cellulite .
0.801093 596746 with muscle atrophy , muscles can waste away completely , or only partly .
0.799515 185099 end - stage renal failure can only be treated with dialysis or a kidney transplant .
0.794465 730533 catatonia can sometimes happen with mental disorders .
0.793723 709292 treatment of the condition can be done with either therapy or drugs .
0.776848 171750 hypertension can often be fixed with changes in diet or lifestyle .
0.774509 179722 priests can either heal with holy and discipline spells or do damage with shadow spells .
0.773723 289601 sometimes taking powerful or lots of laxatives can cause diarrhea or a flatulence .
0.773606 67574 having sex with an infected person , blood transfusions , or touching their blood or urine can sometimes pass on these conditions or diseases .
0.770759 753262 in severe cases associated with influenza a or b , theanti neuraminidase inhibitors can be given .
0.769895 1000230 even with no signs of infection , pid can cause serious harm .
0.816747 945454 albuterol can cause rapid heartbeat or tachycardia .
0.767183 579183 infections of the prostate , bladder , or seminal vesicles can lead to burning or itching sensations following ejaculation .
0.762116 1045495 taking these drugs with ssris can cause serotonin syndrome .
0.755276 340580 contact with the skin can lead to frostbite .
0.750921 194015 damage to the brain , or the use of certain drugs can cause amnesia .
0.746905 85902 other causes for abortions can be the infection of either the woman or embryo / fetus , or their immune response .
Query sentence?
there are other kinds of diabetes , like diabetes insipidus .
0.808359 860785 there are a few different types of oral cancers .
0.806687 848513 there are many different kinds of muscle diseases .
1 80107 there are other kinds of diabetes , like diabetes insipidus .
0.911153 81846 there are other kinds of medicines to treat hypercholesterolemia .
0.832822 86706 there are many types of kidney diseases .
0.835779 462640 there are other forms of struggle , too .
0.818577 798484 there are several causes of pernicious anemia .
0.812733 805485 there are different kinds of migraines .
0.808765 166460 there are standard drugs like chloromycetin which cure typhoid .
0.806561 85299 there are many diseases of nerves .
0.806189 42502 there are twenty different common types of amino acid .
0.80123 632377 there are various types of hernia .
0.800776 394185 there are other possible causes for adhd .
0.800123 698659 there are a lot of other prefixes .
0.861819 44557 there are other types of dangerous materials .
0.841768 547063 there are other types of steel .
0.837878 611912 there are several types of cerebral palsy .
0.831762 567089 there are many other kinds of sex .
0.81305 1784 there are different kinds of bubonic plague .
0.800299 572076 there are many forms of liver disease .
Query sentence?
slavery was eventually abolished in the united states .
0.909333 85751 the compromise was to delay the slavery issue in the united states .
0.90293 509002 it abolished slavery in the united states .
0.898187 715772 in 1990 , the pollution prevention act was passed in the united states .
0.888174 997943 later the trucking industry in the united states was added .
0.884522 82923 some founders wanted to abolish slavery everywhere in the united states .
0.866001 693020 damage in the northeastern united states was minor .
0.853135 858844 the divorce was recognized in the united states on august 26 , 1964 .
0.84733 1007958 the class action originated in the united states .
0.845343 955730 segregation in housing was a problem across the united states .
0.844979 509007 however , many groups sought to end slavery in the united states .
0.843993 506247 1 albums in the united states .
0.843776 858825 it was never released in the united states .
0.843058 616908 the moral majority was a political organization in the united states .
0.842843 955559 lammtarra was bred in kentucky , united states .
0.841149 767796 this drama was rated nc-17 in the united states .
0.832191 859450 634 , enacted september 9 , 1957 , was a law in the united states .
0.830943 966270 szcześniak now lives in the united states .
0.819046 950563 the united states supported galtieri .
0.817111 946822 he was raised in both taiwan and nigeria before settling in the united states .
0.814067 930285 henry halleck was a general in the united states army .
I tried to reproduce the issue, without success, the results are properly ordered using the latest commit. I'm joining @guptaprkhr in his conclusion that this might be a local issue. What's your setup ? Could it be a console / printing related issue ?
System information: Ubuntu 16.04 (64 bit, Kernel 4.4.0-93-generic) C++ version 5.4.0
I now also tried the commit BEFORE https://github.com/epfml/sent2vec/commit/99bd85de192a1f4ce1aff342fba3ab68878c1932 and found that query results are empty.
Query sentence? neuropathy can happen with or without the presence of diabetes mellitus .
Query sentence? neuropathy can happen with or without the presence of diabetes mellitus .
Query sentence?
we can't reproduce this unfortunately. did you clean the code when updating form git (it indeed didn't work earlier, but now does for us)
Hi, I have the exact same problem as matthias-samwald. The results are not ordered properly. I've cloned the repository only a few days ago. Unfortunately, I cannot show real examples, but this is the order I get back:
0.267163 995202
0.182761 989931
0.18209 989938
0.60233 28851996
0.581494 19254999
0.577039 19257290
0.577039 19134638
0.577039 19104486
0.575327 13409779
0.567145 19095098
thanks for the comment. i'm reopening the issue and we'll definitely check the sorting code again. this code part is actually mostly shared with fasttext, so i wonder if their nnSent also shares part of the issue
It seems that the dot product between the query vector and sentences vectors matrix is generating a lot of NaN, which is breaking the sort order.
A simple fix is just to check whether the resulting dot product is NaN and replace it with 0 for example. Not the right approach though, it would be better to check why the dot product is generating this load of false numbers.
I tried with both the wiki_bigrams
and wiki_unigrams
model, but I wasn't able to reproduce the problem. It could be a local issue, but nonetheless, I am looking more deeply into the code for finding the root cause.
@gugarosa If NaN's are the issue, I am wondering how @AnnikaBerger 's example still shows the similarity score. (I might be missing something, please let me know).
@AnnikaBerger Please check the comment on the pull request that I am making.
I will be looking more into this. (Also, if you can share a tiny portion of corpus on which it produces a wrong result, that will help a great deal in verifying local issue)
Thanks!
I think the incorrect order of the elements in the priority queue is related to NaN's (which at least also appear when I use the nnSent method with my data). NaN's are not shown in the final output, because they are excluded by the if statement. I think my fix also doesn't work correctly, even if it seemed to return correct result for my data. You can close the pull request. Unfortunately, I cannot provide real data, sorry.
thanks! is there any intuition why it would give any NaNs at all, for some inputs? i don't recall we ever observed a NaN. and still the values in the ordering are not NaN but wrong order by a large margin...
as it's a small example you had, maybe you could share a dummy text input which would also cause the problem?
I could reproduce the incorrect ordering by pushing the following values in the below order to your priority queue: real scores[9] = {0.182761, 0.0, 0.577039, 0.0, 0.60233, 0.577039, 0.756321, 0.0, 0.165438};
You can produce NaN's for positions 1, 3 and 7 by dividing 0.0/0.0.
I get back the following (wrong) ordering:
0.756321 6
0.60233 4
0.577039 5
0.577039 2
0.165438 8
0.182761 0
I don't know where the NaN's come from. But once you have a NaN either in the query vector or in the matrix representing the corpora, NaN's appear as similarity scores. Maybe NaN's can be replaced by zeros before dot product of query vector and corpora vector is computed?! Since you explicitly check for NaN's in fasttext.cc line 536 (which is not part of the original fasttext nn method) you seem to have also observed NaN's before.
ok, I think I found the issue, if you look at the following code:
void FastText::precomputeSentenceVectors(Matrix& sentenceVectors,std::ifstream& in) {
Vector vec(args_->dim);
sentenceVectors.zero();
std::cerr << "Pre-computing sentence vectors...";
std::vector<int32_t> line;
std::vector<int32_t> labels;
int32_t i = 0;
while (i < sentenceVectors.m_) {
dict_->getLine(in, line, labels, model_->rng);
dict_->addNgrams(line, args_->wordNgrams);
vec.zero();
for (auto it = line.cbegin(); it != line.cend(); ++it) {
vec.addRow(*input_, *it);
}
if (!line.empty()) {
vec.mul(1.0 / line.size());
}
real norm = vec.norm();
sentenceVectors.addRow(vec, i, 1.0 / norm); <<<<<<<<< Here
i++;
}
std::cerr << " done." << std::endl;
}
It normalizes the sentence vectors without checking for the norm to be non-zero. I think the problem comes from from your bank of sentences containing either empty sentences or sentences with only out of vocabulary forms. Either ways, this causes NaN values and weird ordering. I reproduced the weird ordering using a empty sentences and committed a fix. Can you try to pull and confirm the ordering is now correct?
Yes, thanks, for me it is working now.
Cool :) ! In the fix I wrote I do not remove the empty vectors from the bank. For those vectors the inner product with the query vector will be zero, which is wrong. The cosine similarity is not defined for zero vectors. For this reason it is better to preprocess the sentences file and remove empty lines.
thanks a lot for fixing this! and returning zero in these cases is actually fine, and a good convention
I observed that sometimes the cosine distances of the results differ for the same query sentence.
I try to apply the pretrained wikipedia bi-gram model on an existing corpus i have (Simple English wikipedia sentences, tokenized and lowercased with another pipeline). I tried the nnSent command to see how well we can retrieve similar sentences, and I get weird/underwhelming results.
With most queries I try, the results are not (or only roughly) ordered by vector distance:
Query sentence? neuropathy can happen with or without the presence of diabetes mellitus .
0.709152 994968 mody is often called monogenic diabetes .
0.659386 994966 maturity onset diabetes of the young , or mody is any of several hereditary forms of diabetes mellitus caused by gene mutations .
0.608413 994574 he died on may 18 , 2009 of complications of diabetes .
0.606619 994651 furst has had type 2 diabetes for many years .
0.679047 949868 it is an important part of blood glucose checking for people with diabetes mellitus or hypoglycemia .
1 842587 neuropathy can happen with or without the presence of diabetes mellitus .
0.710052 860388 neonatal diabetes mellitus ( ndm ) is a type of diabetes .
0.744731 80109 people with diabetes " mellitus " are called " diabetics " .
0.722052 80107 there are other kinds of diabetes , like diabetes insipidus . 0.719025 80132 gestational diabetes mellitus is like type 2 diabetes .
Query sentence? there are other kinds of diabetes , like diabetes insipidus . 0.794162 994968 mody is often called monogenic diabetes .
0.793908 949871 the tests are used for type 1 diabetes , latent autoimmune diabetes of adults and type 2 diabetes .
0.792295 950147 he has type 1 diabetes .
0.652549 908504 this happened after many years of fighting with diabetes .
0.722051 842587 neuropathy can happen with or without the presence of diabetes mellitus .
1 80107 there are other kinds of diabetes , like diabetes insipidus .
0.874403 80132 gestational diabetes mellitus is like type 2 diabetes .
0.812321 80109 people with diabetes " mellitus " are called " diabetics " .
0.809479 647689 type 2 diabetes makes up around 90% of cases of diabetes , while type 1 diabetes and other types of diabetes make up the other 10% .
0.798721 80135 in the case of diabetes , there are two kinds of complications .
For some queries, results seem almost ordered (but still not perfectly so):
Query sentence? slavery was eventually abolished in the united states . 0.853386 983766 slavery was abolished in 1851 .
0.819932 94713 the institution of slavery was abolished in 1925 .
0.872842 614734 sharecropping became more common in the southern united states when slavery was abolished .
0.765797 67662 following the war , slavery was outlawed everywhere in the united states .
0.76175 11557 slavery in united states was legally abolished by thirteenth amendment to the united states constitution in 1865 .
0.76154 82923 some founders wanted to abolish slavery everywhere in the united states .
0.751806 90399 the society would get slavery abolished in new york .
0.726364 712056 this gave slaves their freedom , even though slavery had been officially abolished in 1807 .
0.702602 85751 the compromise was to delay the slavery issue in the united states .
0.686827 1002030 meanwhile , some states still had slavery .
0.68088 837048 it was abolished in 1968 .
0.680816 842166 it was abolished in 1969 .
0.674035 897590 in 1784 she wrote a play that supported slavery being abolished ( gotten rid of ) .
0.67191 912262 it was abolished in 1974 .
0.667849 1001457 it was abolished in 1949 .