Synonyms shouldn't show up before actual matches

melyux commented 1 year ago

Right now, if a word has both an exact definition and a synonym link to another word, the synonym is given before the actual match. Synonyms for a given word should always come after the main definitions, if they exist, at least within the same dictionary, since they're always secondary to the main definition(s).

NiLuJe commented 1 year ago

This fixes it (by checking headwords before synomyms, and using a deque to preserve insertion order), but I have no idea if this even makes sense in real-world use-cases (I mean, checking headwords before synonyms sure seems to naïvely make sense, at least...).

Ping @cyphar who worked on this very codepath most recently, and is surely more familiar with this sort of things (I basically just saw the order of the lookup calls and went "hmm". I have no idea how any of this works :D).

diff --git a/src/libwrapper.cpp b/src/libwrapper.cpp
index 6be59c5..dc5537d 100644
--- a/src/libwrapper.cpp
+++ b/src/libwrapper.cpp
@@ -199,11 +199,13 @@ static std::string parse_data(const gchar *data, bool colorize_output)

 void Library::SimpleLookup(const std::string &str, TSearchResultList &res_list)
 {
-    std::set<glong> wordIdxs;
+    std::deque<glong> wordIdxs; // A deque to preserve insertion order (headwords before synonyms)
+    std::set<glong> wordIdxsSet; // Paired with a set to prevent duplicate entries
     res_list.reserve(ndicts());
     for (gint idict = 0; idict < ndicts(); ++idict) {
         wordIdxs.clear();
-        if (SimpleLookupWord(str.c_str(), wordIdxs, idict))
+        wordIdxsSet.clear();
+        if (SimpleLookupWord(str.c_str(), wordIdxs, wordIdxsSet, idict))
             for (auto &wordIdx : wordIdxs)
                 res_list.push_back(
                     TSearchResult(dict_name(idict),
diff --git a/src/stardict_lib.cpp b/src/stardict_lib.cpp
index 83fbc59..d96df07 100644
--- a/src/stardict_lib.cpp
+++ b/src/stardict_lib.cpp
@@ -436,7 +436,7 @@ public:
     {
         return get_key(idx);
     }
-    bool lookup(const char *str, std::set<glong> &idxs, glong &next_idx) override;
+    bool lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set, glong &next_idx) override;

 private:
     static const gint ENTR_PER_PAGE = 32;
@@ -497,7 +497,7 @@ public:
         get_data(idx);
         return get_key(idx);
     }
-    bool lookup(const char *str, std::set<glong> &idxs, glong &next_idx) override;
+    bool lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set, glong &next_idx) override;

 private:
     gchar *idxdatabuf;
@@ -684,7 +684,7 @@ const gchar *OffsetIndex::get_key(glong idx)
     return page.entries[idx_in_page].keystr;
 }

-bool OffsetIndex::lookup(const char *str, std::set<glong> &idxs, glong &next_idx)
+bool OffsetIndex::lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set, glong &next_idx)
 {
     bool bFound = false;

@@ -749,11 +749,22 @@ bool OffsetIndex::lookup(const char *str, std::set<glong> &idxs, glong &next_idx
         // In order to return all idxs that match the search string, walk
         // linearly behind and ahead of the found index.
         glong iHeadIndex = iThisIndex - 1; // do not include iThisIndex
-        while (iHeadIndex >= 0 && stardict_strcmp(str, get_key(iHeadIndex)) == 0)
-            idxs.insert(iHeadIndex--);
-        do // no need to double-check iThisIndex -- we know it's a match already
-            idxs.insert(iThisIndex++);
-        while (iThisIndex <= real_last.idx && stardict_strcmp(str, get_key(iThisIndex)) == 0);
+        while (iHeadIndex >= 0 && stardict_strcmp(str, get_key(iHeadIndex)) == 0) {
+            const glong idx = iHeadIndex--;
+            // Prevent duplicates by checking the set first
+            if (idxs_set.insert(idx).second) {
+                idxs.push_back(idx);
+                printf("OffsetIndex::lookup Inserted iHeadIndex %ld\n", iHeadIndex);
+            }
+        }
+        do {
+            // no need to double-check iThisIndex -- we know it's a match already
+            const glong idx = iThisIndex++;
+            if (idxs_set.insert(idx).second) {
+                idxs.push_back(idx);
+                printf("OffsetIndex::lookup Inserted iThisIndex %ld\n", iThisIndex);
+            }
+        } while (iThisIndex <= real_last.idx && stardict_strcmp(str, get_key(iThisIndex)) == 0);
     }
     return bFound;
 }
@@ -794,7 +805,7 @@ void WordListIndex::get_data(glong idx)
     wordentry_size = g_ntohl(get_uint32(p1));
 }

-bool WordListIndex::lookup(const char *str, std::set<glong> &idxs, glong &next_idx)
+bool WordListIndex::lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set, glong &next_idx)
 {
     bool bFound = false;
     glong iLast = wordlist.size() - 2;
@@ -825,11 +836,21 @@ bool WordListIndex::lookup(const char *str, std::set<glong> &idxs, glong &next_i
             // In order to return all idxs that match the search string, walk
             // linearly behind and ahead of the found index.
             glong iHeadIndex = iThisIndex - 1; // do not include iThisIndex
-            while (iHeadIndex >= 0 && stardict_strcmp(str, get_key(iHeadIndex)) == 0)
-                idxs.insert(iHeadIndex--);
-            do // no need to double-check iThisIndex -- we know it's a match already
-                idxs.insert(iThisIndex++);
-            while (iThisIndex <= iLast && stardict_strcmp(str, get_key(iThisIndex)) == 0);
+            while (iHeadIndex >= 0 && stardict_strcmp(str, get_key(iHeadIndex)) == 0) {
+                const glong idx = iHeadIndex--;
+                if (idxs_set.insert(idx).second) {
+                    idxs.push_back(idx);
+                    printf("WordListIndex::lookup Inserted iHeadIndex %ld\n", iHeadIndex);
+                }
+            }
+            do {
+                // no need to double-check iThisIndex -- we know it's a match already
+                const glong idx = iThisIndex++;
+                if (idxs_set.insert(idx).second) {
+                    idxs.push_back(idx);
+                    printf("WordListIndex::lookup Inserted iHeadIndex %ld\n", iHeadIndex);
+                }
+            } while (iThisIndex <= iLast && stardict_strcmp(str, get_key(iThisIndex)) == 0);
         }
     }
     return bFound;
@@ -863,7 +884,7 @@ bool SynFile::load(const std::string &url, gulong wc)
     }
 }

-bool SynFile::lookup(const char *str, std::set<glong> &idxs, glong &next_idx)
+bool SynFile::lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set, glong &next_idx)
 {
     bool bFound = false;
     glong iLast = synlist.size() - 2;
@@ -898,23 +919,31 @@ bool SynFile::lookup(const char *str, std::set<glong> &idxs, glong &next_idx)
             glong iHeadIndex = iThisIndex - 1; // do not include iThisIndex
             while (iHeadIndex >= 0 && stardict_strcmp(str, get_key(iHeadIndex)) == 0) {
                 const gchar *key = get_key(iHeadIndex--);
-                idxs.insert(g_ntohl(get_uint32(key + strlen(key) + 1)));
+                const glong idx = g_ntohl(get_uint32(key + strlen(key) + 1));
+                if (idxs_set.insert(idx).second) {
+                    idxs.push_back(idx);
+                    printf("SynFile::lookup Inserted iHeadIndex %ld\n", idx);
+                }
             }
             do {
                 // no need to double-check iThisIndex -- we know it's a match already
                 const gchar *key = get_key(iThisIndex++);
-                idxs.insert(g_ntohl(get_uint32(key + strlen(key) + 1)));
+                const glong idx = g_ntohl(get_uint32(key + strlen(key) + 1));
+                if (idxs_set.insert(idx).second) {
+                    idxs.push_back(idx);
+                    printf("SynFile::lookup Inserted iHeadIndex %ld\n", idx);
+                }
             } while (iThisIndex <= iLast && stardict_strcmp(str, get_key(iThisIndex)) == 0);
         }
     }
     return bFound;
 }

-bool Dict::Lookup(const char *str, std::set<glong> &idxs, glong &next_idx)
+bool Dict::Lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set, glong &next_idx)
 {
     bool found = false;
-    found |= syn_file->lookup(str, idxs, next_idx);
-    found |= idx_file->lookup(str, idxs, next_idx);
+    found |= idx_file->lookup(str, idxs, idxs_set, next_idx);
+    found |= syn_file->lookup(str, idxs, idxs_set, next_idx);
     return found;
 }

@@ -1023,7 +1052,7 @@ void Libs::load(const std::list<std::string> &dicts_dirs,
                   });
 }

-bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices, int iLib)
+bool Libs::LookupSimilarWord(const gchar *sWord, std::deque<glong> &iWordIndices, std::set<glong> &iWordIndicesSet, int iLib)
 {
     bool bFound = false;
     gchar *casestr;
@@ -1032,7 +1061,7 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
         // to lower case.
         casestr = g_utf8_strdown(sWord, -1);
         if (strcmp(casestr, sWord)) {
-            if (oLib[iLib]->Lookup(casestr, iWordIndices))
+            if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                 bFound = true;
         }
         g_free(casestr);
@@ -1040,7 +1069,7 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
         if (!bFound) {
             casestr = g_utf8_strup(sWord, -1);
             if (strcmp(casestr, sWord)) {
-                if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                     bFound = true;
             }
             g_free(casestr);
@@ -1054,7 +1083,7 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
             g_free(firstchar);
             g_free(nextchar);
             if (strcmp(casestr, sWord)) {
-                if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                     bFound = true;
             }
             g_free(casestr);
@@ -1074,12 +1103,12 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
             if (isupcase || sWord[iWordLen - 1] == 's' || !strncmp(&sWord[iWordLen - 2], "ed", 2)) {
                 strcpy(sNewWord, sWord);
                 sNewWord[iWordLen - 1] = '\0'; // cut "s" or "d"
-                if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                     bFound = true;
                 else if (isupcase || g_ascii_isupper(sWord[0])) {
                     casestr = g_ascii_strdown(sNewWord, -1);
                     if (strcmp(casestr, sNewWord)) {
-                        if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                        if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                             bFound = true;
                     }
                     g_free(casestr);
@@ -1097,13 +1126,13 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
                     && !bIsVowel(sNewWord[iWordLen - 4]) && bIsVowel(sNewWord[iWordLen - 5])) { // doubled

                     sNewWord[iWordLen - 3] = '\0';
-                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                         bFound = true;
                     else {
                         if (isupcase || g_ascii_isupper(sWord[0])) {
                             casestr = g_ascii_strdown(sNewWord, -1);
                             if (strcmp(casestr, sNewWord)) {
-                                if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                                if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                                     bFound = true;
                             }
                             g_free(casestr);
@@ -1113,12 +1142,12 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
                     }
                 }
                 if (!bFound) {
-                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                         bFound = true;
                     else if (isupcase || g_ascii_isupper(sWord[0])) {
                         casestr = g_ascii_strdown(sNewWord, -1);
                         if (strcmp(casestr, sNewWord)) {
-                            if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                            if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                                 bFound = true;
                         }
                         g_free(casestr);
@@ -1136,13 +1165,13 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
                 if (iWordLen > 6 && (sNewWord[iWordLen - 4] == sNewWord[iWordLen - 5])
                     && !bIsVowel(sNewWord[iWordLen - 5]) && bIsVowel(sNewWord[iWordLen - 6])) { // doubled
                     sNewWord[iWordLen - 4] = '\0';
-                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                         bFound = true;
                     else {
                         if (isupcase || g_ascii_isupper(sWord[0])) {
                             casestr = g_ascii_strdown(sNewWord, -1);
                             if (strcmp(casestr, sNewWord)) {
-                                if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                                if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                                     bFound = true;
                             }
                             g_free(casestr);
@@ -1152,12 +1181,12 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
                     }
                 }
                 if (!bFound) {
-                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                         bFound = true;
                     else if (isupcase || g_ascii_isupper(sWord[0])) {
                         casestr = g_ascii_strdown(sNewWord, -1);
                         if (strcmp(casestr, sNewWord)) {
-                            if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                            if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                                 bFound = true;
                         }
                         g_free(casestr);
@@ -1168,12 +1197,12 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
                         strcat(sNewWord, "E"); // add a char "E"
                     else
                         strcat(sNewWord, "e"); // add a char "e"
-                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                         bFound = true;
                     else if (isupcase || g_ascii_isupper(sWord[0])) {
                         casestr = g_ascii_strdown(sNewWord, -1);
                         if (strcmp(casestr, sNewWord)) {
-                            if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                            if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                                 bFound = true;
                         }
                         g_free(casestr);
@@ -1188,12 +1217,12 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
             if (isupcase || (!strncmp(&sWord[iWordLen - 2], "es", 2) && (sWord[iWordLen - 3] == 's' || sWord[iWordLen - 3] == 'x' || sWord[iWordLen - 3] == 'o' || (iWordLen > 4 && sWord[iWordLen - 3] == 'h' && (sWord[iWordLen - 4] == 'c' || sWord[iWordLen - 4] == 's'))))) {
                 strcpy(sNewWord, sWord);
                 sNewWord[iWordLen - 2] = '\0';
-                if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                     bFound = true;
                 else if (isupcase || g_ascii_isupper(sWord[0])) {
                     casestr = g_ascii_strdown(sNewWord, -1);
                     if (strcmp(casestr, sNewWord)) {
-                        if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                        if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                             bFound = true;
                     }
                     g_free(casestr);
@@ -1210,13 +1239,13 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
                 if (iWordLen > 5 && (sNewWord[iWordLen - 3] == sNewWord[iWordLen - 4])
                     && !bIsVowel(sNewWord[iWordLen - 4]) && bIsVowel(sNewWord[iWordLen - 5])) { // doubled
                     sNewWord[iWordLen - 3] = '\0';
-                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                         bFound = true;
                     else {
                         if (isupcase || g_ascii_isupper(sWord[0])) {
                             casestr = g_ascii_strdown(sNewWord, -1);
                             if (strcmp(casestr, sNewWord)) {
-                                if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                                if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                                     bFound = true;
                             }
                             g_free(casestr);
@@ -1226,12 +1255,12 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
                     }
                 }
                 if (!bFound) {
-                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                    if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                         bFound = true;
                     else if (isupcase || g_ascii_isupper(sWord[0])) {
                         casestr = g_ascii_strdown(sNewWord, -1);
                         if (strcmp(casestr, sNewWord)) {
-                            if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                            if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                                 bFound = true;
                         }
                         g_free(casestr);
@@ -1250,12 +1279,12 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
                     strcat(sNewWord, "Y"); // add a char "Y"
                 else
                     strcat(sNewWord, "y"); // add a char "y"
-                if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                     bFound = true;
                 else if (isupcase || g_ascii_isupper(sWord[0])) {
                     casestr = g_ascii_strdown(sNewWord, -1);
                     if (strcmp(casestr, sNewWord)) {
-                        if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                        if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                             bFound = true;
                     }
                     g_free(casestr);
@@ -1273,12 +1302,12 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
                     strcat(sNewWord, "Y"); // add a char "Y"
                 else
                     strcat(sNewWord, "y"); // add a char "y"
-                if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                     bFound = true;
                 else if (isupcase || g_ascii_isupper(sWord[0])) {
                     casestr = g_ascii_strdown(sNewWord, -1);
                     if (strcmp(casestr, sNewWord)) {
-                        if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                        if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                             bFound = true;
                     }
                     g_free(casestr);
@@ -1292,12 +1321,12 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
             if (isupcase || (!strncmp(&sWord[iWordLen - 2], "er", 2))) {
                 strcpy(sNewWord, sWord);
                 sNewWord[iWordLen - 2] = '\0';
-                if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                     bFound = true;
                 else if (isupcase || g_ascii_isupper(sWord[0])) {
                     casestr = g_ascii_strdown(sNewWord, -1);
                     if (strcmp(casestr, sNewWord)) {
-                        if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                        if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                             bFound = true;
                     }
                     g_free(casestr);
@@ -1311,12 +1340,12 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
             if (isupcase || (!strncmp(&sWord[iWordLen - 3], "est", 3))) {
                 strcpy(sNewWord, sWord);
                 sNewWord[iWordLen - 3] = '\0';
-                if (oLib[iLib]->Lookup(sNewWord, iWordIndices))
+                if (oLib[iLib]->Lookup(sNewWord, iWordIndices, iWordIndicesSet))
                     bFound = true;
                 else if (isupcase || g_ascii_isupper(sWord[0])) {
                     casestr = g_ascii_strdown(sNewWord, -1);
                     if (strcmp(casestr, sNewWord)) {
-                        if (oLib[iLib]->Lookup(casestr, iWordIndices))
+                        if (oLib[iLib]->Lookup(casestr, iWordIndices, iWordIndicesSet))
                             bFound = true;
                     }
                     g_free(casestr);
@@ -1336,11 +1365,11 @@ bool Libs::LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices,
     return bFound;
 }

-bool Libs::SimpleLookupWord(const gchar *sWord, std::set<glong> &iWordIndices, int iLib)
+bool Libs::SimpleLookupWord(const gchar *sWord, std::deque<glong> &iWordIndices, std::set<glong> &iWordIndicesSet, int iLib)
 {
-    bool bFound = oLib[iLib]->Lookup(sWord, iWordIndices);
+    bool bFound = oLib[iLib]->Lookup(sWord, iWordIndices, iWordIndicesSet);
     if (!bFound && fuzzy_)
-        bFound = LookupSimilarWord(sWord, iWordIndices, iLib);
+        bFound = LookupSimilarWord(sWord, iWordIndices, iWordIndicesSet, iLib);
     return bFound;
 }

diff --git a/src/stardict_lib.hpp b/src/stardict_lib.hpp
index bb5c4de..19c6023 100644
--- a/src/stardict_lib.hpp
+++ b/src/stardict_lib.hpp
@@ -1,11 +1,13 @@
 #pragma once

 #include <cstring>
+#include <deque>
 #include <functional>
 #include <list>
 #include <memory>
 #include <set>
 #include <string>
+#include <utility>
 #include <vector>

 #include "dictziplib.hpp"
@@ -95,11 +97,11 @@ public:
     virtual const gchar *get_key(glong idx) = 0;
     virtual void get_data(glong idx) = 0;
     virtual const gchar *get_key_and_data(glong idx) = 0;
-    virtual bool lookup(const char *str, std::set<glong> &idxs, glong &next_idx) = 0;
-    virtual bool lookup(const char *str, std::set<glong> &idxs)
+    virtual bool lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set, glong &next_idx) = 0;
+    virtual bool lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set)
     {
         glong unused_next_idx;
-        return lookup(str, idxs, unused_next_idx);
+        return lookup(str, idxs, idxs_set, unused_next_idx);
     };
 };

@@ -109,8 +111,8 @@ public:
     SynFile() {}
     ~SynFile() {}
     bool load(const std::string &url, gulong wc);
-    bool lookup(const char *str, std::set<glong> &idxs, glong &next_idx);
-    bool lookup(const char *str, std::set<glong> &idxs);
+    bool lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set, glong &next_idx);
+    bool lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set);
     const gchar *get_key(glong idx) { return synlist[idx]; }

 private:
@@ -142,11 +144,11 @@ public:
         *offset = idx_file->wordentry_offset;
         *size = idx_file->wordentry_size;
     }
-    bool Lookup(const char *str, std::set<glong> &idxs, glong &next_idx);
-    bool Lookup(const char *str, std::set<glong> &idxs)
+    bool Lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set, glong &next_idx);
+    bool Lookup(const char *str, std::deque<glong> &idxs, std::set<glong> &idxs_set)
     {
         glong unused_next_idx;
-        return Lookup(str, idxs, unused_next_idx);
+        return Lookup(str, idxs, idxs_set, unused_next_idx);
     }

     bool LookupWithRule(GPatternSpec *pspec, glong *aIndex, int iBuffLen);
@@ -195,12 +197,12 @@ public:
             return nullptr;
         return oLib[iLib]->get_data(iIndex);
     }
-    bool LookupWord(const gchar *sWord, std::set<glong> &iWordIndices, int iLib)
+    bool LookupWord(const gchar *sWord, std::deque<glong> &iWordIndices, std::set<glong> &iWordIndicesSet, int iLib)
     {
-        return oLib[iLib]->Lookup(sWord, iWordIndices);
+        return oLib[iLib]->Lookup(sWord, iWordIndices, iWordIndicesSet);
     }
-    bool LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices, int iLib);
-    bool SimpleLookupWord(const gchar *sWord, std::set<glong> &iWordIndices, int iLib);
+    bool LookupSimilarWord(const gchar *sWord, std::deque<glong> &iWordIndices, std::set<glong> &iWordIndicesSet, int iLib);
+    bool SimpleLookupWord(const gchar *sWord, std::deque<glong> &iWordIndices, std::set<glong> &iWordIndicesSet, int iLib);

     bool LookupWithFuzzy(const gchar *sWord, gchar *reslist[], gint reslist_size);
     gint LookupWithRule(const gchar *sWord, gchar *reslist[]);

NiLuJe commented 1 year ago

I have no idea how the indices are assigned to begin with, for instance. If there was a way to simply make sure synonyms always have indices greater than headwords, this would be a much simpler patch ;).

And if duplicates are even a valid concern? IIRC, everything is sorted, so I don't even know how collisions could even happen?

cyphar commented 1 year ago

Indices are simply the index of the headword in the index file (which is sorted "alphabetically" to allow for binary searching). You can't really control it. And the "index" of synonyms is just the index in the index file (after resolving the synonym) so you can't differentiate the two based on the index alone.

Duplicates are a concern -- you can have dictionaries which have a synonym which points to a headword that is identical to the original string, and you don't want to be giving the same definition multiple times (previously sdcv would only give one result, and changing it to produce multiple results resulted in some entries showing up multiple times with some dictionaries). You can also hit the same issue with fuzzy searching (though it does in general stop once it's found something so it won't happen unless you have a particularly strange dictionary).

@melyux Are you sure that it's the case that synonyms are always before the headword? Results should be ordered by the index number which means that if the headword is earlier in the dictionary (i.e. it is earlier in the sorted list) it would show up first. std::set iteration is ordered, after all. I can try to make a dictionary which tests this (a word "a" which is a synonym of "z" as well). _EDIT: I just tested this, and I can confirm it's always alphabetical order, regardless of whether the first result found was a synonym or not. In fact, in the test_multiple_results dictionary included in this repository, you can verify for yourself that if you search for cat you get the headword before the synonyms because cat comes earlier in the sorted dictionary._

@NiLuJe As for the patch, I suspect that a much simpler solution would be to just change Dict::Lookup to do the regular lookup first, create a list of the output, then do the same with the synonym list and append it to the first list (we will probably have to change the signature of Dict::Lookup since it currently operates on a std::set -- though we might be able to do this without changing the signature with some std::set<Compare=> trickery, possibly by returning a set of (index, bool synonym) tuples?). This would explicitly add the requirement that the synonyms must come after matching headwords, without having to change any of the other logic.

Also this patch causes some entries to be ordered the wrong way in the resulting output (if we want sorted output as we have today -- when iterating over candidate entries we iterate over entries before our index in reverse order (this is done to reduce the overhead to be as minimal as possible) but this means you'll end up with entries being in a different order to the dictionary. The approach I propose above avoids this issue too.

NiLuJe commented 1 year ago

Keep in mind I don't generally do C++, so the Compare stuff flies a bit over my head, but I think I got the gist of the rest, thanks ;).

I'll see if I can play some more with this this week-end.

NiLuJe commented 1 year ago

Results should be ordered by the index number which means that if the headword is earlier in the dictionary (i.e. it is earlier in the sorted list) it would show up first.

Yep, the testcase I landed on was daemon as a synonym of demon, it sorts earlier, and had a lower index, but I wasn't sure what the relationship between those two facts was, thanks!

cyphar commented 1 year ago

I don't do C++ much either. :wink: (And I suspect that idea wouldn't have worked. :sweat_smile:)

I'm already working on a patch for this. The main hiccup is that the trivial way of fixing it (switching the top-level Lookup methods to operate on a vector) may lead to duplicates if you call Lookup multiple times (which fuzzy-search might do inadvertently, though it seems likely it won't do it). What would be really nice is an equivalent to Python's OrderedDict (except as a set) so we can get insertion-ordered iteration but I guess I could just write a simple version of that structure...

melyux commented 1 year ago

You're correct, synonyms aren't always before the headword, it seems to be just alphabetical. Good looking out. The thinking goes: each entry is primarily associated with a single spelling, its own headword, and in most cases is more closely associated with that spelling than to other equivalent ones. You'd want to exhaust that word's exact matches before moving onto showing entries it's only secondarily associated with, i.e. synonyms.

e.g. "shill":

noun "A decoy, an accomplice..."
adjective "Shrill, resonant...
verb "Separate..."
verb "Resound, be loud..."
verb "Act as a shill or accomplice..."
synonym → "sheel" verb "Take off the husk, to shell"

So right now "sheel" is given as the first definition when you search "shill" because it's alphabetically ahead of the "shill" spelling itself, even though it's only an afterthought compared to the primaries, which are much more closely bound to that spelling, as evinced by them being primaries! I think this is the case for most dictionaries.

cyphar commented 1 year ago

@melyux Yeah, I agree this is an issue. I'm working on a patch for this.

Dushistov / sdcv

Synonyms shouldn't show up before actual matches #89