num_entries_counter in mdict.cc is incorrect in mdx version 2.0

terasum commented 3 years ago

@hehonghui Can you provide further information about this? mdx file or something?

terasum commented 3 years ago

@hehonghui Can you provide the original dictionary file and the error message?

hehonghui commented 3 years ago

@terasum Thanks for the reply. My test environment is as follows:

Machine: MacBook Pro 2015
OS: Mac OS X 10.15.7
mdx file: 新牛津英汉双解大词典-孤雷排版-2011.8.29.mdx

The error output is:

executable file name: ./bin/mydict
../mdict-analysis/xnj_jinshan.mdx

Assertion failed: (num_entries_counter == this->entries_num), function decode_key_block_info, file /Users/mrsimple/dev/github/mdict-cpp/mdict.cc, line 1246.
[1]    72930 abort      ./bin/mydict ../mdict-analysis/xnj_jinshan.mdx hello

My initial guess is that this is a big-endian/little-endian byte-order conversion problem; my machine should be little-endian.

terasum commented 3 years ago

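@hehonghui Is your environment a PowerPC CPU? Byte order should already be handled, and the code is meant to be platform independent. The problem may be in the byte-to-int conversion; I'll test this locally.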
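For illustration, a byte-to-int conversion that does not depend on host byte order could be sketched as below. This is a minimal sketch, not code from mdict.cc: the helper names `read_be_u32` and `read_be_u64` are hypothetical, and it assumes the fields being decoded are stored big-endian, which this thread does not confirm. It uses `unsigned char` so that bytes above 0x7F are not sign-extended, a common source of wrong counters in this kind of conversion.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from mdict.cc): read a 32-bit big-endian unsigned
// integer from a byte buffer. Shifts operate on values, not on memory layout,
// so the result is the same on little-endian and big-endian hosts.
inline std::uint32_t read_be_u32(const unsigned char* p) {
  return (static_cast<std::uint32_t>(p[0]) << 24) |
         (static_cast<std::uint32_t>(p[1]) << 16) |
         (static_cast<std::uint32_t>(p[2]) << 8) |
         static_cast<std::uint32_t>(p[3]);
}

// Hypothetical helper: same idea for a 64-bit big-endian field.
inline std::uint64_t read_be_u64(const unsigned char* p) {
  std::uint64_t v = 0;
  for (std::size_t i = 0; i < 8; ++i) {
    v = (v << 8) | static_cast<std::uint64_t>(p[i]);
  }
  return v;
}
```

If the buffer is held as `char*`, casting the pointer to `const unsigned char*` before calling these helpers keeps the conversion free of sign extension.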

hehonghui commented 3 years ago

My machine should have this CPU: Intel i5-5257U, https://ark.intel.com/content/www/cn/zh/ark/products/84985/intel-core-i5-5257u-processor-3m-cache-up-to-3-10-ghz.html .

hehonghui commented 3 years ago

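@terasum Happy New Year. If you have an update, feel free to push it and I can help test it. Also, the goldendict project has a Qt-based mdx parser implementation that may be a useful reference. :)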

terasum commented 3 years ago

@hehonghui Sorry, I had no time to look at this over the Spring Festival. I'll fix it soon; thanks a lot for the reminder.

terasum commented 3 years ago

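@hehonghui I looked at the code over the weekend. Your dictionary uses an older version of the format, and my code does not implement that logic yet, so I need some time to debug it.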

hehonghui commented 3 years ago

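Right, the mdx file I used earlier is version 1.2. While testing recently I found another problem: if the queried word does not exist in the dictionary, an exception is raised and the process crashes outright. Judging from the error, the mid and end indices in the reduce0 function are at fault.

The error output is below for reference; the problem should be in the reduce function.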

init cost time: 36ms
executable file name: ./target/bin/mydict
mdx : ./New_Oxford_Thesaurus.mdx
word : One
[1]    41774 segmentation fault  ./target/bin/mydict ./New_Oxford_Thesaurus.mdx One

hehonghui commented 3 years ago

Another problem with the mid and end indices in reduce0 is that words the Python script can find cannot be found with mdict-cpp. Here is the test dictionary (try querying words such as case and cake):

New_Oxford_Thesaurus.mdx.zip

terasum commented 3 years ago

@hehonghui Hold on, let me take a look.

hehonghui commented 3 years ago

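I just fixed the crash, mainly by handling the mid and end values in reduce0: they can exceed the length of key_block_info_list, or both be 0, and the existing code does not handle these boundary cases. The problem of words not being found is still unsolved, though. :) Thanks for the hard work.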
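One way to guard against that kind of boundary problem is to validate the bounds once, before any indexing happens. The following is only a sketch under assumptions: `IndexRange` and `clamp_range` are hypothetical names that do not exist in mdict-cpp, and the actual fix applied in the repository may look different.

```cpp
#include <cstddef>

// Hypothetical helper, not part of mdict-cpp: a validated half-open range.
struct IndexRange {
  std::size_t begin;
  std::size_t end;  // one past the last valid index
  bool empty() const { return begin >= end; }
};

// Clamp caller-supplied bounds to a container holding `size` elements.
// An empty container or inverted bounds yields an empty range, so the
// caller never dereferences an index outside [0, size).
inline IndexRange clamp_range(std::size_t start, std::size_t end,
                              std::size_t size) {
  if (end > size) end = size;
  if (start > end) start = end;
  return IndexRange{start, end};
}
```

A search loop can then iterate over `[range.begin, range.end)` and return a not-found value immediately when `range.empty()` is true.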

hehonghui commented 3 years ago

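I took a look: it is a problem with the binary search in reduce0. The current algorithm skips some index entries, so some words cannot be found. A plain for loop also solves the problem in a simple, brute-force way (key_block_info_list is not large; for the test dictionary its printed size is only a few dozen, and at that scale binary search and a plain for loop make no real difference in efficiency):

Note that the return type of the reduce functions has been changed to long.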

/**
 * look up a word in the dictionary file
 * @param word the word to search for
 * @return the definition, or an empty string if the word is not found
 */
std::string Mdict::lookup(const std::string word) {
  try {
    // search word in key block info list
    long idx = this->reduce0(word, 0, this->key_block_info_list.size());
    std::cout << "==> lookup idx " << idx << std::endl;
    if (idx >= 0) {
      // decode key block by block id
      std::vector<key_list_item*> tlist = this->decode_key_block_by_block_id(idx);
      // search the key list for the index of the word
      long word_id = reduce1(tlist, word);
      if (word_id >= 0) {
        // find the record block index from the word's record start offset
        unsigned long record_block_idx = reduce2(tlist[word_id]->record_start);
        // decode the record block by record block index
        auto vec = decode_record_block_by_rid(record_block_idx);
        // extract the definition for the word
        std::string def = reduce3(vec, word);
        return def;
      }
    }
  } catch (const std::exception& e) {
    std::cout << "==> lookup error: " << e.what() << std::endl;
  }
  return std::string();
}

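/**
 * find which key block contains the key word
 * @param phrase the word to search for
 * @param start first block index to check
 * @param end one past the last block index to check
 * @return the block index, or -1 if no block contains the word
 */
long Mdict::reduce0(
    std::string phrase, unsigned long start,
    unsigned long end) {  // linear scan over the sparse key block index
  // clamp end so a caller-supplied bound can never index past the list
  const unsigned long size = this->key_block_info_list.size();
  if (end > size) {
    end = size;
  }
  for (unsigned long i = start; i < end; ++i) {
    std::string first_key = this->key_block_info_list[i]->first_key;
    std::string last_key = this->key_block_info_list[i]->last_key;
    std::cout << "index : " << i << ", first_key : " << first_key << ", last_key : " << last_key << std::endl;
    // the word belongs to block i when first_key <= phrase <= last_key
    if (phrase.compare(first_key) >= 0 && phrase.compare(last_key) <= 0) {
      std::cout << ">>>>>>>>>>>> found index " << i << std::endl;
      return static_cast<long>(i);
    }
  }
  return -1;
}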

long Mdict::reduce1(
    std::vector<key_list_item*> wordlist,
    std::string phrase) {  // non-recursive binary search within one key block
  // half-open range [left, right): the unsigned indices can never underflow,
  // even for an empty list or when the match would be before index 0
  unsigned long left = 0;
  unsigned long right = wordlist.size();
  std::string word = _s(std::move(phrase));

  while (left < right) {
    unsigned long mid = left + ((right - left) >> 1);
    int comp = word.compare(_s(wordlist[mid]->key_word));
    if (comp == 0) {
      return static_cast<long>(mid);
    } else if (comp > 0) {
      left = mid + 1;
    } else {
      right = mid;
    }
  }
  return -1;
}

terasum commented 3 years ago

@hehonghui Sorry, I still need to investigate why words cannot be found. Your code now returns -1 to represent the not-found case; I still need to test it.

terasum commented 3 years ago

@hehonghui I improved the test framework last night; today I'll look carefully at the problem you described.

terasum commented 3 years ago

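@hehonghui reduce0 is the index lookup step. The original binary search only compared for an exact match against the given word, but the index that reduce0 searches is sparse, so when the comparison did not match it moved on to the next entry, which makes it very easy to miss words:

key_block_info_list
[{aback, cat},{firstKey2, secondKey2},{firstKey3, secondKey3}...]
Finding aback works, but a word that falls between aback and cat goes straight to the end, so your direct traversal is a good solution.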
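If the index ever grows large enough for the linear scan to matter, a binary search can still work on a sparse index as long as each step compares the word against the whole [first key, last key] range of a block instead of testing for an exact match. The following is a minimal sketch, not the repository's code: `KeyBlockInfo` and `find_block` are hypothetical stand-ins, and it assumes the blocks are sorted and non-overlapping under the same ordering that `std::string::compare` uses.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of key_block_info_list.
struct KeyBlockInfo {
  std::string first_key;
  std::string last_key;
};

// Binary search over a sparse index sorted by first_key. Returns the index
// of the block whose [first_key, last_key] range contains phrase, or -1.
inline long find_block(const std::vector<KeyBlockInfo>& blocks,
                       const std::string& phrase) {
  std::size_t left = 0;
  std::size_t right = blocks.size();  // half-open range [left, right)
  while (left < right) {
    std::size_t mid = left + (right - left) / 2;
    if (phrase.compare(blocks[mid].first_key) < 0) {
      right = mid;  // phrase sorts before this block
    } else if (phrase.compare(blocks[mid].last_key) > 0) {
      left = mid + 1;  // phrase sorts after this block
    } else {
      return static_cast<long>(mid);  // first_key <= phrase <= last_key
    }
  }
  return -1;  // phrase falls in a gap between blocks or outside the index
}
```

With blocks such as {aback, cat} and {cave, fig}, a word like "bake" resolves to the first block even though it equals neither boundary key, which is the case the exact-match comparison missed.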

hehonghui commented 3 years ago

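Today I am running more complete tests. I exported the mdx dictionary attached above to a word list with mdxit-utils, then called mdict-cpp to query each word in turn. The vast majority can now be found (15773 words in total, 15724 found). Only the last forty to fifty words at the tail cannot be found; for those, decode_record_block_by_rid throws an exception.

Word list:

New_Oxford_Thesaurus.txt

Test code: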

#include "mdict_extern.h"

#include <sys/time.h>

#include <cassert>  // assert
#include <cstdlib>
#include <cstring>  // strcmp
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

typedef long long int64;
class Timetool {
 public:
  static int64 getSystemTime() {
    timeval tv;
    gettimeofday(&tv, NULL);
    int64 t = tv.tv_sec;
    t *= 1000;
    t += tv.tv_usec / 1000;
    return t;
  }
};

int main(int argc, char** argv) {
  if (argc < 3) {
    std::cout << "please specify an mdx file and a word list file." << std::endl;
    std::cout << "for example:   ./target/bin/querytest  ./your.mdx  ./your_word_line_by_line.txt" << std::endl;
    return -1;
  }
  if (strcmp(argv[1], "") == 0) {
    std::cout << "please specify an mdx file" << std::endl;
    return -1;
  }

  if (strcmp(argv[2], "") == 0) {
    std::cout << "please specify a word list file" << std::endl;
    return -1;
  }

  int64 t1 = Timetool::getSystemTime();
  void* dict = mdict_init(argv[1], "en_US.aff", "en_US.dic");

  int64 t2 = Timetool::getSystemTime();
  std::cout << "init cost time: " << t2 - t1 << "ms" << std::endl;
  // information
  std::cout<<"executable file name: "<<argv[0]<<std::endl;
  std::cout << "mdx : " << argv[1] <<std::endl;
  std::cout << "word file record : " << argv[2] <<std::endl;

  std::ifstream myfile;
  myfile.open(argv[2]);

  if(!myfile.is_open()) {
        perror("Error open");
        exit(EXIT_FAILURE);
  }

  std::vector<std::string> words;
  std::string line;
  while(getline(myfile, line)) {
      words.push_back(line);
      std::cout << line << std::endl;
  }
  std::cout << "total words : " << words.size() << std::endl;

  std::size_t foundCount = 0;
  for (std::size_t i = 0; i < words.size(); ++i) {
    char* result = mdict_query(dict, words[i].c_str());
    if (result != nullptr) {
      foundCount++;
      std::cout << "found " << words[i] << ", count : " << foundCount << std::endl;
      free(result);
    } else {
      std::cout << words[i] << " not found!" << std::endl;
    }
  }
  std::cout << "foundCount: " << foundCount << ", total: " << words.size() << std::endl;

  // take the timing before the assert so it is still printed when words are missing
  int64 t3 = Timetool::getSystemTime();
  std::cout << "lookup cost time: " << t3 - t2 << " ms" << std::endl;

  assert(foundCount == words.size());

  mdict_destory(dict);
  return 0;
}

terasum commented 3 years ago

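@hehonghui I have fixed the possible out-of-bounds access when looking up the last word; the last word in your test dictionary is "zoom". Some special words, such as those with uppercase letters or spaces, still need fixing.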
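Lookups for keys with uppercase letters or spaces usually depend on normalizing both the query and the stored key in the same way before comparing them. The sketch below only illustrates that idea: `normalize_key` is a hypothetical function, it is not the library's `_s()` helper, and the real normalization rules for mdx keys may differ (for example in how punctuation is treated).

```cpp
#include <cctype>
#include <string>

// Hypothetical normalization, not the library's actual _s(): lowercase ASCII
// letters and drop whitespace so that "Types Of Saw" and "typesofsaw"
// compare equal. Bytes outside ASCII pass through unchanged in the "C" locale.
inline std::string normalize_key(const std::string& key) {
  std::string out;
  out.reserve(key.size());
  for (unsigned char c : key) {
    if (std::isspace(c)) {
      continue;  // skip spaces, tabs, and newlines
    }
    out.push_back(static_cast<char>(std::tolower(c)));
  }
  return out;
}
```

Whatever rules are chosen, the sort order of the index has to be the order of the normalized keys, otherwise a search that compares normalized strings can still miss entries.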

terasum commented 3 years ago

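Words with uppercase letters and spaces can now be queried. The following words still have some problems, but this no longer affects normal use: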

Table_AFRICAN PEOPLES
Table_ALLOYS
Table_ANTHROPOLOGISTS
Table_ASTRONOMERS
Table_BIOCHEMICAL SUGARS
Table_BRANCHES OF MATHEMATICS
Table_BRANCHES OF PHILOSOPHY
Table_CARDINAL VIRTUES & THEOLOGICAL VIRTUES
Table_CHILDREN'S GAMES
Table_COCKTAILS AND MIXED DRINKS
Table_CONTINENTS OF THE WORLD
Table_CREATURES FROM MYTHOLOGY AND FOLKLORE
Table_CUTS OR JOINTS OF MEAT
Table_DANCES AND TYPES OF DANCING
Table_DIETARY HABITS
Table_DWELLINGS
Table_FAMOUS GANGSTERS
Table_FILM DIRECTORS
Table_FIREWORKS
Table_FLOWERING PLANTS AND SHRUBS
Table_INVENTORS
Table_JAZZ MUSICIANS AND SINGERS
Table_MARSUPIALS
Table_MEASUREMENT UNITS
Table_NAMES OF CANONICAL HOURS
Table_NOBEL PRIZEWINNERS FOR PEACE & NOBEL PRIZEWINNERS FOR ECONOMIC SCIENCES
Table_PHONETIC ALPHABET
Table_POISONOUS PLANTS AND FUNGI
Table_PSYCHIATRISTS, PSYCHOLOGISTS, AND PSYCHOANALYSTS
Table_PSYCHOLOGICAL ILLNESSES AND CONDITIONS
Table_rely on; depend; trust
Table_SALAD DRESSINGS
Table_SEASHELLS
Table_SHAPES OF LENSES
Table_SNAKES
Table_THE THREE GRACES
Table_TITLES OF RULERS
Table_TYPEFACES
Table_TYPES AND FORMS OF PAINTING
Table_TYPES OF ANCHOR
Table_TYPES OF CLERICAL VESTMENT
Table_TYPES OF MUSICAL ORGAN
Table_TYPES OF ROPE
Table_TYPES OF SAW
Table_TYPES OF SCHOOL
Table_TYPES OF STORY
Table_TYPES OF TENT
Table_TYPES OF TOWER
Table_WEAVING TERMS

hehonghui commented 3 years ago

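@terasum I tested it last night, and the lookup match rate for the tested mdx has reached 100%. Have you considered pushing on to support versions below 2.0 and LZO decompression? With those two in place, the library would be fairly complete. :)

For reference, see this Java implementation: https://github.com/KnIfER/mdict-java/blob/master/src/main/java/com/knziha/plod/dictionary/mdict.java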

https://github.com/KnIfER/mdict-java/blob/master/src/main/java/com/knziha/plod/dictionary/mdBase.java

terasum commented 3 years ago

@hehonghui OK, I'll find time soon to get this part done.

hehonghui commented 2 years ago

@terasum Are there any plans for further updates? :)

terasum commented 2 years ago

@hehonghui I'm currently optimizing the JS version; I haven't had time to update the C++ version yet.

hehonghui commented 2 years ago

OK.

dictlab / mdict-cpp

num_entries_counter in mdict.cc is incorrect in mdx version 2.0 #5