dictlab / mdict-cpp

*.mdx/*.mdd file interpreter cpp implementation

num_entries_counter in mdict.cc is incorrect in mdx version 2.0 #5

Closed hehonghui closed 2 years ago

terasum commented 3 years ago

@hehonghui Can you provide further information about this? mdx file or something?

terasum commented 3 years ago

@hehonghui Could you provide the original dictionary file and the error message?

hehonghui commented 3 years ago

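@hehonghui Could you provide the original dictionary file and the error message?

terasum, thanks for the reply. My test environment is as follows:

The error message is as follows: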

executable file name: ./bin/mydict
../mdict-analysis/xnj_jinshan.mdx

Assertion failed: (num_entries_counter == this->entries_num), function decode_key_block_info, file /Users/mrsimple/dev/github/mdict-cpp/mdict.cc, line 1246.
[1]    72930 abort      ./bin/mydict ../mdict-analysis/xnj_jinshan.mdx hello

My initial guess is that this is a big-endian/little-endian byte-order conversion problem; my machine should be little-endian.

terasum commented 3 years ago

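@hehonghui Is your environment a PowerPC CPU? Byte order should already be handled, so the code should be platform-independent. The problem may be in the byte-to-int conversion; I'll test this locally.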
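For reference, one way to keep a byte-to-int step independent of host byte order is to assemble the value with shifts instead of casting the buffer to an integer pointer. The sketch below assumes the counters in the file are stored big-endian; `read_be` is a hypothetical helper for illustration, not a function from mdict-cpp.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of mdict-cpp.
// Decodes a big-endian unsigned integer of `width` bytes (1..8) from `data`.
// Because the value is built byte by byte with shifts, the result is the
// same on little-endian and big-endian hosts.
inline std::uint64_t read_be(const unsigned char* data, std::size_t width) {
  std::uint64_t value = 0;
  for (std::size_t i = 0; i < width; ++i) {
    value = (value << 8) | static_cast<std::uint64_t>(data[i]);
  }
  return value;
}
```

Called as `read_be(buf, 4)` or `read_be(buf, 8)`, the same helper covers both 4-byte and 8-byte counters, so a wrong width shows up as a wrong count rather than as a platform-specific difference.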

hehonghui commented 3 years ago

My machine should have this CPU: Intel i5-5257U, https://ark.intel.com/content/www/cn/zh/ark/products/84985/intel-core-i5-5257u-processor-3m-cache-up-to-3-10-ghz.html

hehonghui commented 3 years ago

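@terasum Happy New Year! If you have an update, please push it and I can help test it. Also, the goldendict project has a Qt-based mdx parsing implementation that you could refer to. :)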

terasum commented 3 years ago

@hehonghui Sorry, I had no time to look at this during the Spring Festival. I'll fix it soon; thanks a lot for the reminder.

terasum commented 3 years ago

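@hehonghui I looked at the code over the weekend. Your dictionary uses a fairly old version of the format, and my code does not implement that logic yet. It will take some time to debug.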

hehonghui commented 3 years ago

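@hehonghui I looked at the code over the weekend. Your dictionary uses a fairly old version of the format, and my code does not implement that logic yet. It will take some time to debug.

Yes, the mdx I used before was version 1.2. While testing recently I found another problem: if the queried word does not exist in the dictionary, an error occurs and the process crashes outright. Judging from the error, the mid and end indices in the reduce0 function are at fault.

The error output is below for reference; the problem appears to be in the reduce function.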

init cost time: 36ms
executable file name: ./target/bin/mydict
mdx : ./New_Oxford_Thesaurus.mdx
word : One
[1]    41774 segmentation fault  ./target/bin/mydict ./New_Oxford_Thesaurus.mdx One
hehonghui commented 3 years ago

Another problem with the mid and end indices in reduce0 is that words the Python script can find cannot be found with mdict-cpp. Here is the test dictionary (try querying case, cake, and so on):

New_Oxford_Thesaurus.mdx.zip

terasum commented 3 years ago

@hehonghui Hold on, let me take a look.

hehonghui commented 3 years ago

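@hehonghui Hold on, let me take a look.

I just fixed the crash, mainly by handling the mid and end values in reduce0: they can exceed the length of key_block_info_list, or they can both be 0, and the existing code did not handle these boundary cases. The problem of words not being found is still unsolved, though~ :) Thanks for your effort.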
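A common way for binary-search indices to run out of range is `mid - 1` wrapping around when `mid` is 0 and the indices are unsigned. Whether that is exactly what happened in reduce0 is an assumption here. The sketch below works on a plain sorted vector of strings rather than on key_block_info_list, and `find_sorted` is a hypothetical name; it uses a half-open range so that no index is ever decremented.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-alone example, not mdict-cpp code.
// Binary search over a sorted vector using the half-open range [lo, hi).
// The upper bound is set to `mid` instead of `mid - 1`, so an unsigned
// index can never wrap below zero, and an empty vector is handled.
// Returns the index of `word`, or -1 if it is absent.
inline long find_sorted(const std::vector<std::string>& keys,
                        const std::string& word) {
  std::size_t lo = 0;
  std::size_t hi = keys.size();
  while (lo < hi) {
    std::size_t mid = lo + (hi - lo) / 2;
    int cmp = word.compare(keys[mid]);
    if (cmp == 0) {
      return static_cast<long>(mid);
    }
    if (cmp > 0) {
      lo = mid + 1;  // word sorts after keys[mid]
    } else {
      hi = mid;      // word sorts before keys[mid]
    }
  }
  return -1;
}
```

With this shape, a missing word simply ends the loop with `lo == hi` and yields -1 instead of indexing past either end of the list.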

hehonghui commented 3 years ago

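I took a look: it is a binary search problem in reduce0. The current algorithm skips part of the index, so some words cannot be found. A plain for loop solves it in a simple, brute-force way (key_block_info_list is not large; for my test dictionary the printed size is only a few dozen, and at that scale there is no real performance difference between binary search and a plain for loop):

Note that the return type of the reduce functions has been changed to long.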

/**
 * look up a word in the dictionary file
 * @param word the word to search for
 * @return the definition, or an empty string if the word is not found
 */
std::string Mdict::lookup(const std::string word) {
  try 
  {
      // search word in key block info list
      long idx = this->reduce0(word, 0, this->key_block_info_list.size());
      std::cout << "==> lookup idx " << idx << std::endl;
      if (idx >= 0)
      {
        // decode key block by block id
        std::vector<key_list_item*> tlist = this->decode_key_block_by_block_id(idx);
        // reduce word id from key list item vector to get the word index of key list
        long word_id = reduce1(tlist, word);
        if ( word_id >= 0 )
        {
            // reduce search the record block index by word record start offset
            unsigned long record_block_idx = reduce2(tlist[word_id]->record_start);
            // decode the record block by record block index
            auto vec = decode_record_block_by_rid(record_block_idx);
            //  for(auto it= vec.begin(); it != vec.end(); ++it){
            //   std::cout<<"word: "<<(*it).first<<" \n def: "<<(*it).second<<std::endl;
            //  }
            // reduce the definition by word
            std::string def = reduce3(vec, word);
            return def;
        }
      }
    } 
    catch (std::exception& e)
    {
        std::cout << "==> lookup error" << e.what() << std::endl;
    }
    return std::string();
}

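/**
 * find which key block contains the key word
 * @param phrase the word to search for
 * @param start index of the first block to check
 * @param end one past the index of the last block to check
 * @return the block index, or -1 if no block can contain the word
 */
long Mdict::reduce0(
    std::string phrase, unsigned long start,
    unsigned long end) {  // linear scan over the sparse key block index
  // clamp end so an out-of-range value cannot index past the list
  if (end > this->key_block_info_list.size()) {
    end = this->key_block_info_list.size();
  }
  for (unsigned long i = start; i < end; ++i)
  {
      const std::string& first_key = this->key_block_info_list[i]->first_key;
      const std::string& last_key = this->key_block_info_list[i]->last_key;
      // debug output
      std::cout << "index : " << i << ", first_key : " << first_key << ", last_key : " << last_key << std::endl;
      // the block matches when first_key <= phrase <= last_key
      if (phrase.compare(first_key) >= 0 && phrase.compare(last_key) <= 0)
      {
         std::cout << ">>>>>>>>>>>> found index " << i << std::endl;
         return static_cast<long>(i);
      }
  }
  return -1;
}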

long Mdict::reduce1(
    std::vector<key_list_item*> wordlist,
    std::string phrase) {  // non-recursive binary search within one key block
  // half-open range [left, right): avoids unsigned underflow when the list
  // is empty or when mid is 0
  unsigned long left = 0;
  unsigned long right = wordlist.size();
  std::string word = _s(std::move(phrase));

  while (left < right) {
    unsigned long mid = left + ((right - left) >> 1);
    int comp = word.compare(_s(wordlist[mid]->key_word));
    if (comp == 0) {
      return static_cast<long>(mid);
    } else if (comp > 0) {
      left = mid + 1;
    } else {
      right = mid;
    }
  }
  return -1;  // word is not in this block
}
terasum commented 3 years ago

@hehonghui Sorry, I still need to work out why lookups fail. Your code now returns -1, which can represent the not-found case; I still need to test it.

terasum commented 3 years ago

@hehonghui Last night I improved the test framework. Today I'll look carefully at the problem you described.

terasum commented 3 years ago

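@hehonghui reduce0 is the index lookup step. The original binary search compared only against the given word, but what reduce0 searches is a sparse index, so when the comparison does not match exactly the search jumps to the next entry, and words are easily missed:

key_block_info_list
[{aback, cat},{firstKey2, secondKey2},{firstKey3, secondKey3}...]
Finding aback works, but a word between aback and cat goes straight to the end, so your approach of simply iterating over the list is the better solution.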
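For comparison, a binary search can still work on a sparse index if each step compares the word against the block's whole key range instead of a single key. This is only a sketch and not the code in mdict.cc: `BlockRange` and `find_block` are hypothetical names, and it assumes the blocks are sorted, do not overlap, and are compared with the same ordering used to build the index.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of key_block_info_list.
struct BlockRange {
  std::string first_key;
  std::string last_key;
};

// Range-aware binary search over a sparse index.
// Returns the index of the block with first_key <= phrase <= last_key,
// or -1 if the phrase falls outside every block.
inline long find_block(const std::vector<BlockRange>& blocks,
                       const std::string& phrase) {
  std::size_t lo = 0;
  std::size_t hi = blocks.size();
  while (lo < hi) {
    std::size_t mid = lo + (hi - lo) / 2;
    if (phrase.compare(blocks[mid].first_key) < 0) {
      hi = mid;      // phrase sorts before this block
    } else if (phrase.compare(blocks[mid].last_key) > 0) {
      lo = mid + 1;  // phrase sorts after this block
    } else {
      return static_cast<long>(mid);  // phrase lies inside this block
    }
  }
  return -1;
}
```

With a first block of {aback, cat}, a word such as cake lands in block 0 because it sorts between the two keys, which is the case the exact-match comparison missed. For a list of only a few dozen blocks the linear scan is just as reasonable.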
hehonghui commented 3 years ago

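@hehonghui reduce0 is the index lookup step. The original binary search compared only against the given word, but what reduce0 searches is a sparse index, so when the comparison does not match exactly the search jumps to the next entry, and words are easily missed:

key_block_info_list
[{aback, cat},{firstKey2, secondKey2},{firstKey3, secondKey3}...]
Finding aback works, but a word between aback and cat goes straight to the end, so your approach of simply iterating over the list is the better solution.

Today I am running some more complete tests. I exported the mdx dictionary attached above to a word list with mdxit-utils, then called mdict-cpp to query the words one by one. The vast majority can now be found (15773 words in total, 15724 found). Only the last forty or fifty words cannot be found; for those, the decode_record_block_by_rid function throws an exception.

Word list:

New_Oxford_Thesaurus.txt

Test code: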

#include "mdict_extern.h"

#include <sys/time.h>

#include <cassert>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

typedef long long int64;
class Timetool {
 public:
  static int64 getSystemTime() {
    timeval tv;
    gettimeofday(&tv, NULL);
    int64 t = tv.tv_sec;
    t *= 1000;
    t += tv.tv_usec / 1000;
    return t;
  }
};

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cout<<"please specify the mdx file and the word list file."<<std::endl;
        std::cout<<"for example:   ./target/bin/querytest  ./your.mdx  ./your_word_line_by_line.txt" <<std::endl;
        return -1;
    }
    if (strcmp(argv[1], "") == 0) {
        std::cout<<"please specify the mdx file"<<std::endl;
        return -1;
    }

    if (strcmp(argv[2], "") == 0) {
        std::cout<<"please specify the word list file"<<std::endl;
        return -1;
    }

  int64 t1 = Timetool::getSystemTime();
  void* dict = mdict_init(argv[1], "en_US.aff", "en_US.dic");

  int64 t2 = Timetool::getSystemTime();
  std::cout << "init cost time: " << t2 - t1 << "ms" << std::endl;
  // information
  std::cout<<"executable file name: "<<argv[0]<<std::endl;
  std::cout << "mdx : " << argv[1] <<std::endl;
  std::cout << "word file record : " << argv[2] <<std::endl;

  std::ifstream myfile;
  myfile.open(argv[2]);

  if(!myfile.is_open()) {
        perror("Error open");
        exit(EXIT_FAILURE);
  }

  std::vector<std::string> words;
  std::string line;
  while(getline(myfile, line)) {
      words.push_back(line);
      std::cout << line << std::endl;
  }
  std::cout << "total words : " << words.size() << std::endl;

  std::size_t foundCount = 0;
  for (std::size_t i = 0; i < words.size(); ++i)
  {
      char* result = mdict_query(dict, words[i].c_str());
      if ( result != nullptr )
      {
          foundCount++;
          std::cout << "found " << words[i] << ", count : " << foundCount << std::endl;
          // release the result buffer once it has been counted
          free(result);
      }
      else
      {
          std::cout << words[i] << " not found!" << std::endl;
      }
  }
  std::cout << "foundCount: " << foundCount << ", total: " << words.size() << std::endl;

  // every word exported from the dictionary is expected to be found
  assert(foundCount == words.size());

  int64 t3 = Timetool::getSystemTime();
  std::cout << "lookup cost time: " << t3 - t2 << " ms" << std::endl;

  mdict_destory(dict);
}
terasum commented 3 years ago

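@hehonghui I have fixed the possible out-of-bounds access when looking up the last word. In your test dictionary the last word is currently "zoom". Some special words, such as those with uppercase letters or spaces, still need to be fixed.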

terasum commented 3 years ago

Words with uppercase letters and spaces can now be queried. The following words still have some problems, but this does not affect normal use:

Table_AFRICAN PEOPLES
Table_ALLOYS
Table_ANTHROPOLOGISTS
Table_ASTRONOMERS
Table_BIOCHEMICAL SUGARS
Table_BRANCHES OF MATHEMATICS
Table_BRANCHES OF PHILOSOPHY
Table_CARDINAL VIRTUES & THEOLOGICAL VIRTUES
Table_CHILDREN'S GAMES
Table_COCKTAILS AND MIXED DRINKS
Table_CONTINENTS OF THE WORLD
Table_CREATURES FROM MYTHOLOGY AND FOLKLORE
Table_CUTS OR JOINTS OF MEAT
Table_DANCES AND TYPES OF DANCING
Table_DIETARY HABITS
Table_DWELLINGS
Table_FAMOUS GANGSTERS
Table_FILM DIRECTORS
Table_FIREWORKS
Table_FLOWERING PLANTS AND SHRUBS
Table_INVENTORS
Table_JAZZ MUSICIANS AND SINGERS
Table_MARSUPIALS
Table_MEASUREMENT UNITS
Table_NAMES OF CANONICAL HOURS
Table_NOBEL PRIZEWINNERS FOR PEACE & NOBEL PRIZEWINNERS FOR ECONOMIC SCIENCES
Table_PHONETIC ALPHABET
Table_POISONOUS PLANTS AND FUNGI
Table_PSYCHIATRISTS, PSYCHOLOGISTS, AND PSYCHOANALYSTS
Table_PSYCHOLOGICAL ILLNESSES AND CONDITIONS
Table_rely on; depend; trust
Table_SALAD DRESSINGS
Table_SEASHELLS
Table_SHAPES OF LENSES
Table_SNAKES
Table_THE THREE GRACES
Table_TITLES OF RULERS
Table_TYPEFACES
Table_TYPES AND FORMS OF PAINTING
Table_TYPES OF ANCHOR
Table_TYPES OF CLERICAL VESTMENT
Table_TYPES OF MUSICAL ORGAN
Table_TYPES OF ROPE
Table_TYPES OF SAW
Table_TYPES OF SCHOOL
Table_TYPES OF STORY
Table_TYPES OF TENT
Table_TYPES OF TOWER
Table_WEAVING TERMS
hehonghui commented 3 years ago

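@terasum I tested it last night, and the lookup match rate for the tested mdx has reached 100%. Have you considered pressing on to support versions below 2.0 and LZO decompression? With those two supported, the library would be fairly complete. :)

You can refer to this Java implementation: https://github.com/KnIfER/mdict-java/blob/master/src/main/java/com/knziha/plod/dictionary/mdict.java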

https://github.com/KnIfER/mdict-java/blob/master/src/main/java/com/knziha/plod/dictionary/mdBase.java
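As a starting point for the LZO part, the reader needs to tell which codec a block uses before decompressing it. The sketch below rests on an assumption about the file layout that should be checked against the reference implementations linked above: each block begins with a 4-byte type tag whose first byte is 0 for stored data, 1 for LZO, and 2 for zlib, with the other three bytes zero. `BlockCompression` and `block_compression` are hypothetical names, not part of mdict-cpp.

```cpp
#include <cstddef>

// Hypothetical names for illustration; not taken from mdict-cpp.
enum class BlockCompression { None, Lzo, Zlib, Unknown };

// Classifies a block by its leading 4-byte type tag.
// ASSUMPTION: tag byte 0 selects the codec (0 = stored, 1 = LZO, 2 = zlib)
// and tag bytes 1..3 are zero. Anything else is reported as Unknown so the
// caller can fail cleanly instead of decoding garbage.
inline BlockCompression block_compression(const unsigned char* block,
                                          std::size_t len) {
  if (len < 4 || block[1] != 0 || block[2] != 0 || block[3] != 0) {
    return BlockCompression::Unknown;
  }
  switch (block[0]) {
    case 0:
      return BlockCompression::None;
    case 1:
      return BlockCompression::Lzo;
    case 2:
      return BlockCompression::Zlib;
    default:
      return BlockCompression::Unknown;
  }
}
```

A decoder could switch on the returned value and hand the remaining bytes to the matching decompressor, returning an error for `Unknown` rather than asserting.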

terasum commented 3 years ago

@hehonghui OK, I'll find time to get this done soon.

hehonghui commented 2 years ago

@terasum Are there any plans for follow-up updates? :)

terasum commented 2 years ago

@hehonghui I'm currently optimizing the js version; I haven't had time to update the cpp version yet.

hehonghui commented 2 years ago

@hehonghui I'm currently optimizing the js version; I haven't had time to update the cpp version yet.

ok