Closed hehonghui closed 2 years ago
@hehonghui Could you provide the original dictionary file and the error message?
@terasum Thanks for the reply. My test environment is as follows:
The error message is as follows:
executable file name: ./bin/mydict
../mdict-analysis/xnj_jinshan.mdx
Assertion failed: (num_entries_counter == this->entries_num), function decode_key_block_info, file /Users/mrsimple/dev/github/mdict-cpp/mdict.cc, line 1246.
[1] 72930 abort ./bin/mydict ../mdict-analysis/xnj_jinshan.mdx hello
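My initial guess is a big-endian/little-endian byte-order conversion problem; my machine should be little-endian.
@hehonghui Is your environment a PowerPC CPU? Byte order should already be handled in a platform-independent way, so the problem may be in the byte-to-int conversion. I'll test this locally.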
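The byte-to-int step mentioned above can be made independent of host byte order by assembling the value with shifts instead of reinterpreting the buffer. The following is a minimal sketch, assuming the file stores the integer most-significant byte first; `read_be_u32` is a hypothetical helper name, not a function taken from mdict-cpp.

```cpp
#include <cstdint>

// Hypothetical helper (not part of mdict-cpp): decode a 4-byte big-endian
// unsigned integer. Shifts operate on values rather than on memory layout,
// so the result is the same on little-endian and big-endian hosts.
inline std::uint32_t read_be_u32(const unsigned char* p) {
  return (static_cast<std::uint32_t>(p[0]) << 24) |
         (static_cast<std::uint32_t>(p[1]) << 16) |
         (static_cast<std::uint32_t>(p[2]) << 8) |
         static_cast<std::uint32_t>(p[3]);
}
```

Taking `const unsigned char*` rather than `const char*` avoids sign extension when a byte is 0x80 or above, which is a common source of wrong values in this kind of conversion.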
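@terasum Happy New Year. If you have any updates, please push them and I can help test. Also, the goldendict project has a Qt implementation of mdx parsing that you could refer to. :)
@hehonghui Sorry, I had no time to look at this during the Spring Festival. I'll fix it soon; thank you very much for the reminder.
@hehonghui I looked at the code over the weekend. Your dictionary uses a fairly old version of the format, and my code does not implement that logic yet, so I need some time to debug it.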
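Right, the mdx file I used earlier was version 1.2. While testing recently I found another problem: if the queried word does not exist in the dictionary, an exception is raised and the process crashes. Judging from the error, the mid and end indexes in the reduce0 function are going wrong.
For reference, the error output is below; the problem appears to be in the reduce function.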
init cost time: 36ms
executable file name: ./target/bin/mydict
mdx : ./New_Oxford_Thesaurus.mdx
word : One
[1] 41774 segmentation fault ./target/bin/mydict ./New_Oxford_Thesaurus.mdx One
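Another problem with the mid and end indexes in reduce0 is that words the Python script can find cannot be found with mdict-cpp. Here is the dictionary I tested with (try looking up words such as case and cake):
@hehonghui One moment, let me take a look.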
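I just fixed the crash, mainly by handling the mid and end values in reduce0: they may be larger than the length of key_block_info_list, or they may all be 0, and the existing code did not handle these boundary cases. The failed lookups are still unresolved, though~ :) Thanks for your work on this.
I took a look: the problem is the binary search in reduce0. The current algorithm skips some index entries, so certain words cannot be found. A plain for loop solves it in a simple, brute-force way (key_block_info_list is not large; for my test dictionary the printed size is only a few dozen, and at that scale binary search and a plain for loop make no practical difference in speed):
Note that the return type of the reduce-related functions has been changed to long.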
/**
 * look up a word in the dictionary
 * @param word the word to search for
 * @return the definition, or an empty string if the word is not found
 */
std::string Mdict::lookup(const std::string word) {
  try
  {
    // search word in key block info list
    long idx = this->reduce0(word, 0, this->key_block_info_list.size());
    std::cout << "==> lookup idx " << idx << std::endl;
    if (idx >= 0)
    {
      // decode key block by block id
      std::vector<key_list_item*> tlist = this->decode_key_block_by_block_id(idx);
      // reduce word id from key list item vector to get the word index of key list
      long word_id = reduce1(tlist, word);
      if (word_id >= 0)
      {
        // find the record block index by the word's record start offset
        unsigned long record_block_idx = reduce2(tlist[word_id]->record_start);
        // decode the record block by record index
        auto vec = decode_record_block_by_rid(record_block_idx);
        // for(auto it= vec.begin(); it != vec.end(); ++it){
        //   std::cout<<"word: "<<(*it).first<<" \n def: "<<(*it).second<<std::endl;
        // }
        // reduce the definition by word
        std::string def = reduce3(vec, word);
        return def;
      }
    }
  }
  catch (const std::exception& e)
  {
    std::cout << "==> lookup error: " << e.what() << std::endl;
  }
  // not found, or an error occurred
  return std::string();
}
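/**
 * find which key block contains the key word
 * @param phrase the word to search for
 * @param start first block index to examine
 * @param end one past the last block index to examine
 * @return the index of the matching block, or -1 if no block contains the word
 */
long Mdict::reduce0(
    std::string phrase, unsigned long start,
    unsigned long end) {  // linear scan instead of binary search
  // clamp end so an oversized value cannot index past the list
  if (end > this->key_block_info_list.size())
  {
    end = this->key_block_info_list.size();
  }
  for (unsigned long i = start; i < end; ++i)
  {
    std::string first_key = this->key_block_info_list[i]->first_key;
    std::string last_key = this->key_block_info_list[i]->last_key;
    std::cout << "index : " << i << ", first_key : " << first_key << ", last_key : " << last_key << std::endl;
    // the block matches when first_key <= phrase <= last_key
    if (phrase.compare(first_key) >= 0 && phrase.compare(last_key) <= 0)
    {
      std::cout << ">>>>>>>>>>>> found index " << i << std::endl;
      return static_cast<long>(i);
    }
  }
  return -1;
}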
long Mdict::reduce1(
    std::vector<key_list_item*> wordlist,
    std::string phrase) {  // non-recursive binary search
  // half-open interval [left, right): the unsigned bounds never underflow,
  // even for an empty list or when the word sorts before index 0
  unsigned long left = 0;
  unsigned long right = wordlist.size();
  std::string word = _s(std::move(phrase));
  while (left < right) {
    unsigned long mid = left + ((right - left) >> 1);
    int comp = word.compare(_s(wordlist[mid]->key_word));
    if (comp == 0)
    {
      return static_cast<long>(mid);
    } else if (comp > 0) {
      left = mid + 1;
    } else {
      right = mid;
    }
  }
  // word is not in this key block
  return -1;
}
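@hehonghui Sorry, I still need to investigate why the lookup fails. Your code now returns -1, which can represent the not-found case; I still need to test it.
@hehonghui Last night I improved the test framework. Today I'll look closely at the problem you described.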
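@hehonghui reduce0 is the index lookup step. The original binary search compared only against one specific key, but the index that reduce0 searches is sparse, so when the comparison does not match exactly it jumps to the next entry and the word is easily missed:
key_block_info_list
[{aback, cat},{firstKey2, secondKey2},{firstKey3, secondKey3}...]
Finding aback works, but a word between aback and cat runs straight to the end of the list, so your direct traversal is the better solution.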
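If a binary search were still wanted for larger indexes, it would have to compare the word against each block's key range rather than against a single key. The following is a minimal sketch of that idea, not the fix adopted in this thread. `BlockRange` and `find_block` are hypothetical names, and the sketch assumes the blocks are sorted and non-overlapping under the same ordering that `std::string::compare` uses, which may not hold if the dictionary sorts keys case-insensitively.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-in for one sparse-index entry; the real
// key_block_info_list items in mdict-cpp carry more fields.
struct BlockRange {
  std::string first_key;
  std::string last_key;
};

// Hypothetical helper: return the index of the block whose
// [first_key, last_key] range contains `word`, or -1 if none does.
long find_block(const std::vector<BlockRange>& blocks,
                const std::string& word) {
  std::size_t left = 0;
  std::size_t right = blocks.size();  // half-open interval [left, right)
  while (left < right) {
    std::size_t mid = left + (right - left) / 2;
    if (word.compare(blocks[mid].first_key) < 0) {
      right = mid;     // word sorts before this block
    } else if (word.compare(blocks[mid].last_key) > 0) {
      left = mid + 1;  // word sorts after this block
    } else {
      return static_cast<long>(mid);  // first_key <= word <= last_key
    }
  }
  return -1;  // word falls in a gap or outside every block
}
```

With a first block of {aback, cat}, a word such as cake is neither before aback nor after cat, so the search stops at that block instead of running to the end of the list. For an index of only a few dozen entries the linear scan remains the simpler choice.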
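Today I ran some more complete tests. I exported the mdx dictionary attached above to a word list with mdxit-utils and then queried every word through mdict-cpp. The vast majority are found (15773 words in total, 15724 found); only the last forty or fifty words at the tail still fail, with an exception thrown in the decode_record_block_by_rid function.
Word list:
Test code: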
#include "mdict_extern.h"
#include <sys/time.h>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

typedef long long int64;

class Timetool {
 public:
  // current wall-clock time in milliseconds
  static int64 getSystemTime() {
    timeval tv;
    gettimeofday(&tv, NULL);
    int64 t = tv.tv_sec;
    t *= 1000;
    t += tv.tv_usec / 1000;
    return t;
  }
};

int main(int argc, char** argv) {
  if (argc < 3) {
    std::cout << "please specify the mdx file and the word list file." << std::endl;
    std::cout << "for example: ./target/bin/querytest ./your.mdx ./your_word_line_by_line.txt" << std::endl;
    return -1;
  }
  if (strcmp(argv[1], "") == 0) {
    std::cout << "please specify the mdx file" << std::endl;
    return -1;
  }
  if (strcmp(argv[2], "") == 0) {
    std::cout << "please specify the word record file" << std::endl;
    return -1;
  }

  int64 t1 = Timetool::getSystemTime();
  void* dict = mdict_init(argv[1], "en_US.aff", "en_US.dic");
  int64 t2 = Timetool::getSystemTime();
  std::cout << "init cost time: " << t2 - t1 << "ms" << std::endl;

  // information
  std::cout << "executable file name: " << argv[0] << std::endl;
  std::cout << "mdx : " << argv[1] << std::endl;
  std::cout << "word file record : " << argv[2] << std::endl;

  // read the word list, one word per line
  std::ifstream myfile;
  myfile.open(argv[2]);
  if (!myfile.is_open()) {
    perror("Error open");
    exit(EXIT_FAILURE);
  }
  std::vector<std::string> words;
  std::string line;
  while (getline(myfile, line)) {
    words.push_back(line);
    std::cout << line << std::endl;
  }
  std::cout << "total words : " << words.size() << std::endl;

  // query every word and count the hits
  std::size_t foundCount = 0;
  for (std::size_t i = 0; i < words.size(); ++i) {
    char* result = mdict_query(dict, words[i].c_str());
    if (result != nullptr) {
      foundCount++;
      std::cout << "found " << words[i] << ", count : " << foundCount << std::endl;
      free(result);
    } else {
      std::cout << words[i] << " not found!" << std::endl;
    }
  }
  std::cout << "foundCount: " << foundCount << ", total: " << words.size() << std::endl;
  // fail loudly if any word from the list could not be found
  assert(foundCount == words.size());

  int64 t3 = Timetool::getSystemTime();
  std::cout << "lookup cost time: " << t3 - t2 << " ms" << std::endl;
  mdict_destory(dict);
  return 0;
}
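@hehonghui I have fixed the possible out-of-bounds access when looking up the last word; the last word in your test dictionary is currently "zoom". Some special words, such as those with uppercase letters or spaces, still need fixing.
Words with uppercase letters and spaces can now be queried. The following words still have some problems, but this no longer affects normal use: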
Table_AFRICAN PEOPLES
Table_ALLOYS
Table_ANTHROPOLOGISTS
Table_ASTRONOMERS
Table_BIOCHEMICAL SUGARS
Table_BRANCHES OF MATHEMATICS
Table_BRANCHES OF PHILOSOPHY
Table_CARDINAL VIRTUES & THEOLOGICAL VIRTUES
Table_CHILDREN'S GAMES
Table_COCKTAILS AND MIXED DRINKS
Table_CONTINENTS OF THE WORLD
Table_CREATURES FROM MYTHOLOGY AND FOLKLORE
Table_CUTS OR JOINTS OF MEAT
Table_DANCES AND TYPES OF DANCING
Table_DIETARY HABITS
Table_DWELLINGS
Table_FAMOUS GANGSTERS
Table_FILM DIRECTORS
Table_FIREWORKS
Table_FLOWERING PLANTS AND SHRUBS
Table_INVENTORS
Table_JAZZ MUSICIANS AND SINGERS
Table_MARSUPIALS
Table_MEASUREMENT UNITS
Table_NAMES OF CANONICAL HOURS
Table_NOBEL PRIZEWINNERS FOR PEACE & NOBEL PRIZEWINNERS FOR ECONOMIC SCIENCES
Table_PHONETIC ALPHABET
Table_POISONOUS PLANTS AND FUNGI
Table_PSYCHIATRISTS, PSYCHOLOGISTS, AND PSYCHOANALYSTS
Table_PSYCHOLOGICAL ILLNESSES AND CONDITIONS
Table_rely on; depend; trust
Table_SALAD DRESSINGS
Table_SEASHELLS
Table_SHAPES OF LENSES
Table_SNAKES
Table_THE THREE GRACES
Table_TITLES OF RULERS
Table_TYPEFACES
Table_TYPES AND FORMS OF PAINTING
Table_TYPES OF ANCHOR
Table_TYPES OF CLERICAL VESTMENT
Table_TYPES OF MUSICAL ORGAN
Table_TYPES OF ROPE
Table_TYPES OF SAW
Table_TYPES OF SCHOOL
Table_TYPES OF STORY
Table_TYPES OF TENT
Table_TYPES OF TOWER
Table_WEAVING TERMS
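@terasum I tested last night, and the lookup match rate for the tested mdx is now 100%. Have you considered pushing on to support versions below 2.0 and LZO decompression? With those two covered, the library would be fairly complete. :)
For reference, see this Java implementation: https://github.com/KnIfER/mdict-java/blob/master/src/main/java/com/knziha/plod/dictionary/mdict.java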
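One difference that other open-source MDX readers account for in pre-2.0 files is the width of the integers in the block headers: 4 bytes before version 2.0 and 8 bytes from 2.0 onwards. This is an assumption drawn from those readers, not something confirmed in this thread, and should be checked against the implementation linked above. The sketch below shows how a reader might select the width from the version; `number_width` and `read_be_number` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption (not confirmed in this thread): header numbers are 4 bytes wide
// before format version 2.0 and 8 bytes wide from 2.0 onwards.
inline std::size_t number_width(double version) {
  return version >= 2.0 ? 8 : 4;
}

// Hypothetical helper: decode a big-endian unsigned integer of `width` bytes
// (width <= 8), independent of host byte order.
inline std::uint64_t read_be_number(const unsigned char* p, std::size_t width) {
  std::uint64_t value = 0;
  for (std::size_t i = 0; i < width; ++i) {
    value = (value << 8) | p[i];
  }
  return value;
}
```

If that assumption holds, reading a version 1.2 header with the 8-byte width would produce wrong entry counts, which could explain a failure like the num_entries assertion reported at the start of this issue.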
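@hehonghui OK, I'll find time soon to get this part done.
@terasum Any plans for further updates? :)
@hehonghui I'm currently optimizing the JS version; I haven't had time to update the C++ version yet.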
ok
@hehonghui Can you provide further information about this? mdx file or something?