daac-tools / vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer
https://docs.rs/vibrato
Apache License 2.0
337 stars 14 forks source link

Dictionary::read() cannot read compressed dictionary #122

Closed saito-kosuke closed 1 year ago

saito-kosuke commented 1 year ago

When loading the compiled IPA dictionary based on https://github.com/daac-tools/vibrato#1-dictionary-preparation, the error below occurred.

malloc: can't allocate region : mach_vm_map(size=8776504127295307776, flags: 100) failed (error code=3) malloc: set a breakpoint in malloc_error_break to debug memory allocation of 8776504127295305000 bytes failed

vbkaisetsu commented 1 year ago

@saito-kosuke Thank you for the report.

It seems that an incorrect dictionary was loaded, and Vibrato tried to allocate excessive memory. Multiple versions of dictionary files are available to support different versions of Vibrato. Could you update the branch (git pull), download the latest dictionary, and retry tokenization?

If it still does not work, could you paste the commands you tried?

saito-kosuke commented 1 year ago

@vbkaisetsu Thank you for the reply. I am using https://github.com/daac-tools/vibrato/releases/download/v0.4.0/ipadic-mecab-2_7_0.tar.xz as the dictionary and vibrato v0.4.0. I try to execute the code on lambda rust runtime, so did the error occur?

Here's my code snippet.

let dict = Dictionary::read(File::open("src/dictionary/system.dic.zst")?)?;
let tokenizer = vibrato::Tokenizer::new(dict);
let mut worker = tokenizer.new_worker();
worker.reset_sentence("本場 讃岐うどん 20時間 長期熟成で旨さ、のどごし、コシが格段に違います");
worker.tokenize();
for n in 0..worker.num_tokens() {
    let toten = worker.token(n);
    println!("surface -> {:?}", toten.surface());
    println!("feature -> {:?}", toten.feature());
}
vbkaisetsu commented 1 year ago

@saito-kosuke Thank you for the details.

system.dic.zst is a compressed dictionary, but Dictionary::read() does not support it, so you need to decompress it beforehand. This is an omission in the documentation, so we have to update it.

Could you check if it reads correctly by using zstd crate?

Example:

# Cargo.toml
[dependencies]
zstd = "0.12"
let reader = zstd::Decoder::new(File::open("src/dictionary/system.dic.zst")?)?;
let mut dict = Dictionary::read(reader)?;
saito-kosuke commented 1 year ago

@vbkaisetsu The source code above has worked properly. Thank you for your quick reply. May I ask you one question?

When profiling the source code, the line below takes 7.759 seconds to process. If you know some tips to deal with that, I hopefully want to know.

let mut dict = Dictionary::read(reader)?;
vbkaisetsu commented 1 year ago

@saito-kosuke Unlike MeCab, which performs mmap, Vibrato reads all dictionary data into RAM in advance, so it takes a little longer when reading. In my environment, without optimization (--release flag), the above line takes 1.5 seconds. Without optimization, it takes 6.4 seconds.

saito-kosuke commented 1 year ago

@vbkaisetsu Sounds good. It looks better to me if vibrato::Tokenizer::new can receive a reference rather than a value itself. This is because I want to prepare the dictionary in advance to streamline the process as follows,

static DICTIONARY: OnceCell<Dictionary> = OnceCell::new();
let reader = zstd::Decoder::new(File::open("src/dictionary/system.dic.zst").unwrap()).unwrap();
let dict = Dictionary::read(reader).unwrap();
DICTIONARY.set(dict);

but I can't pass the reference to vibrato::Tokenizer::new now.

let d = DICTIONARY.get().unwrap();
let tokenizer = vibrato::Tokenizer::new(d);
let end = start.elapsed();

If I try to pass out the value, the error below has occurred.

let tokenizer = vibrato::Tokenizer::new(*d);

cannot move out of `*d` which is behind a shared reference
move occurs because `*d` has type `Dictionary`, which does not implement the `Copy` trait
kampersanda commented 1 year ago

@saito-kosuke Thank you for your report.

If you know some tips to deal with that, I hopefully want to know.

One approach to loading a dictionary faster is extracting the zstd-compressed file in advance, as follows.

Although we distribute dictionaries in zstd format and provide CLI tools to handle the compressed ones, our vibrato::dictionary::Dictionary::read() is not restricted to zstd only, accepting Read objects (ref).

Therefore, I recommend extracting system.dic.zst in advance.

$ unzstd ipadic-mecab-2_7_0/system.dic.zst
$ ls ipadic-mecab-2_7_0
system.dic ...

And then, simply load the uncompressed file as follows:

let reader = BufReader::new(File::open("ipadic-mecab-2_7_0/system.dic")?);
let dict = Dictionary::read(reader)?;

With my environment, the loading time was reduced from 0.85 sec to 0.17 sec.

saito-kosuke commented 1 year ago

@kampersanda Thank you for your reply. Understood. Let met try it some more.

saito-kosuke commented 1 year ago

Just to share information, I can prepare the dictionary storing a tokenizer as a global object as follows regarding the below. https://github.com/daac-tools/vibrato/issues/122#issuecomment-1427487320

global object

static TOKENIZER: OnceCell<Tokenizer> = OnceCell::new();

caller

let tokenizer = TOKENIZER.get().unwrap();
let mut worker = tokenizer.new_worker();

preparation

let reader = BufReader::new(File::open("src/dictionary/system.dic")?);
let dict = Dictionary::read(reader)?;
let tokenizer = vibrato::Tokenizer::new(dict);
TOKENIZER.set(tokenizer);

Thank you for your awesome work. We will use the library.