NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

SentencePiece tokenizer from script "process_asr_text_tokenizer" #4654

Closed sac-1999 closed 2 years ago

sac-1999 commented 2 years ago

I was fine-tuning a pretrained Citrinet 512 model (encoder frozen) on the Filipino language. I trained a custom tokenizer with a vocab size of 1024 and replaced the pretrained Citrinet tokenizer with my custom tokenizer.

data_dir = "." !python3 ./scripts/process_asr_text_tokenizer.py --manifest="{data_dir}/Filipino/train_manifest.json" --data_root="{data_dir}/complete_tg/tokenizers/tokenizer_1/" --vocab_size=1024 --tokenizer="spe" --spe_type="bpe" --spe_character_coverage=1.0 --log

Sample vocab:

```
a
i
o
n
t
c
e
l
r
s
p
f
g
k
m
v
```

This script produces tokenizer.model and a vocabulary file. In the vocab a special character "▁" is present; this is not a normal underscore, which is "_".

Can someone explain why I am getting this special token? Because of this character I am unable to deploy my ASR model in Riva Speech with the Flashlight decoder. Error during deployment with the Flashlight decoder:

```
I0712 13:32:15.565105 94 ctc-decoder-library.cc:23] TRITONBACKEND_ModelInstanceInitialize: nemo_asr_riva_pipeline_filipino-ctc-decoder-cpu-streaming_0 (device 0)
terminate called after throwing an instance of 'std::runtime_error'
  what():  [LoadWords] Invalid line: ▁

Riva waiting for Triton server to load all models...retrying in 1 second
/opt/riva/bin/start-riva: line 4: 94 Aborted (core dumped) tritonserver --log-verbose=0 --strict-model-config=true $model_repos --cuda-memory-pool-byte-size=0:1000000000
```

itzsimpl commented 2 years ago

@sac-1999 the special character indicates a space and is used by SPE to mark/detect word boundaries.
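You can see this directly by loading the generated model with the sentencepiece library (a quick sketch; the example text and the pieces shown are illustrative and depend on your trained vocab):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Word-initial pieces carry the "▁" word-boundary marker; a plain "_" never appears
print(sp.encode("kumusta ka", out_type=str))
# e.g. ['▁ku', 'musta', '▁ka'] -- the actual split depends on the learned vocab
```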

There may be multiple reasons why you are seeing this. Your vocabulary may contain characters that cannot be tokenised. The issue may also be related to the type of SentencePiece model and the way Riva's servicemaker script converts the supplied vocabulary to lexicon format, i.e. running the tokenizer over the supplied vocabulary (https://forums.developer.nvidia.com/t/bug-riva-deploy-model-with-non-unigram-bpe-tokenizer/200522).

Anyway, a workaround is to construct the lexicon on your own (checking for uncovered characters and so on) and, instead of supplying the vocabulary via --decoding_vocab, supply the lexicon via --decoding_lexicon.
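A sketch of that construction, assuming tokenizer.model from the script above and a plain word list words.txt (both file names are placeholders; the exact lexicon format expected by the Flashlight decoder is described in the tutorial linked below):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

with open("words.txt") as fin, open("lexicon.txt", "w") as fout:
    for line in fin:
        word = line.strip().lower()
        if not word:
            continue
        # Skip words containing characters the tokenizer cannot cover
        if sp.unk_id() in sp.encode(word):
            continue
        # One lexicon line per word: the word, a tab, then its SentencePiece pieces
        pieces = sp.encode(word, out_type=str)
        fout.write(word + "\t" + " ".join(pieces) + "\n")
```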

There is a dedicated section in the Riva documentation on lexicon preparation, https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/asr-python-advanced-customize-vocabulary-and-lexicon.html#customizing-pronunciation-with-lexicon-mapping, where everything is explained in detail.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 2 years ago

This issue was closed because it has been inactive for 7 days since being marked as stale.