facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

preprocessing sentencepiece issue #4439

Open mrlabied opened 2 years ago

mrlabied commented 2 years ago

Hello, I want to build a speech translation model for an Arabic dialect using the fairseq toolkit, so I started by preparing my data to train my model. For preparing the data I was inspired by the example below: https://github.com/facebookresearch/fairseq/blob/main/examples/speech_to_text/prep_mustc_data.py

1- I run this command to start preparing the data: `python custom_example/prep_data.py --data-root D:\Users\user\Documents\projects\code_test\My-DATASET --task asr --vocab-type unigram --vocab-size 5000`. Everything goes fine until it reaches this part of the code: `sp.SentencePieceTrainer.Train(" ".join(arguments))` (source: https://github.com/facebookresearch/fairseq/blob/main/examples/speech_to_text/data_utils.py).
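For context, a minimal sketch of what that step boils down to; the real helper is fairseq's gen_vocab in data_utils.py, so treat this as an illustration, not the actual implementation:

```python
# Hedged sketch of the failing step, assuming a corpus file written one
# sentence per line. The file name "drj_vocab" is taken from the logs below.
import sentencepiece as sp

arguments = [
    "--input=drj_vocab",                   # relative path to the text corpus
    "--model_prefix=spm_unigram5000_asr",  # output .model/.vocab prefix
    "--model_type=unigram",
    "--vocab_size=5000",
]
sp.SentencePieceTrainer.Train(" ".join(arguments))
```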

The script stops with the following error:

```
denormalizer_spec {}
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: drj_vocab
trainer_interface.cc(385) LOG(INFO) Loaded all 0 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <pad>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
Traceback (most recent call last):
  .....
  .....
  File "D:\Users\user\Documents\projects\code_test\fairseq\custom_example\data_utils.py", line 58, in gen_vocab
    sp.SentencePieceTrainer.Train(" ".join(arguments))
  File "D:\projects_envs\fairseq-main\lib\site-packages\sentencepiece\__init__.py", line 407, in Train
    return SentencePieceTrainer._TrainFromString(arg)
  File "D:\projects_envs\fairseq-main\lib\site-packages\sentencepiece\__init__.py", line 385, in _TrainFromString
    return _sentencepiece.SentencePieceTrainer__TrainFromString(arg)
RuntimeError: Internal: C:\projects\sentencepiece\src\trainer_interface.cc(406) [!sentences_.empty()]
```

However, when I run the same chunk of code in my console, using the argument values I had already inspected while debugging the program, it works fine.

Please help me figure out where the issue is or what I have to change.

gmryu commented 2 years ago

Sorry, I may not be much help since I have never done speech translation or sentencepiece training on my own.

I believe you used an editor with a breakpoint feature to debug the program. When the program works in your console, does it print the same line? `trainer_interface.cc(385) LOG(INFO) Loaded all 0 sentences` Add some breakpoints before that log happens and inspect those variables again. You may also want to look at the joined argument string `arg`.
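For example, a sketch like the following, placed right before the Train call, would show whether the corpus file is even visible from the script's working directory (the `--input=drj_vocab` path in the logs is relative, so the working directory matters; this is only a guess):

```python
import os

corpus = "drj_vocab"  # the relative --input path seen in the logs
print("cwd:", os.getcwd())
print("corpus exists:", os.path.isfile(corpus))
if os.path.isfile(corpus):
    with open(corpus, encoding="utf-8") as f:
        print("corpus lines:", sum(1 for _ in f))
```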

Wishing you progress.

mrlabied commented 2 years ago

When the program works in my console, it logs the following:

```
trainer_interface.cc(178) LOG(INFO) Loading corpus: drj_vocab
trainer_interface.cc(385) LOG(INFO) Loaded all 76 sentences
........
......
trainer_interface.cc(615) LOG(INFO) Saving model: D:/Users/user/Documents/projects/code_test/my-dataset/drj-ar/spm_unigram5000_asr.model
trainer_interface.cc(626) LOG(INFO) Saving vocabs: D:/Users/user/Documents/projects/code_test/my-dataset/drj-ar/spm_unigram5000_asr.vocab
```

For the printed arguments, they are:

```
--input=drj_vocab --model_prefix=D:/Users/maria/Documents/projects/code_test/my-dataset/drj-ar/spm_unigram5000_asr --model_type=unigram --vocab_size=360 --character_coverage=1.0 --num_threads=4 --unk_id=3 --bos_id=0 --eos_id=2 --pad_id=1
```

gmryu commented 2 years ago

Then you should inspect around those sentences and find out why there are 76 sentences in your console but 0 sentences in the other run. At this point, it might not be sentencepiece's problem; could the data have gotten corrupted?

mrlabied commented 2 years ago

But why does the same piece of code give two different outputs?

gmryu commented 2 years ago

Considering the given information, which is really scarce, the leads are the following:

  1. You gave `--data-root D:\Users\user\Documents\projects\code_test\My-DATASET` and got the error; what data was given in the console run that loaded 76 sentences? Is it the same data in the same place? If it is different, could you swap the data and run the code again?
  2. What is the difference in your execution commands?
  3. Are there any other differences between the successful spm run and the failed one, before the "76 sentences" and "0 sentences" lines? You may use an editor to compare the logs; even GitHub might do it.
  4. Ask on sentencepiece's GitHub. If you really want to solve this and no one can help, you will have to read trainer_interface.cc. The C++ file is in sentencepiece's repository, here: https://github.com/google/sentencepiece/blob/master/src/trainer_interface.cc. Start by searching for LOG(INFO) and follow the flow. Your first aim will be to see how sentencepiece loads in those sentences, and why you got 0 sentences.

mrlabied commented 2 years ago

Here I'll show you the log from running my dataset-preparation code, in which I logged the args passed to `sp.SentencePieceTrainer.Train(" ".join(arguments))`.

```
args: --input=drj_vocab --model_prefix=D:/Users/maria/Documents/projects/code_test/dar-dataset/drj-ar/spm_unigram32_asr --model_type=unigram --vocab_size=32 --character_coverage=1.0 --num_threads=4 --unk_id=3 --bos_id=0 --eos_id=2 --pad_id=1
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=drj_vocab --model_prefix=D:/Users/maria/Documents/projects/code_test/dar-dataset/drj-ar/spm_unigram32_asr --model_type=unigram --vocab_size=32 --character_coverage=1.0 --num_threads=4 --unk_id=3 --bos_id=0 --eos_id=2 --pad_id=1
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
  input: drj_vocab
  input_format:
  model_prefix: D:/Users/maria/Documents/projects/code_test/dar-dataset/drj-ar/spm_unigram32_asr
  model_type: UNIGRAM
  vocab_size: 32
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 4
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars:
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 3
  bos_id: 0
  eos_id: 2
  pad_id: 1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface: Ôüç
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: drj_vocab
trainer_interface.cc(385) LOG(INFO) Loaded all 0 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <pad>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
Traceback (most recent call last):
  File "darija_example/prep_darija_data.py", line 237, in <module>
    main()
  File "darija_example/prep_darija_data.py", line 233, in main
    process(args)
  File "darija_example/prep_darija_data.py", line 177, in process
    args.vocab_size,
  File "D:\Users\maria\Documents\projects\code_test\fairseq\darija_example\data_utils.py", line 54, in gen_vocab
    sp.SentencePieceTrainer.Train(" ".join(arguments))
  File "D:\projects_envs\fairseq-main\lib\site-packages\sentencepiece\__init__.py", line 407, in Train
    return SentencePieceTrainer._TrainFromString(arg)
  File "D:\projects_envs\fairseq-main\lib\site-packages\sentencepiece\__init__.py", line 385, in _TrainFromString
    return _sentencepiece.SentencePieceTrainer__TrainFromString(arg)
RuntimeError: Internal: C:\projects\sentencepiece\src\trainer_interface.cc(406) [!sentences_.empty()]
```

Then I ran the same code with the printed args in my console, and the output is as follows:

```
spm.SentencePieceTrainer.train("""--input=drj_vocab --model_prefix=D:/Users/maria/Documents/projects/code_test/dar-dataset/drj-ar/spm_unigram32_asr --model_type=unigram --vocab_size=32 --character_coverage=1.0 --num_threads=4 --unk_id=3 --bos_id=0 --eos_id=2 --pad_id=1""")
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=drj_vocab --model_prefix=D:/Users/maria/Documents/projects/code_test/dar-dataset/drj-ar/spm_unigram32_asr --model_type=unigram --vocab_size=32 --character_coverage=1.0 --num_threads=4 --unk_id=3 --bos_id=0 --eos_id=2 --pad_id=1
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
  input: drj_vocab
  input_format:
  model_prefix: D:/Users/maria/Documents/projects/code_test/dar-dataset/drj-ar/spm_unigram32_asr
  model_type: UNIGRAM
  vocab_size: 32
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 4
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars:
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 3
  bos_id: 0
  eos_id: 2
  pad_id: 1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface: ⁇
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: drj_vocab
trainer_interface.cc(385) LOG(INFO) Loaded all 9 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <pad>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
trainer_interface.cc(466) LOG(INFO) all chars count=95
trainer_interface.cc(487) LOG(INFO) Alphabet size=24
trainer_interface.cc(488) LOG(INFO) Final character coverage=1
trainer_interface.cc(520) LOG(INFO) Done! preprocessed 9 sentences.
unigram_model_trainer.cc(139) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(143) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(194) LOG(INFO) Initialized 29 seed sentencepieces
trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 9
trainer_interface.cc(537) LOG(INFO) Done! 11
unigram_model_trainer.cc(489) LOG(INFO) Using 11 sentences for EM training
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=27 obj=19.4386 num_tokens=62 num_tokens/piece=2.2963
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=27 obj=19.4748 num_tokens=62 num_tokens/piece=2.2963
trainer_interface.cc(615) LOG(INFO) Saving model: D:/Users/maria/Documents/projects/code_test/dar-dataset/drj-ar/spm_unigram32_asr.model
trainer_interface.cc(626) LOG(INFO) Saving vocabs: D:/Users/maria/Documents/projects/code_test/dar-dataset/drj-ar/spm_unigram32_asr.vocab
```

Do you really think the issue is related to sentencepiece?

gmryu commented 2 years ago

To determine what raised your issue, it is always a good start to read through your error log. After all, the last line says: `sentencepiece\src\trainer_interface.cc(406) [!sentences_.empty()]`

This directly tells you what to search for in general, and on which GitHub repository you may want to post your question. (No offense; it is your freedom to ask around, and that is okay.)

--

So I googled `sentencepiece sentences_.empty()`, and I found something that may help you: https://github.com/google/sentencepiece/issues/517. The author says at the end that overly long sentences are ignored due to the default max length. You should check whether the data is written one sentence per line, and raise the acceptable max length. If none of the solutions there help, then it is still better to ask on sentencepiece's GitHub. By all means, you have done the basic searching and effort, so you deserve more help.
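If long lines turn out to be the culprit, raising the limit is a one-flag change. A sketch with the arguments from the logs above; the default of 4192 appears in the trainer_spec, and 100000 here is just an arbitrary illustrative value:

```python
import sentencepiece as spm

# Same training call as in the logs, with max_sentence_length raised above
# the 4192-byte default so long lines are no longer silently dropped.
spm.SentencePieceTrainer.Train(
    "--input=drj_vocab "
    "--model_prefix=spm_unigram32_asr "
    "--model_type=unigram --vocab_size=32 "
    "--max_sentence_length=100000"
)
```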

--

It may be offensive and rude of me to say this, so I will apologize first. I wrote about the data as a factor at first, and what I got back was only a log, which was surprising and made me a bit sad. Well, I also checked the log, and I noticed that `unk_surface` differs between the two runs. The failed one has `Ôüç` but the succeeded one has `⁇`. Sentencepiece says `⁇` is the default (not necessarily the best). So is this expected? This might happen because of your environment, or because you specified a different vocab.
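For what it's worth, `Ôüç` is exactly what the UTF-8 bytes of `⁇` (U+2047) look like when decoded as code page 850, a common default in Windows consoles, so the difference between the two runs may only be in how each console decodes the log output. A quick check, assuming Python 3:

```python
import locale
import sys

# Compare the encodings the two environments actually use.
print(sys.stdout.encoding, locale.getpreferredencoding())
# Reproduce the mojibake seen in the failed run's log.
print("⁇".encode("utf-8").decode("cp850"))  # prints Ôüç
```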

mrlabied commented 2 years ago

Thank you for your effort in helping me solve this issue, and I apologize for not answering your inquiries.

1- I have already visited this sentencepiece issue and verified each possibility discussed there: https://github.com/google/sentencepiece/issues/517, but nothing worked.

2- In the log I attached, in the first part I log the arguments that are passed to `sentencepiece.train(" ".join(args))` while running the code to prepare my dataset, following the MuST-C to English speech translation example. Then I take the same logged args and run `sentencepiece.train(" ".join(args))` in the PyCharm console with the same activated environment. I can't understand how the same piece of code, with the same arguments and the same environment, gives two different outputs.

3- I apologize for not answering your question about the data. Here is a piece of my dataset:

[dataset screenshots]

4- The tree of my dataset directory is: [directory tree screenshot]


gmryu commented 2 years ago

To be honest, I may not fully understand what you said about the execution and the environment. At the same time, I am too much of a beginner to help you. I suggest you take this question to sentencepiece's GitHub too. (If you already did, great.)

By the way, you do not need to reply to me, since I am not being helpful at all. It would be noble of you to report here how the issue was solved, once you have finished.

--

For what I can say, there must be at least one difference. I assume you succeeded when you used the PyCharm console; and where did you fail? (I have not used PyCharm, so I am not sure those environments were the same.)

Also, do the successful command and the failed command look at different data files?

I know you have looked into all of these. But since you seem troubled, adding a few more explanations may help others understand your situation. What are the two commands and scripts? Were they, and the data, the same? Were they all run in the same environment? If everything is the same and yet you got both a success and a failure, you must have overlooked something.
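One concrete way to rule out the simplest difference: since `--input=drj_vocab` is a relative path, the script run and the console run may simply resolve it in different working directories (or to an empty file), which alone would explain 0 sentences versus 9. A hedged sketch; run it in both places and compare the output:

```python
import os

corpus = "drj_vocab"  # the relative --input path from the logs
# Print where this run actually looks for the corpus, and how big it is.
print("resolved to:", os.path.abspath(corpus))
print("size:", os.path.getsize(corpus) if os.path.isfile(corpus) else "missing")
```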