AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License
214 stars 59 forks

Having issue in finetuning #54

Closed pr509 closed 5 months ago

pr509 commented 5 months ago

I am having an issue while finetuning the model. After doing all the preprocessing, when I try to run the finetune.sh shell script, I get this error: "The dataset is empty. This could indicate that all elements in the dataset have been skipped. Try increasing the max number of allowed tokens or using a larger dataset." Also, can you tell me whether I need to make any changes to the code if I want to finetune on only 4 language pairs?

PranjalChitale commented 5 months ago

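Can you confirm that you built sentencepiece from source following the steps (https://github.com/google/sentencepiece/?tab=readme-ov-file#build-and-install-sentencepiece-command-line-tools-from-c-source)?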

Can you paste the preprocess.log from the final_bin directory here? That would help us understand whether binarization was successful.

Can you check whether GNU parallel is installed? If not, you can either install it or remove all instances of parallel --pipe --keep-order in prepare_data_joint_finetuning.sh and apply_sentence_piece.sh, then retry.
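
For anyone who wants to check both prerequisites in one go, here is a minimal sketch. It assumes the scripts call the binaries parallel (GNU parallel) and spm_encode (the SentencePiece command-line encoder from a from-source build); those names and the helper find_missing_tools are assumptions for illustration, not taken from the repo.

```python
import shutil

# Assumed tool names: "parallel" is GNU parallel, "spm_encode" is the
# SentencePiece command-line encoder installed by a from-source build.
REQUIRED_TOOLS = ("parallel", "spm_encode")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

If anything is reported missing, the preprocessing scripts may silently produce empty or partial output, which would fit the "dataset is empty" error.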

pr509 commented 5 months ago

I have created the dataset in the same format that the IndicTrans2 repo suggests.

pr509 commented 5 months ago

This is the Python script I have created.

PranjalChitale commented 5 months ago

The script you shared is not accessible.

Also, please read my earlier comment and provide the details I asked for, so that we can go forward and debug the issue.

pr509 commented 5 months ago

Yes, I have built sentencepiece from source. I am sharing the log: preprocess.log

PranjalChitale commented 5 months ago

Can you double check your training data?

From the logs, it appears that you are trying to train with just 4 sentences, and even among those, half of the tokens are being replaced by unknown tokens.
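
A rough way to double check this before binarization is to count sentences and out-of-vocabulary tokens directly. The sketch below assumes a fairseq-style dictionary file with one "<token> <count>" entry per line and an already tokenized (SentencePiece-encoded) training file; load_vocab and corpus_stats are hypothetical helper names, not part of the repo.

```python
def load_vocab(dict_lines):
    """Parse fairseq-style dictionary lines ("<token> <count>") into a set."""
    vocab = set()
    for line in dict_lines:
        # Split off the trailing count; the token is everything before it.
        parts = line.rstrip("\n").rsplit(" ", 1)
        if parts and parts[0]:
            vocab.add(parts[0])
    return vocab


def corpus_stats(sentences, vocab):
    """Count non-empty sentences, tokens, and the share of tokens not in vocab."""
    n_sents = n_tokens = n_unk = 0
    for sentence in sentences:
        tokens = sentence.split()
        if not tokens:
            continue  # skip blank lines
        n_sents += 1
        n_tokens += len(tokens)
        n_unk += sum(1 for token in tokens if token not in vocab)
    unk_rate = n_unk / n_tokens if n_tokens else 0.0
    return {"sentences": n_sents, "tokens": n_tokens, "unk_rate": unk_rate}
```

To use it on real data, pass open file handles for the dictionary and the training file. A very small sentence count or a high unk_rate suggests the data or the tokenization step needs another look.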

pr509 commented 5 months ago

Thanks for your help, the issue has been resolved. I just need help with one more thing: have you trained your model for do-not-translate? Can you help me out with where I could use do-not-translate in the model?