I want to fine-tune an NLLB model on my own data, so as I see it, the task is relatively simple: convert my dataset to fairseq format. I started using the stopes pipelines for this. However, although the directory structure of my dataset implies the eng_Latn-rus_Cyrl language direction, the config.yaml produced by the filtering pipeline lists completely different language pairs.
My dataset consists of 2 files (FTData is the root directory of my dataset):
FTData/eng_Latn-rus_Cyrl/mycorpus.eng_Latn.gz
FTData/eng_Latn-rus_Cyrl/mycorpus.rus_Cyrl.gz
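Before running any pipeline step, it may help to confirm that the layout above is what I think it is. The following is only a minimal sanity-check sketch, not how stopes itself discovers corpora: it assumes the layout <root>/<src>-<tgt>/<corpus>.<lang>.gz shown above, and the function name check_parallel_layout is made up for illustration.

```python
import gzip
from pathlib import Path


def check_parallel_layout(root):
    """Return {direction: {corpus: line_count}} for a layout like
    <root>/<src>-<tgt>/<corpus>.<src>.gz plus <corpus>.<tgt>.gz.

    Hypothetical helper, not part of stopes. Raises ValueError if the
    target side of a corpus is missing or the two sides differ in length.
    """
    report = {}
    for direction_dir in sorted(Path(root).iterdir()):
        # Only directories named "<src>-<tgt>" are treated as directions.
        if not direction_dir.is_dir() or "-" not in direction_dir.name:
            continue
        src, tgt = direction_dir.name.split("-", 1)
        corpora = {}
        for src_file in sorted(direction_dir.glob(f"*.{src}.gz")):
            corpus = src_file.name[: -len(f".{src}.gz")]
            tgt_file = direction_dir / f"{corpus}.{tgt}.gz"
            if not tgt_file.exists():
                raise ValueError(f"missing target side: {tgt_file}")
            counts = []
            for path in (src_file, tgt_file):
                with gzip.open(path, "rt", encoding="utf-8") as fh:
                    counts.append(sum(1 for _ in fh))
            if counts[0] != counts[1]:
                raise ValueError(
                    f"{corpus}: {counts[0]} source lines vs {counts[1]} target lines"
                )
            corpora[corpus] = counts[0]
        report[direction_dir.name] = corpora
    return report
```

Calling check_parallel_layout("FTData") on my dataset should report a single direction, eng_Latn-rus_Cyrl, with one corpus named mycorpus.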
Then I run:
python stopes/stopes/pipelines/filtering/scripts/populate_data_conf.py --bt-root bt --mined-data-root mined --primary-train-paths FTData --data-conf-dir ConfOutput train_primary
where bt and mined are empty directories (initially I have only my own texts, without any preprocessing).
Then:
python stopes/stopes/pipelines/filtering/scripts/compute_length_factors.py --data-conf-dir ConfOutput --flores-path flores
where flores is also an empty directory (I don't need any external corpora, since my goal is to fine-tune only on my data, but --flores-path is a required parameter of compute_length_factors.py, so I assumed I could point it at an arbitrary directory).
And lastly:
python stopes/stopes/pipelines/filtering/filter.py output_dir=FTFiltered data_conf_dir=ConfOutput
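For reproducibility, the three steps can be wrapped in a small script. This is just a sketch: the script paths, flags and arguments are copied from the commands in this post, while the helper names build_commands and run_all and the use of sys.executable in place of a bare python are my own choices.

```python
import subprocess
import sys

# Path to the filtering pipeline, as used in the commands in this post.
SCRIPTS = "stopes/stopes/pipelines/filtering"


def build_commands(data_root="FTData", conf_dir="ConfOutput", out_dir="FTFiltered",
                   bt_root="bt", mined_root="mined", flores_path="flores"):
    """Build the three command lines from this post (hypothetical helper)."""
    py = sys.executable  # same interpreter that runs this script
    return [
        [py, f"{SCRIPTS}/scripts/populate_data_conf.py",
         "--bt-root", bt_root,
         "--mined-data-root", mined_root,
         "--primary-train-paths", data_root,
         "--data-conf-dir", conf_dir,
         "train_primary"],
        [py, f"{SCRIPTS}/scripts/compute_length_factors.py",
         "--data-conf-dir", conf_dir,
         "--flores-path", flores_path],
        [py, f"{SCRIPTS}/filter.py",
         f"output_dir={out_dir}",
         f"data_conf_dir={conf_dir}"],
    ]


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Running run_all(build_commands()) from the directory that contains stopes, FTData, bt, mined and flores repeats the steps above; check=True makes the script stop as soon as one step exits with an error instead of silently continuing.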
My FTFiltered/config.yaml file looks as follows:
As you can see, eng_Latn-lij_Latn and eng_Latn-scn_Latn are not in my dataset, yet they appear in the config. At the same time, eng_Latn-rus_Cyrl, the language pair I actually need, is missing from it.
I also don't understand why nllbseed and tatoeba are listed as included corpora in my config.yaml.
Your config is not correct: remove those 2 directions and add eng_Latn-rus_Cyrl (or, more likely I think, only eng-rus will work). Do the same for the opposite direction if you need it.
You need to add your corpora as well.
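To see at a glance which directions a config mentions, a plain text scan is enough. The sketch below deliberately assumes nothing about the config.yaml schema: it only looks for tokens shaped like eng_Latn-rus_Cyrl, so short forms such as eng-rus are not matched. The function names and the sample text are made up for illustration.

```python
import re

# Matches tokens such as "eng_Latn-rus_Cyrl": <lang>_<Script>-<lang>_<Script>.
DIRECTION_RE = re.compile(r"\b[a-z]{3}_[A-Z][a-z]{3}-[a-z]{3}_[A-Z][a-z]{3}\b")


def directions_in_text(config_text):
    """Return the set of direction-like tokens found anywhere in the text."""
    return set(DIRECTION_RE.findall(config_text))


def compare_directions(config_text, expected):
    """Report expected directions that are missing and found ones that are unexpected."""
    found = directions_in_text(config_text)
    expected = set(expected)
    return {
        "missing": sorted(expected - found),
        "unexpected": sorted(found - expected),
    }


# Illustrative text only, not the real contents of FTFiltered/config.yaml.
sample = "directions:\n- eng_Latn-lij_Latn\n- eng_Latn-scn_Latn\n"
result = compare_directions(sample, ["eng_Latn-rus_Cyrl"])
print(result)
```

With the real file, pass open("FTFiltered/config.yaml", encoding="utf-8").read() as config_text; once the config is fixed, both lists should be empty.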
Download the FLORES dataset and use it to prepare those 2 configs. I ended up with something like the following: