facebookresearch / stopes

A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
https://facebookresearch.github.io/stopes/
MIT License

Filtering pipeline produces a config with wrong lang directions #27

Open molokanov50 opened 1 year ago

molokanov50 commented 1 year ago

I want to fine-tune an NLLB model on my own data, so as I see it the task is relatively simple: convert my dataset to fairseq format. I started using the stopes pipelines for this. However, although the directory structure of my dataset implies the eng_Latn-rus_Cyrl direction, the config.yaml produced by the filtering pipeline lists entirely different language pairs.

My dataset consists of 2 files (FTData is the root directory of my dataset):

FTData/eng_Latn-rus_Cyrl/mycorpus.eng_Latn.gz
FTData/eng_Latn-rus_Cyrl/mycorpus.rus_Cyrl.gz

First I run:

python stopes/stopes/pipelines/filtering/scripts/populate_data_conf.py --bt-root bt --mined-data-root mined --primary-train-paths FTData --data-conf-dir ConfOutput train_primary

where bt and mined are empty directories (I only have my own texts, without any preprocessing). Then:

python stopes/stopes/pipelines/filtering/scripts/compute_length_factors.py --data-conf-dir ConfOutput --flores-path flores

where flores is also an empty directory. I don't need any external corpora, since my goal is to fine-tune only on my data, but --flores-path is a required parameter of compute_length_factors.py, so I assumed I could point it at an arbitrary directory. Lastly:

python stopes/stopes/pipelines/filtering/filter.py output_dir=FTFiltered data_conf_dir=ConfOutput

My FTFiltered/config.yaml file looks as follows:

data_conf_dir: /home/molokanov/myapp3/ConfOutput
directions:
- eng_Latn-lij_Latn
- eng_Latn-scn_Latn
executor:
  cluster: local
  log_folder: executor_logs
  slurm_partition: null
output_dir: /home/molokanov/myapp3/FTFiltered
train_bt: null
train_mined: null
train_primary:
  dedup_filter:
    _target_: stopes.pipelines.filtering.filters.DedupFilter
    dedup_pairs: true
    max_source_dedup: null
    max_target_dedup: null
  excluded_corpora: null
  included_corpora:
  - nllbseed
  - tatoeba
  laser_filter: null
  length_filter:
    _target_: stopes.pipelines.filtering.filters.LengthFilter
    max_len: 1050
    max_len_ratio: 9.0
    min_len: 5
    min_src_unique_ratio: null
  lid_filter: null
  normalize_punctuation: true
  normalize_unicode: false
  toxicity_filter: null

As you can see, the config lists eng_Latn-lij_Latn and eng_Latn-scn_Latn, neither of which is in my dataset. At the same time, eng_Latn-rus_Cyrl, the pair I actually need, is missing from the config. I also don't understand why nllbseed and tatoeba are listed as included corpora in my config.yaml.
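To make the mismatch concrete, here is a minimal Python sketch that derives the directions implied by the directory layout described in this post and compares them with the `directions` list from the generated config. It assumes the layout `<root>/<src>-<tgt>/<corpus>.<lang>.gz`; the helper names `directions_from_layout` and `compare_directions` are hypothetical and not part of stopes, and the check says nothing about how stopes itself discovers directions.

```python
from pathlib import Path
import tempfile


def directions_from_layout(root):
    """Return the sorted directions implied by <root>/<src>-<tgt>/<corpus>.<lang>.gz."""
    found = set()
    for pair_dir in Path(root).iterdir():
        if not pair_dir.is_dir() or "-" not in pair_dir.name:
            continue
        src, tgt = pair_dir.name.split("-", 1)
        # Collect the language suffix of every "<corpus>.<lang>.gz" file.
        langs = {p.name.split(".")[-2] for p in pair_dir.glob("*.*.gz")}
        # Simplification: only require that both sides exist, in any corpus.
        if {src, tgt} <= langs:
            found.add(pair_dir.name)
    return sorted(found)


def compare_directions(config_directions, data_directions):
    """Return (missing from the config, listed in the config but not in the data)."""
    config, data = set(config_directions), set(data_directions)
    return sorted(data - config), sorted(config - data)


# Recreate the layout from this post in a temporary directory (empty files).
with tempfile.TemporaryDirectory() as tmp:
    pair_dir = Path(tmp) / "FTData" / "eng_Latn-rus_Cyrl"
    pair_dir.mkdir(parents=True)
    for lang in ("eng_Latn", "rus_Cyrl"):
        (pair_dir / f"mycorpus.{lang}.gz").touch()
    data_directions = directions_from_layout(Path(tmp) / "FTData")

# Directions copied from the FTFiltered/config.yaml shown above.
config_directions = ["eng_Latn-lij_Latn", "eng_Latn-scn_Latn"]
missing, unexpected = compare_directions(config_directions, data_directions)
print("missing from config:", missing)
print("not in dataset:", unexpected)
```

On the layout above, this reports eng_Latn-rus_Cyrl as missing from the config and the two eng_Latn-lij_Latn / eng_Latn-scn_Latn entries as not backed by any data.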

gordicaleksa commented 1 year ago
  1. Your config is wrong: remove those 2 directions and add eng_Latn-rus_Cyrl (or, more likely, I think only eng-rus will work). Do the same for the opposite direction if you need it.
  2. You need to add your corpora as well.
  3. Download the flores dataset and use it to prepare those 2 configs. I ended up with something like the one below: (image)
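Points 1 and 2 above could be sketched in Python roughly as follows, operating on the config once it has been loaded into a dict. This is only an illustration under stated assumptions: `patch_filter_config` is a hypothetical helper and not a stopes function, the corpus name `mycorpus` is inferred from the file names in the question, and the thread does not settle whether the pipeline expects `eng_Latn-rus_Cyrl` or `eng-rus`, nor which config file should actually be edited.

```python
import copy


def patch_filter_config(config, directions, corpora, groups=("train_primary",)):
    """Return a copy of config with directions and included_corpora replaced."""
    patched = copy.deepcopy(config)
    patched["directions"] = list(directions)
    for group in groups:
        # Groups set to null in config.yaml (e.g. train_bt) load as None; skip them.
        if patched.get(group) is not None:
            patched[group]["included_corpora"] = list(corpora)
    return patched


# A trimmed-down version of the config from the question, as a plain dict.
config = {
    "directions": ["eng_Latn-lij_Latn", "eng_Latn-scn_Latn"],
    "train_bt": None,
    "train_mined": None,
    "train_primary": {"included_corpora": ["nllbseed", "tatoeba"]},
}

# Assumption: full NLLB-style codes; the reply suggests "eng-rus" may be needed instead.
patched = patch_filter_config(
    config,
    directions=["eng_Latn-rus_Cyrl", "rus_Cyrl-eng_Latn"],
    corpora=["mycorpus"],
)
print(patched["directions"])
print(patched["train_primary"]["included_corpora"])
```

The helper returns a patched copy and leaves the original dict untouched, so the two versions can be diffed before anything is written back to disk.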