facebookresearch / stopes

A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
https://facebookresearch.github.io/stopes/
MIT License

Filtering and Preparing the Data to finetune NLLB-200 #15

Closed ibtiRaj closed 1 year ago

ibtiRaj commented 1 year ago

Hey,

I want to fine-tune the NLLB-200 model. As instructed by the data README, I have run the filtering pipeline and got the output of populate_data_conf.py and compute_length_factors.py, but I don't know how to run the prepare_data pipeline, especially the --data-config parameter required by prepare_data.py. Could you provide an example? Thanks a lot.

kauterry commented 1 year ago

The README for prepare_data answers this. (https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/prepare_data) Here's an example config:

binarization_config:
  binarize_workers: 60
  max_examples_per_shard: 500000000
  random_seed: 0
  smallest_shard: 500000
executor_config:
  cluster: local
  log_folder: executor_logs
preprocessing_config:
  max_tokens: null
  moses_config:
    deescape_special_chars: false
    lowercase: false
    normalize_punctuation: true
    remove_non_printing_chars: false
    script_directory: <PATH_TO_FAIRSEQ_DIR>/fairseq-py/examples/nllb/modeling/preprocessing/moses
  preprocess_source: true
  preprocess_target: true
  sample_size: null
  tag_data: true
source_vocab_config:
  pretrained: null
  vocab_build_params:
    character_coverage: 0.99995
    model_type: bpe
    random_seed: 0
    sampled_data_size: 10000000
    sampling_temperature: 1.0
    shuffle_input_sentence: true
    use_joined_data: true
    vocab_size: 8000
target_vocab_config:
  pretrained: null
  vocab_build_params:
    character_coverage: 0.99995
    model_type: bpe
    random_seed: 0
    sampled_data_size: 10000000
    sampling_temperature: 1.0
    shuffle_input_sentence: true
    use_joined_data: true
    vocab_size: 8000
test_corpora:
  eng-ibo:
    values:
      flores_devtest:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>
train_corpora:
  eng-ibo:
    values:
      public_bitext:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>
train_mining_corpora: null
train_mmt_bt_corpora: null
train_smt_bt_corpora: null
valid_corpora:
  eng-ibo:
    values:
      flores_dev:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>

The data_path format is detailed in the README; you need to organize your corpus files in a specific way for them to be read.

Make sure you download the Moses scripts into your fairseq directory. The pipeline runs only if this script exists: examples/nllb/modeling/preprocessing/moses/clean-corpus-n.perl (https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/prepare_data/encode_and_binarize.py#L78)
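
For reference, a minimal sketch of copying the standard Moses preprocessing scripts from a local clone of moses-smt/mosesdecoder into the directory you point moses_config.script_directory at. The clone location, target directory, and exact list of scripts are assumptions (taken from the moses_config options above plus clean-corpus-n.perl); adjust them to your setup:

import shutil
from pathlib import Path

# assumptions: a local clone of https://github.com/moses-smt/mosesdecoder
# and a target directory matching moses_config.script_directory
MOSESDECODER = Path("mosesdecoder")
SCRIPT_DIR = Path("preprocessing/moses")

# script names guessed from the moses_config options in the example config,
# plus clean-corpus-n.perl, which encode_and_binarize.py checks for
WANTED = {
    "scripts/training/clean-corpus-n.perl": "clean-corpus-n.perl",
    "scripts/tokenizer/normalize-punctuation.perl": "normalize-punctuation.perl",
    "scripts/tokenizer/remove-non-printing-char.perl": "remove-non-printing-char.perl",
    "scripts/tokenizer/deescape-special-chars.perl": "deescape-special-chars.perl",
    "scripts/tokenizer/lowercase.perl": "lowercase.perl",
}

SCRIPT_DIR.mkdir(parents=True, exist_ok=True)
for src, dst in WANTED.items():
    shutil.copy(MOSESDECODER / src, SCRIPT_DIR / dst)
    print(f"copied {dst} into {SCRIPT_DIR}")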

kauterry commented 1 year ago

You need to run the filtering pipeline to filter out data based on the following heuristics: length, deduplication, LASER margin score threshold, LID score thresholds, and toxicity. It's not sufficient to just get the output of populate_data_conf.py and compute_length_factors.py; the output of these scripts is then passed into the filtering pipeline. This is detailed in the README here: https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/filtering
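
To illustrate the flavour of two of these heuristics, here is a toy length-ratio and exact-deduplication filter. This is not the stopes filtering pipeline, just a sketch of the kind of checks it applies, and the thresholds are made up:

def toy_filter(pairs, max_len=250, max_len_ratio=3.0):
    """Drop empty, overly long, badly length-mismatched, or duplicate sentence pairs."""
    seen = set()
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0 or max(n_src, n_tgt) > max_len:
            continue  # length filter
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
            continue  # length-ratio filter
        if (src, tgt) in seen:
            continue  # exact deduplication
        seen.add((src, tgt))
        yield src, tgt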

After filtering, you build a vocabulary (SentencePiece model) and then encode and binarize your data, which can then be fed into fairseq for training. The filtered datasets (the output of the filtering pipeline) should be fed into the prepare_data pipeline.
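
As a rough illustration of what vocab_build_params in the config corresponds to, this is how you would train a comparable BPE model directly with the sentencepiece library. It is not the pipeline's own vocabulary-building code, the input path is hypothetical, and the mapping of sampled_data_size to input_sentence_size is an assumption:

import sentencepiece as spm

# illustration only: a standalone BPE model with the values from vocab_build_params
spm.SentencePieceTrainer.train(
    input="filtered/train.eng",       # hypothetical path to filtered training text
    model_prefix="spm_source",        # produces spm_source.model and spm_source.vocab
    model_type="bpe",                 # vocab_build_params.model_type
    vocab_size=8000,                  # vocab_build_params.vocab_size
    character_coverage=0.99995,       # vocab_build_params.character_coverage
    shuffle_input_sentence=True,      # vocab_build_params.shuffle_input_sentence
    input_sentence_size=10_000_000,   # assumed equivalent of sampled_data_size
)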

ibtiRaj commented 1 year ago

@kauterry Thank you very much for your detailed answer. I tried to do what you said: first I ran the filtering script, then I created a config.yaml file like the one you shared with me (screenshot attached). But when I run prepare_data.py I get an error (screenshot attached).

ibtiRaj commented 1 year ago

You need to run the filtering pipeline to filter out data based on the following heuristics: length, deduplication, LASER margin score threshold, LID score thresholds, and toxicity. It's not sufficient to just get the output of populate_data_conf.py and compute_length_factors.py; the output of these scripts is then passed into the filtering pipeline. This is detailed in the README here: https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/filtering

After filtering, you build a vocabulary (SentencePiece model) and then encode and binarize your data, which can then be fed into fairseq for training. The filtered datasets (the output of the filtering pipeline) should be fed into the prepare_data pipeline.

You said that the output of the filtering pipeline should be fed into the prepare_data pipeline, but I don't know where and how; the parameters of prepare_data.py are: (screenshot attached)

I'm sorry if I misunderstood, and thank you.

kauterry commented 1 year ago

You shouldn't include "perl" in your moses script directory path. It should be just the path to the directory containing your Moses preprocessing scripts.

ibtiRaj commented 1 year ago

Yes, you're right, thanks. But I still get the same error: "empty training data".

kauterry commented 1 year ago

Can you run ls on your script_moses directory and show me the output?

ibtiRaj commented 1 year ago

(screenshot of the ls output attached)

Is it clear?

ibtiRaj commented 1 year ago

(second screenshot of the ls output attached)

kauterry commented 1 year ago

This file needs to exist at this exact path: moses_config.script_directory/normalize-punctuation.perl. Make sure you put that file in moses_config.script_directory and adjust this argument accordingly.
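
A quick sanity check before re-running the pipeline (just a sketch; the file names are the ones mentioned in this thread, and the directory is a placeholder):

from pathlib import Path

script_dir = Path("<PATH_TO_MOSES_SCRIPTS>")  # same value as moses_config.script_directory
for name in ("normalize-punctuation.perl", "clean-corpus-n.perl"):
    path = script_dir / name
    print(f"{path}: {'OK' if path.is_file() else 'MISSING'}")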

ibtiRaj commented 1 year ago

I've tried to do what you said (screenshot attached), but I still get the same error. So I tried to trace the error and see what is causing the problem. I think the problem is that when prepare_data.py calls the validate_data_config function on line 297, this function seems to return empty folders. Do you think this is really the cause of the problem?

(screenshots of the traced error attached)
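
One thing worth ruling out for an "empty training data" error is a corpus path in the config that doesn't exist or points to an empty file. Here is a small sketch that walks a config with the same structure as the example above and reports on every source/target file; the config file name is a placeholder:

from pathlib import Path
import yaml

cfg = yaml.safe_load(open("config.yaml"))  # placeholder: your prepare_data config
for section in ("train_corpora", "valid_corpora", "test_corpora"):
    for direction, corpora in (cfg.get(section) or {}).items():
        for corpus_name, corpus in corpora["values"].items():
            for side in ("source", "target"):
                path = Path(corpus[side])
                ok = path.is_file() and path.stat().st_size > 0
                status = "OK" if ok else "MISSING or EMPTY"
                print(f"{section}/{direction}/{corpus_name}/{side}: {path} -> {status}")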

ibtiRaj commented 1 year ago

@kauterry Thank you very much, I have solved the problem and finished the first and second steps: filtering and data preparation.