facebookresearch / stopes

A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
https://facebookresearch.github.io/stopes/
MIT License

How to create training data through pipeline #14

Open b3y0nd opened 1 year ago

b3y0nd commented 1 year ago

I want to train the NLLB model. As instructed by the data README, I have tried the filtering pipeline and got the output of populate_data_conf.py and compute_length_factors.py, but I don't know how to run the prepare_data pipeline, in particular the three parameters required by prepare_data.py, such as the yaml file expected by the --data-config parameter. Could you provide an example? Thanks a lot.

In addition, what is the relationship between the filtering pipeline and the prepare_data pipeline? The latter doesn't seem to use the output of the former.

Also, the compute_length_factors.py script used in the filtering pipeline doesn't seem to have been updated: it still requires the flores101 dataset instead of flores200.

kauterry commented 1 year ago

The README for prepare_data answers this: https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/prepare_data. Here's an example config:

binarization_config:
  binarize_workers: 60
  max_examples_per_shard: 500000000
  random_seed: 0
  smallest_shard: 500000
executor_config:
  cluster: local
  log_folder: executor_logs
preprocessing_config:
  max_tokens: null
  moses_config:
    deescape_special_chars: false
    lowercase: false
    normalize_punctuation: true
    remove_non_printing_chars: false
    script_directory: <PATH_TO_FAIRSEQ_DIR>/fairseq-py/examples/nllb/modeling/preprocessing/moses
  preprocess_source: true
  preprocess_target: true
  sample_size: null
  tag_data: true
source_vocab_config:
  pretrained: null
  vocab_build_params:
    character_coverage: 0.99995
    model_type: bpe
    random_seed: 0
    sampled_data_size: 10000000
    sampling_temperature: 1.0
    shuffle_input_sentence: true
    use_joined_data: true
    vocab_size: 8000
target_vocab_config:
  pretrained: null
  vocab_build_params:
    character_coverage: 0.99995
    model_type: bpe
    random_seed: 0
    sampled_data_size: 10000000
    sampling_temperature: 1.0
    shuffle_input_sentence: true
    use_joined_data: true
    vocab_size: 8000
test_corpora:
  eng-ibo:
    values:
      flores_devtest:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>
train_corpora:
  eng-ibo:
    values:
      public_bitext:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>
train_mining_corpora: null
train_mmt_bt_corpora: null
train_smt_bt_corpora: null
valid_corpora:
  eng-ibo:
    values:
      flores_dev:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>

The data_path format is detailed in the README; you need to organize your corpus files in a specific way for them to be read.
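As a quick sanity check before launching the pipeline, you can verify that every corpus file referenced in the config exists. Here is an optional helper sketch (not part of stopes), assuming the config above has been saved as prepare_data.yaml and that pyyaml is available:

# check_config_paths.py -- optional helper sketch, not part of stopes.
# Verifies that every source/target file referenced in the prepare_data
# config exists on disk before the pipeline is launched.
from pathlib import Path

import yaml  # pip install pyyaml

CONFIG = "prepare_data.yaml"  # assumed filename for the config shown above

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

missing = []
for section in ("train_corpora", "valid_corpora", "test_corpora"):
    for direction, entry in (cfg.get(section) or {}).items():
        for corpus_name, corpus in (entry.get("values") or {}).items():
            for side in ("source", "target"):
                path = corpus.get(side)
                if path and not Path(path).exists():
                    missing.append(f"{section}/{direction}/{corpus_name}/{side}: {path}")

print("\n".join(missing) if missing else "All corpus paths exist.")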

Make sure you download the Moses scripts into your fairseq directory. The pipeline runs only if this script exists: examples/nllb/modeling/preprocessing/moses/clean-corpus-n.perl (https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/prepare_data/encode_and_binarize.py#L78)
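A quick way to check this from Python before launching, using the script_directory value from the preprocessing_config above (substitute your real fairseq path):

from pathlib import Path

# script_directory from the preprocessing_config above
moses_dir = Path("<PATH_TO_FAIRSEQ_DIR>/fairseq-py/examples/nllb/modeling/preprocessing/moses")
assert (moses_dir / "clean-corpus-n.perl").exists(), "Moses scripts not found; download them first."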

kauterry commented 1 year ago

You need to run the filtering pipeline to filter out data based on the following heuristics: length, deduplication, LASER margin score threshold, LID score thresholds, and toxicity. It's not sufficient to just get the output of populate_data_conf.py and compute_length_factors.py; the output of these scripts is then passed into the filtering pipeline. This is detailed in the README here: https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/filtering
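For intuition only, here is a toy sketch of the kinds of per-pair checks those heuristics correspond to. This is not the stopes implementation; the threshold values and score inputs below are invented for illustration, and the real pipeline takes its thresholds from the filtering config:

# Toy illustration of filtering heuristics -- NOT the stopes implementation.
# All thresholds below are invented; the real values come from the filtering config.
seen_pairs = set()

def keep_pair(src, tgt, laser_score, src_lid_prob, tgt_lid_prob,
              toxic_words=frozenset()):
    # Length heuristics: drop empty, overlong, or badly length-mismatched pairs.
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if not (1 <= src_len <= 250 and 1 <= tgt_len <= 250):
        return False
    if max(src_len, tgt_len) / min(src_len, tgt_len) > 9.0:
        return False
    # Deduplication: drop exact duplicate pairs.
    key = (src.strip(), tgt.strip())
    if key in seen_pairs:
        return False
    seen_pairs.add(key)
    # LASER margin score threshold (relevant for mined bitext).
    if laser_score is not None and laser_score < 1.06:
        return False
    # LID score thresholds: both sides must look like their declared language.
    if src_lid_prob < 0.5 or tgt_lid_prob < 0.5:
        return False
    # Toxicity: drop pairs containing words from a toxicity word list.
    tokens = set(src.lower().split()) | set(tgt.lower().split())
    if tokens & toxic_words:
        return False
    return True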

After filtering, you build a vocabulary (SentencePiece model) and then encode and binarize your data, which can then be fed into fairseq for training. The filtered datasets (the output of the filtering pipeline) should be fed into the prepare_data pipeline. You're right, we should have the filtering pipeline output the data_config for prepare_data. We are working on changes to refactor both of these pipelines and will push out a change soon to address that.
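For reference, the vocab_build_params in the config above correspond roughly to a SentencePiece training call like the one below. This is only to illustrate what the parameters mean; the pipeline handles vocabulary building (including sampling the training data at the given temperature) for you, and the input file and model_prefix here are placeholders:

import sentencepiece as spm

# Rough equivalent of vocab_build_params from the config above.
spm.SentencePieceTrainer.train(
    input="sampled_train_data.txt",    # placeholder: sampled (or joined) training text
    model_prefix="spm.eng-ibo",        # placeholder: produces spm.eng-ibo.model / .vocab
    model_type="bpe",                  # model_type: bpe
    vocab_size=8000,                   # vocab_size: 8000
    character_coverage=0.99995,        # character_coverage: 0.99995
    shuffle_input_sentence=True,       # shuffle_input_sentence: true
    input_sentence_size=10_000_000,    # roughly sampled_data_size: 10000000
)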

b3y0nd commented 1 year ago

Thank you very much for your answer. I read another issue, #15, and I still have the same question: how does the prepare_data pipeline use the output of the filtering pipeline? Judging from the parameters used to run the prepare_data pipeline, the two seem to be unrelated.

kauterry commented 1 year ago

Apologies, you are correct. Currently the filtering pipeline doesn't output the input config of the prepare_data pipeline, which is inconvenient for users. We're working on completely refactoring the two pipelines to be well integrated with Stopes, and on having filtering produce the input config of prepare_data. I'm sorry about that. In the meantime, you can look at the prepare_data input config format in its README and write a short script to create it from the filtered src/tgt files for all directions x data sources, as sketched below.
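A minimal sketch of such a script, assuming the filtering output is laid out as filtered_data/<src-tgt>/<corpus>.<lang> files (this layout is an assumption; adapt the globbing to however your filtering run wrote its output):

# Sketch of a config-generation script -- the directory layout below is an
# assumption; adapt it to the actual output of your filtering run.
from pathlib import Path

import yaml  # pip install pyyaml

FILTERED_DIR = Path("filtered_data")  # hypothetical root of the filtering output

train_corpora = {}
for direction_dir in sorted(FILTERED_DIR.iterdir()):   # e.g. filtered_data/eng-ibo/
    if not direction_dir.is_dir():
        continue
    direction = direction_dir.name                      # "eng-ibo"
    src_lang, tgt_lang = direction.split("-")
    values = {}
    for src_file in sorted(direction_dir.glob(f"*.{src_lang}")):
        corpus = src_file.stem                          # e.g. "public_bitext"
        tgt_file = direction_dir / f"{corpus}.{tgt_lang}"
        if not tgt_file.exists():
            continue
        values[corpus] = {
            "data_tag": None,
            "is_gzip": False,
            "num_lines": None,
            "source": str(src_file),
            "target": str(tgt_file),
        }
    if values:
        train_corpora[direction] = {"values": values}

config = {"train_corpora": train_corpora}  # build valid_corpora / test_corpora the same way
print(yaml.safe_dump(config, sort_keys=True))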