The README for prepare_data (https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/prepare_data) answers this. Here's an example config:
binarization_config:
  binarize_workers: 60
  max_examples_per_shard: 500000000
  random_seed: 0
  smallest_shard: 500000
executor_config:
  cluster: local
  log_folder: executor_logs
preprocessing_config:
  max_tokens: null
  moses_config:
    deescape_special_chars: false
    lowercase: false
    normalize_punctuation: true
    remove_non_printing_chars: false
    script_directory: <PATH_TO_FAIRSEQ_DIR>/fairseq-py/examples/nllb/modeling/preprocessing/moses
  preprocess_source: true
  preprocess_target: true
  sample_size: null
  tag_data: true
source_vocab_config:
  pretrained: null
  vocab_build_params:
    character_coverage: 0.99995
    model_type: bpe
    random_seed: 0
    sampled_data_size: 10000000
    sampling_temperature: 1.0
    shuffle_input_sentence: true
    use_joined_data: true
    vocab_size: 8000
target_vocab_config:
  pretrained: null
  vocab_build_params:
    character_coverage: 0.99995
    model_type: bpe
    random_seed: 0
    sampled_data_size: 10000000
    sampling_temperature: 1.0
    shuffle_input_sentence: true
    use_joined_data: true
    vocab_size: 8000
test_corpora:
  eng-ibo:
    values:
      flores_devtest:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>
train_corpora:
  eng-ibo:
    values:
      public_bitext:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>
train_mining_corpora: null
train_mmt_bt_corpora: null
train_smt_bt_corpora: null
valid_corpora:
  eng-ibo:
    values:
      flores_dev:
        data_tag: null
        is_gzip: false
        num_lines: null
        source: <PATH>
        target: <PATH>
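This config is what gets passed to prepare_data.py. As a minimal sketch of the invocation (the flag name can differ between stopes versions, so check the README; --data-config is the flag mentioned later in this thread, and prepare_data.yaml is a hypothetical file name):

python stopes/pipelines/prepare_data/prepare_data.py \
    --data-config prepare_data.yaml   # hypothetical file holding the YAML above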
The data_path format is detailed in the README; you need to organize your corpora files in a specific way for them to be read.
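For illustration (hypothetical paths and file names, not requirements of the pipeline), the source/target fields in the config would typically point at aligned plain-text files, one sentence per line:

ls /data/eng-ibo/
#   public_bitext.eng-ibo.eng   public_bitext.eng-ibo.ibo     <- train_corpora source/target
#   flores_dev.eng-ibo.eng      flores_dev.eng-ibo.ibo        <- valid_corpora source/target
#   flores_devtest.eng-ibo.eng  flores_devtest.eng-ibo.ibo    <- test_corpora source/target

With is_gzip: true the files can be gzipped instead.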
Make sure you download the moses scripts into your fairseq directory. The pipeline runs only if this script exists:
examples/nllb/modeling/preprocessing/moses/clean-corpus-n.perl
(https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/prepare_data/encode_and_binarize.py#L78)
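If you don't have the moses scripts yet, one way to get them (a sketch assuming the standard moses-smt/mosesdecoder layout; the destination is whatever you set as moses_config.script_directory):

git clone https://github.com/moses-smt/mosesdecoder.git
MOSES_DIR=<PATH_TO_FAIRSEQ_DIR>/fairseq-py/examples/nllb/modeling/preprocessing/moses
mkdir -p $MOSES_DIR
cp mosesdecoder/scripts/tokenizer/normalize-punctuation.perl \
   mosesdecoder/scripts/tokenizer/remove-non-printing-char.perl \
   mosesdecoder/scripts/tokenizer/deescape-special-chars.perl \
   mosesdecoder/scripts/tokenizer/lowercase.perl \
   mosesdecoder/scripts/training/clean-corpus-n.perl \
   $MOSES_DIR/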
You need to run the filtering pipeline to filter out data based on the following heuristics: length, deduplication, LASER margin score threshold, LID score thresholds, and toxicity. It's not sufficient to just get the output of populate_data_conf.py and compute_length_factors.py; the output of those scripts is then passed into the filtering pipeline. This is detailed in the README here: https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/filtering
After filtering, you build a vocabulary (SentencePiece model) and then encode and binarize your data, which can then be fed into fairseq for training. The filtered datasets (output of the filtering pipeline) should be fed into the prepare_data pipeline.
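Roughly, the order of operations looks like this (a sketch, not exact commands; each pipeline's flags and configs are described in its own README):

# 1. Filter the raw bitext (length, dedup, LASER margin, LID, toxicity):
python stopes/pipelines/filtering/filter.py        # config per the filtering README
# 2. Point the prepare_data config's train/valid/test corpora at the filtered files,
#    then run prepare_data (as shown above) to build the vocab, encode and binarize.
# 3. Fine-tune with fairseq on the binarized output.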
@kauterry Thank you very much for your detailed answer. I tried to do what you said: first I ran the filter script, then I created a config.yaml file like the one you shared with me. But when I run prepare_data.py I get an error:
You said that the output of the filtering pipeline should be fed into the prepare_data pipeline, but I don't know where and how; for instance, the parameters of prepare_data.py are:
I'm sorry if I misunderstood, and thank you.
You shouldn't include perl in your moses script directory path; it should be the path to your moses preprocessing scripts.
Yes, you're right, thanks. But I still get the same error: "empty training data".
Can you ls your script_moses directory and show me the output?
Is it clear?
This file needs to exist at this exact path: moses_config.script_directory/normalize-punctuation.perl
Make sure you put that file in moses_config.script_directory and adjust this argument accordingly.
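A quick sanity check (assuming SCRIPT_DIR holds whatever you set moses_config.script_directory to):

SCRIPT_DIR=<your moses_config.script_directory>
test -f "$SCRIPT_DIR/normalize-punctuation.perl" && echo ok || echo missing
test -f "$SCRIPT_DIR/clean-corpus-n.perl" && echo ok || echo missing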
I've tried to do what you said, but I still get the same error. So I tried to trace the error to see what is causing the problem. I think the issue is that when prepare_data.py calls the validate_data_config function on line 297, that function seems to return empty folders. Do you think this is really the cause of the problem?
@kauterry Thank you very much. I have solved the problem and finished the first and second steps: filtering and data preparation.
Hey,
I want to fine-tune the NLLB-200 model. As instructed by the data README documentation, I have tried the filtering pipeline and got the output of populate_data_conf.py and compute_length_factors.py, but I don't know how to run the prepare_data pipeline, especially the --data-config parameter required by prepare_data.py. Could you provide an example? Thanks a lot.