facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Why are training options available in `fairseq-preprocess`? #1713

Closed erip closed 4 years ago

erip commented 4 years ago

❓ Questions and Help

What is your question?

Some of the CLI options exposed for `fairseq-preprocess`, `fairseq-train`, and so on seem inappropriate. For example, AMP/FP16 loss-scaling options, criterion choices, etc. seem like they should only be available at training time, yet they appear in the `fairseq-preprocess` help output below:

$ fairseq-preprocess --help
usage: fairseq-preprocess [-h] [--no-progress-bar] [--log-interval N]
                          [--log-format {json,none,simple,tqdm}]
                          [--tensorboard-logdir DIR] [--seed N] [--cpu]
                          [--fp16] [--memory-efficient-fp16]
                          [--fp16-no-flatten-grads]
                          [--fp16-init-scale FP16_INIT_SCALE]
                          [--fp16-scale-window FP16_SCALE_WINDOW]
                          [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                          [--min-loss-scale D]
                          [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                          [--user-dir USER_DIR]
                          [--empty-cache-freq EMPTY_CACHE_FREQ]
                          [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                          [--criterion {cross_entropy,adaptive_loss,legacy_masked_lm_loss,nat_loss,label_smoothed_cross_entropy,composite_loss,binary_cross_entropy,sentence_prediction,label_smoothed_cross_entropy_with_alignment,masked_lm,sentence_ranking}]
                          [--tokenizer {nltk,space,moses}]
                          [--bpe {sentencepiece,fastbpe,gpt2,subword_nmt,bert}]
                          [--optimizer {nag,adafactor,sgd,adamax,adagrad,adam,lamb,adadelta}]
                          [--lr-scheduler {fixed,reduce_lr_on_plateau,polynomial_decay,inverse_sqrt,tri_stage,cosine,triangular}]
                          [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP]
                          [--validpref FP] [--testpref FP] [--align-suffix FP]
                          [--destdir DIR] [--thresholdtgt N]
                          [--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
                          [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN]
                          [--dataset-impl FORMAT] [--joined-dictionary]
                          [--only-source] [--padding-factor N] [--workers N]

optional arguments:
  -h, --help            show this help message and exit
  --no-progress-bar     disable progress bar
  --log-interval N      log progress every N batches (when progress bar is
                        disabled)
  --log-format {json,none,simple,tqdm}
                        log format to use
  --tensorboard-logdir DIR
                        path to save logs for tensorboard, should match
                        --logdir of running tensorboard (default: no
                        tensorboard logging)
  --seed N              pseudo random number generator seed
  --cpu                 use CPU instead of CUDA
  --fp16                use FP16
  --memory-efficient-fp16
                        use a memory-efficient version of FP16 training;
                        implies --fp16
  --fp16-no-flatten-grads
                        don't flatten FP16 grads tensor
  --fp16-init-scale FP16_INIT_SCALE
                        default FP16 loss scale
  --fp16-scale-window FP16_SCALE_WINDOW
                        number of updates before increasing loss scale
  --fp16-scale-tolerance FP16_SCALE_TOLERANCE
                        pct of updates that can overflow before decreasing the
                        loss scale
  --min-loss-scale D    minimum FP16 loss scale, after which training is
                        stopped
  --threshold-loss-scale THRESHOLD_LOSS_SCALE
                        threshold FP16 loss scale from below
  --user-dir USER_DIR   path to a python module containing custom extensions
                        (tasks and/or architectures)
  --empty-cache-freq EMPTY_CACHE_FREQ
                        how often to clear the PyTorch CUDA cache (0 to
                        disable)
  --all-gather-list-size ALL_GATHER_LIST_SIZE
                        number of bytes reserved for gathering stats from
                        workers
  --criterion {cross_entropy,adaptive_loss,legacy_masked_lm_loss,nat_loss,label_smoothed_cross_entropy,composite_loss,binary_cross_entropy,sentence_prediction,label_smoothed_cross_entropy_with_alignment,masked_lm,sentence_ranking}
  --tokenizer {nltk,space,moses}
  --bpe {sentencepiece,fastbpe,gpt2,subword_nmt,bert}
  --optimizer {nag,adafactor,sgd,adamax,adagrad,adam,lamb,adadelta}
  --lr-scheduler {fixed,reduce_lr_on_plateau,polynomial_decay,inverse_sqrt,tri_stage,cosine,triangular}
  --task TASK           task
  --dataset-impl FORMAT
                        output dataset implementation

Preprocessing:
  -s SRC, --source-lang SRC
                        source language
  -t TARGET, --target-lang TARGET
                        target language
  --trainpref FP        train file prefix
  --validpref FP        comma separated, valid file prefixes
  --testpref FP         comma separated, test file prefixes
  --align-suffix FP     alignment file suffix
  --destdir DIR         destination dir
  --thresholdtgt N      map words appearing less than threshold times to
                        unknown
  --thresholdsrc N      map words appearing less than threshold times to
                        unknown
  --tgtdict FP          reuse given target dictionary
  --srcdict FP          reuse given source dictionary
  --nwordstgt N         number of target words to retain
  --nwordssrc N         number of source words to retain
  --alignfile ALIGN     an alignment file (optional)
  --joined-dictionary   Generate joined dictionary
  --only-source         Only process the source language
  --padding-factor N    Pad dictionary size to be multiple of N
  --workers N           number of parallel workers

Code

N/A

What have you tried?

N/A

What's your environment?

lematt1991 commented 4 years ago

Not sure why --threshold-loss-scale and --min-loss-scale are in get_parser; maybe @myleott knows? The others that seem like they don't belong are included because they use the registry. Is this something that is blocking you in some way or causing an issue? Otherwise I'd probably just leave it alone since, as you probably know from #1672, we are exploring the idea of an additional interface for the CLI.
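
[Editor's note] For readers unfamiliar with the registry mechanism mentioned above: the shared argument parser adds one flag per registered component type (criterion, optimizer, lr-scheduler, tokenizer, bpe), so every CLI entry point inherits those flags whether or not it uses them. Below is a minimal sketch of that pattern; the names, defaults, and choices are illustrative, not fairseq's actual registry code.

import argparse

# Each registry maps a component type to its registered choices and a default.
# (Illustrative subset; fairseq's real registries hold many more entries.)
REGISTRIES = {
    "criterion": {"default": "cross_entropy", "choices": ["cross_entropy", "adaptive_loss"]},
    "optimizer": {"default": "nag", "choices": ["nag", "adam", "sgd"]},
    "lr_scheduler": {"default": "fixed", "choices": ["fixed", "inverse_sqrt"]},
}

def get_parser(desc):
    """Shared by every CLI entry point (preprocess, train, generate, ...)."""
    parser = argparse.ArgumentParser(description=desc)
    # Because every registry-backed flag is added to the common parser,
    # it shows up in fairseq-preprocess --help even though only
    # fairseq-train actually consumes it.
    for name, reg in REGISTRIES.items():
        parser.add_argument(
            "--" + name.replace("_", "-"),
            default=reg["default"],
            choices=reg["choices"],
        )
    return parser

if __name__ == "__main__":
    args = get_parser("preprocess").parse_args(["--optimizer", "adam"])
    print(args)  # criterion, lr_scheduler, and optimizer flags are all present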

erip commented 4 years ago

Not blocking me - mostly a curiosity thing.

lematt1991 commented 4 years ago

Yeah, I think --threshold-loss-scale, --min-loss-scale and maybe a few others could be moved out of get_parser, but those that appear due to the registry will need to stay. Feel free to submit a PR if you'd like.
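
[Editor's note] A minimal sketch of what such a PR might look like: move the training-only loss-scale flags out of the shared parser and into a training-specific argument group, so they disappear from fairseq-preprocess --help. The function names and flag defaults below are illustrative assumptions that mirror the structure described in the thread, not necessarily fairseq's actual options.py.

import argparse

def get_parser(desc):
    # Shared flags only: logging, seed, device, registry-backed choices, etc.
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument("--seed", type=int, default=1, metavar="N",
                        help="pseudo random number generator seed")
    return parser

def add_optimization_args(parser):
    # Training-only flags live here instead of in get_parser, so only
    # fairseq-train (which calls this helper) exposes them.
    group = parser.add_argument_group("Optimization")
    group.add_argument("--min-loss-scale", type=float, default=1e-4, metavar="D",
                       help="minimum FP16 loss scale, after which training is stopped")
    group.add_argument("--threshold-loss-scale", type=float, default=None,
                       help="threshold FP16 loss scale from below")
    return group

# fairseq-train would call both; fairseq-preprocess would call only get_parser.
train_parser = get_parser("train")
add_optimization_args(train_parser)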