facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

--shuffle flag doesn't work #2473

Closed sjkoo1989 closed 4 years ago

sjkoo1989 commented 4 years ago

Bug description: While training a poly-encoder by following the tutorial document, we found that the --shuffle true flag doesn't work (it is also not recognized by the argument parser). We also found that the class variable 'random' depends only on whether the mode is "train" or not.

Expected behavior: If --shuffle true is given, the 'random' variable should be set to True; otherwise it should be set to False.

Logs Please paste the command line output:

Main ParlAI Arguments:
  -o, --init-opt INIT_OPT
      Path to json file of options. Note: Further Command-line arguments
      override file-based options. (default: None)
  -v, --show-advanced-args
      Show hidden command line options (advanced users only) (default: False)
  -t, --task TASK
      ParlAI task(s), e.g. "babi:Task1" or "babi,cbt" (default: None)
  -dt, --datatype {train,train:stream,train:ordered,train:ordered:stream,train:stream:ordered,train:evalmode,train:evalmode:stream,train:evalmode:ordered,train:evalmode:ordered:stream,train:evalmode:stream:ordered,valid,valid:stream,test,test:stream}
      choose from: train, train:ordered, valid, test. to stream data add
      ":stream" to any option (e.g., train:stream). by default: train is random
      with replacement, valid is ordered, test is ordered. (default: train)
  -nt, --numthreads NUMTHREADS
      number of threads. Used for hogwild if batchsize is 1, else for number of
      threads in threadpool loading, (default: 1)
  -bs, --batchsize BATCHSIZE
      batch size for minibatch training schemes (default: 1)
  -dynb, --dynamic-batching {None,batchsort,full}
      Use dynamic batching (default: None)
  -dp, --datapath DATAPATH
      path to datasets, defaults to {parlai_dir}/data (default: None)

ParlAI Model Arguments:
  -m, --model MODEL
      the model class name. can match parlai/agents/<model> for agents in that
      directory, or can provide a fully specified module for `from X import Y`
      via `-m X:Y` (e.g. `-m parlai.agents.seq2seq.seq2seq:Seq2SeqAgent`)
      (default: None)
  -mf, --model-file MODEL_FILE
      model file name for loading and saving models (default: None)
  -im INIT_MODEL
      load model weights and dict from this file (default: None)

Training Loop Arguments:
  -et, --evaltask EVALTASK
      task to use for valid/test (defaults to the one used for training)
      (default: None)
  -eps, --num-epochs NUM_EPOCHS
  -ttim, --max-train-time MAX_TRAIN_TIME
  -vtim, --validation-every-n-secs VALIDATION_EVERY_N_SECS
      Validate every n seconds. Saves model to model_file (if set) whenever best
      val metric is found (default: -1)
  -stim, --save-every-n-secs SAVE_EVERY_N_SECS
      Saves the model to model_file.checkpoint after every n seconds (default
      -1, never). (default: -1)
  -sval, --save-after-valid SAVE_AFTER_VALID
      Saves the model to model_file.checkpoint after every validation (default
      False).
  -veps, --validation-every-n-epochs VALIDATION_EVERY_N_EPOCHS
      Validate every n epochs. Saves model to model_file (if set) whenever best
      val metric is found (default: -1)
  -vp, --validation-patience VALIDATION_PATIENCE
      number of iterations of validation where result does not improve before we
      stop training (default: 10)
  -vmt, --validation-metric VALIDATION_METRIC
      key into report table for selecting best validation (default: accuracy)
  -vmm, --validation-metric-mode {max,min}
      how to optimize validation metric (max or min) (default: None)
  -mcs, --metrics METRICS
      list of metrics to show/compute, e.g. all, default,or give a list split by
      , like ppl,f1,accuracy,hits@1,rouge,bleuthe rouge metrics will be computed
      as rouge-1, rouge-2 and rouge-l (default: default)

Tensorboard Arguments:
  -tblog, --tensorboard-log TENSORBOARD_LOG
      Tensorboard logging of metrics, default is False

TorchAgent Arguments:
  -i, --interactive-mode INTERACTIVE_MODE
      Whether in full interactive mode or not, which means generating text or
      retrieving from a full set of candidates, which is necessary to actually
      do full dialogue. However, during training or quick validation (e.g. PPL
      for generation or ranking a few candidates for ranking models) you might
      want these set to off. Typically, scripts can set their preferred default
      behavior at the start, e.g. eval scripts. (default: False)
  -emb, --embedding-type {random,glove,glove-fixed,fasttext,fasttext-fixed,fasttext_cc,fasttext_cc-fixed}
      Choose between different strategies for initializing word embeddings.
      Default is random, but can also preinitialize from Glove or Fasttext.
      Preinitialized embeddings can also be fixed so they are not updated during
      training. (default: random)
  -embp, --embedding-projection EMBEDDING_PROJECTION
      If pretrained embeddings have a different dimensionality than your
      embedding size, strategy for projecting to the correct size. If the
      dimensions are the same, this is ignored unless you append "-force" to
      your choice. (default: random)
  --fp16 FP16
      Use fp16 computations. (default: False)
  --fp16-impl {apex,mem_efficient}
      Implementation of FP16 to use (default: apex)
  -rc, --rank-candidates RANK_CANDIDATES
      Whether the model should parse candidates for ranking. (default: False)
  -tr, --truncate TRUNCATE
      Truncate input lengths to increase speed / use less memory. (default:
      1024)
  --text-truncate TEXT_TRUNCATE
      Text input truncation length: if not specified, this will default to
      `truncate` (default: None)
  --label-truncate LABEL_TRUNCATE
      Label truncation length: if not specified, this will default to `truncate`
      (default: None)
  -histsz, --history-size HISTORY_SIZE
      Number of past dialog utterances to remember. (default: -1)
  -pt, --person-tokens PERSON_TOKENS
      add person tokens to history. adds __p1__ in front of input text and
      __p2__ in front of past labels when available or past utterances generated
      by the model. these are added to the dictionary during initialization.
      (default: False)
  --split-lines SPLIT_LINES
      split the dialogue history on newlines and save in separate vectors
      (default: False)
  --delimiter DELIMITER
      Join history lines with this token, defaults to newline (default: )
  -gpu, --gpu GPU
      which GPU to use (default: -1)
  --no-cuda
      disable GPUs even if available. otherwise, will use GPUs if available on
      the device. (default: False)

Optimizer Arguments:
  -opt, --optimizer {adadelta,adagrad,adam,adamw,sparseadam,adamax,asgd,sgd,rprop,rmsprop,optimizer,lbfgs,mem_eff_adam,adafactor}
      Choose between pytorch optimizers. Any member of torch.optim should be
      valid. (default: adamax)
  -lr, --learningrate LEARNINGRATE
      Learning rate (default: 0.0001)
  -clip, --gradient-clip GRADIENT_CLIP
      gradient clipping using l2 norm (default: 0.1)
  --adafactor-eps ADAFACTOR_EPS
      Epsilon values for adafactor optimizer: regularization constants for
      square gradient and parameter scale respectively (default: 1e-30,1e-3)
  -mom, --momentum MOMENTUM
      if applicable, momentum value for optimizer. (default: 0)
  --nesterov NESTEROV
      if applicable, whether to use nesterov momentum. (default: True)
  -nu, --nus NUS
      if applicable, nu value(s) for optimizer. can use a single value like 0.7
      or a comma-separated tuple like 0.7,1.0 (default: 0.7)
  -beta, --betas BETAS
      if applicable, beta value(s) for optimizer. can use a single value like
      0.9 or a comma-separated tuple like 0.9,0.999 (default: 0.9,0.999)
  -wdecay, --weight-decay WEIGHT_DECAY
      Weight decay on the weights. (default: None)

Learning Rate Scheduler:
  --lr-scheduler {reduceonplateau,none,fixed,invsqrt,cosine,linear}
      Learning rate scheduler. (default: reduceonplateau)
  --lr-scheduler-patience LR_SCHEDULER_PATIENCE
      LR scheduler patience. In number of validation runs. If using fixed
      scheduler, LR is decayed every <patience> validations. (default: 3)
  --lr-scheduler-decay LR_SCHEDULER_DECAY
      Decay factor for LR scheduler, or how much LR is multiplied by when it is
      lowered. (default: 0.5)
  --max-lr-steps MAX_LR_STEPS
      Number of train steps the scheduler should take after warmup. Training is
      terminated after this many steps. This should only be set for --lr-
      scheduler cosine or linear (default: -1)
  --invsqrt-lr-decay-gamma INVSQRT_LR_DECAY_GAMMA
      Constant used only to find the lr multiplier for the invsqrt scheduler.
      Must be set for --lr-scheduler invsqrt (default: -1)

TorchRankerAgent:
  -cands, --candidates {batch,inline,fixed,batch-all-cands}
      The source of candidates during training (see
      TorchRankerAgent._build_candidates() for details). (default: inline)
  -ecands, --eval-candidates {batch,inline,fixed,vocab,batch-all-cands}
      The source of candidates during evaluation (defaults to the samevalue as
      --candidates if no flag is given) (default: inline)
  --repeat-blocking-heuristic REPEAT_BLOCKING_HEURISTIC
      Block repeating previous utterances. Helpful for many models that score
      repeats highly, so switched on by default. (default: True)
  -fcp, --fixed-candidates-path FIXED_CANDIDATES_PATH
      A text file of fixed candidates to use for all examples, one candidate per
      line (default: None)
  --fixed-candidate-vecs FIXED_CANDIDATE_VECS
      One of "reuse", "replace", or a path to a file with vectors corresponding
      to the candidates at --fixed-candidates-path. The default path is a
      /path/to/model-file.<cands_name>, where <cands_name> is the name of the
      file (not the full path) passed by the flag --fixed-candidates-path. By
      default, this file is created once and reused. To replace it, use the
      "replace" option. (default: reuse)
  --encode-candidate-vecs ENCODE_CANDIDATE_VECS
      Cache and save the encoding of the candidate vecs. This might be used when
      interacting with the model in real time or evaluating on fixed candidate
      set when the encoding of the candidates is independent of the input.
      (default: True)
  --init-model INIT_MODEL
      Initialize model with weights from this file. (default: None)
  --train-predict TRAIN_PREDICT
      Get predictions and calculate mean rank during the train step. Turning
      this on may slow down training. (default: False)
  --cap-num-predictions CAP_NUM_PREDICTIONS
      Limit to the number of predictions in output.text_candidates (default:
      100)
  --ignore-bad-candidates IGNORE_BAD_CANDIDATES
      Ignore examples for which the label is not present in the label
      candidates. Default behavior results in RuntimeError. (default: False)
  --rank-top-k RANK_TOP_K
      Ranking returns the top k results of k > 0, otherwise sorts every single
      candidate according to the ranking. (default: -1)
  --inference {topk,max}
      Final response output algorithm (default: max)
  --topk TOPK
      K used in Top K sampling inference, when selected (default: 5)

Transformer Arguments:
  -esz, --embedding-size EMBEDDING_SIZE
      Size of all embedding layers (default: 300)
  -nl, --n-layers N_LAYERS
  -hid, --ffn-size FFN_SIZE
      Hidden size of the FFN layers (default: 300)
  --dropout DROPOUT
      Dropout used in Vaswani 2017. (default: 0.0)
  --attention-dropout ATTENTION_DROPOUT
      Dropout used after attention softmax. (default: 0.0)
  --relu-dropout RELU_DROPOUT
      Dropout used after ReLU. From tensor2tensor. (default: 0.0)
  --n-heads N_HEADS
      Number of multihead attention heads (default: 2)
  --learn-positional-embeddings LEARN_POSITIONAL_EMBEDDINGS
  --embeddings-scale EMBEDDINGS_SCALE
  --n-segments N_SEGMENTS
      The number of segments that support the model. If zero no segment and no
      langs_embedding. (default: 0)
  --variant {aiayn,xlm}
      Chooses locations of layer norms, etc. (default: aiayn, recommended: xlm)
  --activation {relu,gelu}
      Nonlinear activation to use. AIAYN uses relu, but more recent papers
      prefer gelu. (default: relu, recommended: gelu)
  --output-scaling OUTPUT_SCALING
      scale the output of every transformer by this quantity. (default: 1.0)
  -nel, --n-encoder-layers N_ENCODER_LAYERS
      This will overide the n-layers for asymmetrical transformers (default: -1)
  -ndl, --n-decoder-layers N_DECODER_LAYERS
      This will overide the n-layers for asymmetrical transformers (default: -1)
  --use-memories USE_MEMORIES
      use memories: must implement the function `_vectorize_memories` to use
      this (default: False)
  --wrap-memory-encoder WRAP_MEMORY_ENCODER
      wrap memory encoder with MLP (default: False)
  --memory-attention {cosine,dot,sqrt}
      similarity for basic attention mechanism when using transformer to encode
      memories (default: sqrt)
  --normalize-sent-emb NORMALIZE_SENT_EMB
  --share-encoders SHARE_ENCODERS
  --learn-embeddings LEARN_EMBEDDINGS
      learn embeddings (default: True)
  --data-parallel DATA_PARALLEL
      use model in data parallel, requires multiple gpus (default: False)
  --reduction-type {first,max,mean}
      Type of reduction at the end of transformer (default: mean)

Polyencoder Arguments:
  --polyencoder-type {codes,n_first}
      Type of polyencoder, either we computevectors using codes + attention, or
      we simply take the first N vectors. (default: codes)
  --poly-n-codes POLY_N_CODES
      number of vectors used to represent the contextin the case of n_first,
      those are the numberof vectors that are considered. (default: 64)
  --poly-attention-type {basic,sqrt,multihead}
      Type of the top aggregation layer of the poly-encoder (where the candidate
      representation isthe key) (default: basic)
  --polyencoder-attention-keys {context,position}
      Input emb vectors for the first level of attention. Context refers to the
      context outputs; position refers to the computed position embeddings.
      (default: context)
  --poly-attention-num-heads POLY_ATTENTION_NUM_HEADS
      In case poly-attention-type is multihead, specify the number of heads
      (default: 4)
  --codes-attention-type {basic,sqrt,multihead}
      Type (default: basic)
  --codes-attention-num-heads CODES_ATTENTION_NUM_HEADS
      In case codes-attention-type is multihead, specify the number of heads
      (default: 4)

Parse Error: unrecognized arguments: --shuffle false

Process finished with exit code 2

Additional context: We cloned and tested the master branch.
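
For readers wondering why the run ends with exit code 2: that is the usual behavior of an argparse-style parser when it meets a flag it does not define. The sketch below uses only Python's standard argparse, not ParlAI's own parser class; the program name `train_sketch` and the helper `try_parse` are hypothetical and exist only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in parser that defines two of the flags from the
    # help output above, and deliberately no --shuffle flag.
    parser = argparse.ArgumentParser(prog="train_sketch")
    parser.add_argument("-t", "--task", default=None)
    parser.add_argument("-dt", "--datatype", default="train")
    return parser


def try_parse(argv):
    """Return (exit_code, namespace); namespace is None if parsing failed."""
    parser = build_parser()
    try:
        return 0, parser.parse_args(argv)
    except SystemExit as exc:
        # argparse reports unrecognized arguments on stderr and exits.
        return exc.code, None


if __name__ == "__main__":
    print(try_parse(["--shuffle", "false"])[0])
    print(try_parse(["-dt", "train:ordered"])[0])
```

Under these assumptions, any flag that has been removed from the parser fails before training starts, regardless of the value passed to it.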

stephenroller commented 4 years ago

Thanks for filing. This is Will Not Fix, unfortunately.

The --shuffle flag was always misleading (it caused shuffling in a few obscure places, but not the places you would expect). That's why it was removed. If you're copying it from a command, drop it and please report where you got the command from.

Randomization is always off in validation/test to enforce determinism of test results. If you want determinism in training, you can use -dt train:stream.
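
The rule described in this thread (randomization depends only on the datatype) can be sketched as follows. This is an illustrative model, not ParlAI's implementation: the function names `uses_random_order` and `episode_order` are hypothetical, and treating every datatype other than the bare "train" as ordered is an assumption based on the help text above ("train is random with replacement, valid is ordered, test is ordered") and on the comment about -dt train:stream.

```python
import random


def uses_random_order(datatype: str) -> bool:
    # Assumption: only the bare "train" datatype samples randomly; variants
    # such as "train:ordered" or "train:stream", and valid/test, do not.
    return datatype == "train"


def episode_order(datatype: str, num_episodes: int, num_draws: int, seed: int = 0):
    """Return the episode indices visited for the given datatype."""
    if uses_random_order(datatype):
        # Random with replacement, as the --datatype help text describes.
        rng = random.Random(seed)
        return [rng.randrange(num_episodes) for _ in range(num_draws)]
    # Ordered: walk the data front to back, wrapping around at the end.
    return [i % num_episodes for i in range(num_draws)]


if __name__ == "__main__":
    for dt in ("train", "train:ordered", "valid"):
        print(dt, episode_order(dt, num_episodes=3, num_draws=5))
```

In this model there is no separate shuffle switch: choosing the datatype is the only way to move between random and ordered iteration.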

ParlAI's method is very confusing, but I don't see a way of changing it without potentially breaking a lot of historic code. We face a very difficult balance between maintaining the code of older works, and bettering the platform for the future.

stephenroller commented 4 years ago

(The thing I am changing is updating the docs to remove references to --shuffle)

sjkoo1989 commented 4 years ago

Thanks for your kind response!