MorinoseiMorizo / jparacrawl-finetune

An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/

AttributeError: 'NoneType' object has no attribute 'task' #11

Closed su0315 closed 1 year ago

su0315 commented 1 year ago

Hi! Thanks for publishing the example usage, it helps me a lot.

I am fine-tuning the JParaCrawl base model on the Business Scene Dialogue Corpus (https://github.com/tsuruoka-lab/BSD). I used the same parameters as your repo's fine-tuning .sh file (fine-tune_kftt_fp32.sh). However, it gives the error in the title above.

Here is my code in the .sh file:

FAIRSEQ=/home/sumire/miniconda3/envs/github_fairseq/lib/python3.8/site-packages/fairseq

SEED=1

PRETRAINED_MODEL_FILE=/home/sumire/main/NMT_models/jparacrawl/en-ja/base_en-ja/base/base.pretrain.pt 
MODEL_DIR=/home/sumire/main/NMT_models/jparacrawl/en-ja/context_model_ckpt/0-0
DATA_DIR=/home/sumire/main/contextual-mt/data/BSD-master/for_preprocess/bin/

# Training
######################################
python3 $FAIRSEQ/train.py $DATA_DIR \
    --restore-file $PRETRAINED_MODEL_FILE \
    --arch transformer \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 1.0 \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 4000 \
    --lr 0.001 \
    --min-lr 1e-09 \
    --dropout 0.3 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 2500 \
    --max-update 28000 \
    --save-dir $MODEL_DIR \
    --no-epoch-checkpoints \
    --save-interval 10000000000 \
    --validate-interval 1000000000 \
    --save-interval-updates 100 \
    --keep-interval-updates 8 \
    --log-format simple \
    --log-interval 5 \
    --ddp-backend no_c10d \
    --update-freq 32 \
    --seed $SEED 

Error Message: AttributeError: 'NoneType' object has no attribute 'task'

Environment

Additional context

After this bug, I also tried the latest fairseq version (0.12.2) with pip install, and replaced python3 $FAIRSEQ/train.py with fairseq-train, following the fairseq documentation (https://fairseq.readthedocs.io/en/latest/command_line_tools.html#fairseq-train).
Then it showed a different error: [Exception: Cannot load model parameters from checkpoint /home/sumire/main/NMT_models/jparacrawl/en-ja/small_en-ja/checkpoint_best.pt; please ensure that the architectures match.]

So I would like to know how to debug these two errors, one in each environment: [AttributeError: 'NoneType' object has no attribute 'task'] and [Exception: Cannot load model parameters from checkpoint /home/sumire/main/NMT_models/jparacrawl/en-ja/small_en-ja/checkpoint_best.pt; please ensure that the architectures match.]

Thanks in advance!

MorinoseiMorizo commented 1 year ago

Hi, Thank you for trying our examples.

Which version of the JParaCrawl pre-trained model are you using? We are now providing both 1.0 and 3.0. I'm very sorry that I totally forgot to write it in the README, but we trained the 3.0 models on a different version of fairseq. If you are using 3.0, then you should use fairseq at commit ce961a9fd26aef5130720cb6a171ddd5b51a8961.
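
For reference, a minimal sketch of installing fairseq at that commit (assuming a fresh Python environment; adjust paths to your own setup):

# Sketch: install the fairseq commit that matches the JParaCrawl 3.0 models.
git clone https://github.com/pytorch/fairseq.git
cd fairseq
git checkout ce961a9fd26aef5130720cb6a171ddd5b51a8961
pip install --editable ./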

Another possible reason you receive the error [Exception: Cannot load model parameters from checkpoint /home/sumire/main/NMT_models/jparacrawl/en-ja/small_en-ja/checkpoint_best.pt; please ensure that the architectures match.] is that you are trying to fine-tune the small model while specifying the model architecture --arch transformer. If you want to fine-tune the small model, the parameter should be --arch transformer_iwslt_de_en.
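
A minimal sketch of what that would look like, reusing the variables from your script; only --arch differs from your command, and the remaining flags (dropout, clip-norm, checkpointing, logging, etc.) can stay exactly as you had them:

# Sketch: fine-tune the small pre-trained model; the key change is the
# architecture flag, which must match the checkpoint being restored.
python3 $FAIRSEQ/train.py $DATA_DIR \
    --restore-file $PRETRAINED_MODEL_FILE \
    --arch transformer_iwslt_de_en \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 4000 \
    --lr 0.001 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 2500 \
    --save-dir $MODEL_DIR \
    --seed $SEED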

I'm happy to help you if you need any further assistance. Thank you.

su0315 commented 1 year ago

Hi, thank you for checking this, and sorry for the late reply! I am using version 3.0 of the JParaCrawl pre-trained model from this link (https://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/).

Good news: I tried the fairseq version and the transformer architecture that you specified, and the error was solved!

Now I'm just getting a RuntimeError: [RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:121, unhandled cuda error, NCCL version 2.14.3 ncclUnhandledCudaError: Call to CUDA function failed. Last error: Cuda failure 'out of memory']

I will look into it. Thanks a lot!

MorinoseiMorizo commented 1 year ago

That's good news! For the out-of-memory error, one possible solution is to reduce --max-tokens to a size that fits into your GPU memory. https://github.com/MorinoseiMorizo/jparacrawl-finetune/blob/master/en-ja/fine-tune_kftt_fp32.sh#L83
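
For example, a sketch of one adjustment to the script you posted (the exact value that fits depends on your GPU; doubling --update-freq when halving --max-tokens keeps the effective batch size roughly the same):

# Sketch: halve the per-GPU batch size and compensate with gradient accumulation.
    --max-tokens 1250 \
    --update-freq 64 \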

Hope it works.