Please modify the path with your actual path to the pre-trained BART model.
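For example (a sketch only; the path is wherever you actually downloaded the checkpoint), line 6 of train_single_view.sh would end up looking something like this, with plain ASCII quotes:

    # point this at your downloaded pre-trained BART checkpoint
    BART_PATH="./bart.large/model.pt"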
Thank you, it works. But when I train the Multi-View model I run into this error: OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 9.78 GiB already allocated; 21.75 MiB free; 9.97 GiB reserved in total by PyTorch)
The GPUs are two GeForce RTX 2080 Ti cards (11 GB each), and I set CUDA_VISIBLE_DEVICES=0,1, but it didn't work.
Maybe I need a GPU with more than 16 GB? But I don't understand why two GPUs with 22 GB in total did not work, oh my god.
Thanks for everything.
It means you need a GPU with more memory.
Multi-GPU training does not work that way. Please check the relevant docs (fairseq/PyTorch).
Multi-GPU can help you enlarge the batch size, but it cannot accommodate a longer max_seq_len in generation tasks. Basically, multi-GPU training assigns different training samples within a batch to different GPUs and then aggregates them, which achieves a larger effective batch size.
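To make that concrete, here is a rough sketch of a fairseq launch. The flags all appear in the usage output pasted later in this thread, but the values, the data directory, and the arch name are hypothetical, and this is not the repo's actual training script; the point is that the per-GPU limit is what matters for the OOM, not the summed memory of the two cards.

    # Sketch only: data-parallel training keeps a full copy of BART plus the
    # activations for its own shard of the batch on every GPU, so long sequences
    # can still OOM an 11 GB card no matter how many GPUs are visible.
    # Effective batch per update ~= max_tokens x distributed_world_size x update_freq
    CUDA_VISIBLE_DEVICES=0,1 python train.py ./data-bin \
      --restore-file "$BART_PATH" \
      --arch bart_large \
      --max-tokens 800 \
      --update-freq 4 \
      --distributed-world-size 2 \
      --fp16

If you are memory-bound, the usual knobs are lowering --max-tokens (or --max-sentences) and compensating with a larger --update-freq, or trying --memory-efficient-fp16.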
By the way, could this be applied to a Chinese dialogue dataset? (I see the author is from Zhejiang University.)
You could do that, but you need a Chinese-pre-trained model. Maybe mBART could work.
The script has BART_PATH= PATH-TO-BART-MODEL (./bart.large/model.pt). I changed it to BART_PATH= (./bart.large/model.pt), but why does bash train_single_view.sh show train.py: error: argument --restore-file: expected one argument? How should I fix it?
That's not how to change it; it should be BART_PATH=”./bart.large/model.pt”
train_single_view.sh: line 6: /content/drive/MyDrive/Multi-View-Seq2Seq/train_sh/bart.large/model.pt: Permission denied
usage: train.py [-h] [--no-progress-bar] [--log-interval N] [...] [--restore-file RESTORE_FILE] [...] [--patience N]
train.py: error: argument --restore-file: expected one argument
Still the same problem. Is there supposed to be another file path after BART_PATH=”./bart.large/model.pt”?
Did you download the BART model? You need to put it in the right place.
Thanks, it runs now.
| INFO | fairseq.trainer | no existing checkpoint found ”./bart.large/model.pt” — I put the downloaded model.pt at /content/drive/MyDrive/Multi-View-Seq2Seq/train_sh/bart.large/model.pt, so why does it say it can't be found?
And even when I switch to an absolute path, it still can't be found...
Emmm, it turns out I typed the file path with Chinese-style (full-width) quotation marks...
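For anyone hitting the same thing: bash treats the full-width quotation marks as ordinary characters, so they become part of the value and fairseq is told to restore a path that literally contains ” at both ends, which is exactly what the log above prints. A quick sanity check (hypothetical shell session):

    BART_PATH=”./bart.large/model.pt”   # full-width quotes are just characters to bash
    ls $BART_PATH                       # -> No such file or directory

    BART_PATH="./bart.large/model.pt"   # ASCII double quotes
    ls "$BART_PATH"                     # -> lists model.pt if the file is in place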
I encountered a small problem. After setting up all the resources, I found that I could not run the program. Could you please help me check it? Thank you.
When I train the Single-View model, I get this error:
./train_single_view.sh: line 6: syntax error near unexpected token `('
./train_single_view.sh: line 6: `BART_PATH= PATH-TO-BART-MODEL (./bart.large/model.pt)'
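For reference, that message comes from bash itself rather than from fairseq: line 6 still contains the placeholder, and an unquoted ( in that position is a bash syntax error. A minimal before/after illustration (the fix is the path edit described earlier in the thread):

    # placeholder as quoted in the error: the space after = and the unquoted
    # parentheses make bash stop with "syntax error near unexpected token `('"
    BART_PATH= PATH-TO-BART-MODEL (./bart.large/model.pt)

    # replaced with an actual quoted path, the line parses and fairseq gets its checkpoint
    BART_PATH="./bart.large/model.pt"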