Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Question about wandb: ERROR Error while calling W&B API. #219

Closed. lloo099 closed this issue 1 year ago.

lloo099 commented 1 year ago

Hi, I'm running into a problem with wandb. I am signed in to my wandb account and have created an open project here (https://wandb.ai/zhoutomas177/OTTER-LLaMA7B), but the following error appears when I run the code:

accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
pipeline/train/instruction_following.py \
--pretrained_model_name_or_path=/home/jjc/Otter/OTTER-LLaMA7B-Init  \
--mimicit_path="/home/jjc/Otter/mimic-it/mimicit_data/DC/DC_instructions.json" \
--images_path="/home/jjc/Otter/mimic-it/convert-it/datasets/coco2017/LA_pesudo.json" \
--train_config_path="/home/jjc/Otter/mimic-it/mimicit_data/DC/DC_train.json" \
--batch_size=2 \
--num_epochs=9 \
--report_to_wandb \
--wandb_entity=ntu-slab \
--run_name=OTTER-LLaMA7B-densecaption \
--wandb_project=OTTER-LLaMA7B \
--workers=1 \
--lr_scheduler=cosine \
--learning_rate=1e-5 \
--warmup_steps_ratio=0.01

The API error is here:

Extension horovod.torch has not been built: /home/jjc/miniconda3/envs/otter/lib/python3.9/site-packages/horovod/torch/mpi_lib_v2.cpython-39-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still available.
Loading pretrained model from /home/jjc/Otter/OTTER-LLaMA7B-Init
You are using a model of type flamingo to instantiate a model of type otter. This is not supported for all configurations of models and can yield errors.
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Using pad_token, but it is not set yet.
The current model version is configured for Otter-Image with max_num_frames set to None.
Trainable param: 1.44 B
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:11<00:00,  2.92s/it]
Start running training on rank 0.
Total training steps: 37035
wandb: Currently logged in as: zhoutomas177. Use `wandb login --relogin` to force relogin
wandb: ERROR Error while calling W&B API: permission denied (<Response [403]>)
Problem at: /home/jjc/Otter/pipeline/train/instruction_following.py 468 main
wandb: ERROR It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup.(Error 403: Forbidden)
Traceback (most recent call last):
  File "/home/jjc/Otter/pipeline/train/instruction_following.py", line 537, in <module>
    main()
  File "/home/jjc/Otter/pipeline/train/instruction_following.py", line 468, in main
    wandb.init(
  File "/home/jjc/miniconda3/envs/otter/lib/python3.9/site-packages/wandb/sdk/wandb_init.py", line 1173, in init
    raise e
  File "/home/jjc/miniconda3/envs/otter/lib/python3.9/site-packages/wandb/sdk/wandb_init.py", line 1154, in init
    run = wi.init()
  File "/home/jjc/miniconda3/envs/otter/lib/python3.9/site-packages/wandb/sdk/wandb_init.py", line 770, in init
    raise error
wandb.errors.CommError: It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup.(Error 403: Forbidden)
Traceback (most recent call last):
  File "/home/jjc/miniconda3/envs/otter/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/jjc/miniconda3/envs/otter/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/jjc/miniconda3/envs/otter/lib/python3.9/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/home/jjc/miniconda3/envs/otter/lib/python3.9/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/jjc/miniconda3/envs/otter/bin/python', 'pipeline/train/instruction_following.py', '--pretrained_model_name_or_path=/home/jjc/Otter/OTTER-LLaMA7B-Init', '--mimicit_path=/home/jjc/Otter/mimic-it/mimicit_data/DC/DC_instructions.json', '--images_path=/home/jjc/Otter/mimic-it/convert-it/datasets/coco2017/LA_pesudo.json', '--train_config_path=/home/jjc/Otter/mimic-it/mimicit_data/DC/DC_train.json', '--batch_size=2', '--num_epochs=9', '--report_to_wandb', '--wandb_entity=ntu-slab', '--run_name=OTTER-LLaMA7B-densecaption', '--wandb_project=OTTER-LLaMA7B', '--workers=1', '--lr_scheduler=cosine', '--learning_rate=1e-5', '--warmup_steps_ratio=0.01']' returned non-zero exit status 1.

Has anyone encountered this issue and can help me? Thanks.

Luodian commented 1 year ago

I think you can replace --wandb_entity with your own personal entity; you can refer to the wandb documentation for details. We use ntu-slab since it's our team's project. For personal use, set it to your own username, as in the sketch below.

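For reference, the 403 comes from the wandb.init call at line 468 of instruction_following.py (the "Problem at" line in the log above): wandb tries to create a run under the entity given by --wandb_entity, and a personal account cannot write into the ntu-slab team entity. A minimal sketch of the fix on the Python side (the values mirror the flags above; pass your own username, or omit entity so wandb falls back to your default):

import wandb

# wandb.init fails with a 403 when `entity` names a team you are not a
# member of; point it at your own account instead (or omit it entirely
# so wandb uses the default entity of the logged-in user).
wandb.init(
    project="OTTER-LLaMA7B",
    entity="zhoutomas177",  # your own wandb username, not "ntu-slab"
    name="OTTER-LLaMA7B-densecaption",
)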

lloo099 commented 1 year ago

Thank you, the solution you provided worked. However, I am now running into CUDA memory issues: I am running the OTTER-LLaMA7B model on three 3090 GPUs, but it runs out of memory even though I have already reduced the batch size and am using bf16. Do you happen to know how I can make it fit on three 3090s?

accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
pipeline/train/instruction_following.py \
--pretrained_model_name_or_path=/home/jjc/Otter/OTTER-LLaMA7B-Init  \
--mimicit_path="/home/jjc/Otter/mimic-it/mimicit_data/LA/LACONV_instructions.json" \
--images_path="/home/jjc/Otter/mimic-it/convert-it/output/LA.json" \
--train_config_path="/home/jjc/Otter/mimic-it/mimicit_data/LA/LACONV_train.json" \
--batch_size=1 \
--num_epochs=1 \
--report_to_wandb \
--wandb_entity=hkujjc \
--run_name=OTTER-LLaMA7B-densecaption \
--wandb_project=OTTER-LLaMA7B \
--workers=1 \
--lr_scheduler=cosine \
--learning_rate=1e-5 \
--warmup_steps_ratio=0.01

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 23.70 GiB total capacity; 22.31 GiB already allocated; 5.75 MiB free; 22.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
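As a side note, the allocator hint at the end of that OOM message is worth trying first: it only mitigates fragmentation (the case where reserved memory is much larger than allocated memory), not a genuine capacity shortfall on 24 GiB cards. A sketch of applying it, assuming it runs before torch initializes CUDA (e.g. at the top of instruction_following.py, or exported in the shell before accelerate launch):

import os

# Must be set before the first CUDA allocation; caps the size of cached
# allocator blocks so fragmented free memory can be reused.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")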

Luodian commented 1 year ago

Can you make sure you modified the OtterForConditionalGeneration-related code in instruction_following.py?

You need to pass **precision args so the model is loaded in bf16.

You can search the project for **precision to see how we handle it; see the sketch below.
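For anyone hitting the same thing, the pattern described above is roughly the following sketch. The import path and the exact contents of the precision dict are assumptions here, not the repo's literal code; the point is that a dict is splatted into from_pretrained as **precision:

import torch
# NOTE: the import path below is an assumption; check the repo for the actual module.
from otter.modeling_otter import OtterForConditionalGeneration

# Kwargs splatted into from_pretrained as **precision so the weights are
# materialized in bf16 instead of the default fp32.
precision = {"torch_dtype": torch.bfloat16}

model = OtterForConditionalGeneration.from_pretrained(
    "/home/jjc/Otter/OTTER-LLaMA7B-Init",
    **precision,
)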

lloo099 commented 1 year ago

Cool, thanks. I can load the 16-bit model following your **precision settings, but the input dtype doesn't match the 16-bit weights, even though I already changed the dtype in autocast() and cast().

Luodian commented 1 year ago

You could also follow our demo code to set the dtype of the input tensors; see the sketch below.
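In other words, floating-point inputs (the vision tensor) need to be cast to the same dtype as the bf16 weights before the forward pass, while integer token ids stay as they are. A small hypothetical helper along those lines (the function name and the dict-shaped batch are assumptions, not the demo's exact code):

import torch

def cast_inputs_to_model_dtype(batch: dict, dtype: torch.dtype = torch.bfloat16) -> dict:
    """Cast floating-point tensors (e.g. vision inputs) to the model's
    dtype; leave integer tensors such as token ids untouched."""
    return {
        k: v.to(dtype=dtype) if torch.is_tensor(v) and v.is_floating_point() else v
        for k, v in batch.items()
    }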

lloo099 commented 1 year ago

Thanks