amazon-science / mm-cot

Official implementation for "Multimodal Chain-of-Thought Reasoning in Language Models" (stay tuned; more updates to come)
https://arxiv.org/abs/2302.00923
Apache License 2.0

[17:28:39] [Model]: Loading declare-lab/flan-alpaca-large... #62

Open Sosycs opened 1 year ago

Sosycs commented 1 year ago

Thanks for the great work. I am trying to run this paper's code on Google Colab with 166 GB of disk and a T4 GPU, but at the training stage, for both rationale generation and answer inference, I get the following output:

2023-09-29 17:27:49.955571: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
args Namespace(data_root='/content/mm-cot/data', output_dir='/content/mm-cot/experiments', model='declare-lab/flan-alpaca-large', options=['A', 'B', 'C', 'D', 'E'], epoch=50, lr=5e-05, bs=2, input_len=512, output_len=512, eval_bs=4, eval_acc=None, train_split='train', val_split='val', test_split='test', use_generate=True, final_eval=False, user_msg='rationale', img_type='vit', eval_le=None, test_le=None, evaluate_dir=None, caption_file='data/instruct_captions.json', use_caption=True, prompt_format='QCM-E', seed=42)
====Input Arguments====
{
  "data_root": "/content/mm-cot/data",
  "output_dir": "/content/mm-cot/experiments",
  "model": "declare-lab/flan-alpaca-large",
  "options": [
    "A",
    "B",
    "C",
    "D",
    "E"
  ],
  "epoch": 50,
  "lr": 5e-05,
  "bs": 2,
  "input_len": 512,
  "output_len": 512,
  "eval_bs": 4,
  "eval_acc": null,
  "train_split": "train",
  "val_split": "val",
  "test_split": "test",
  "use_generate": true,
  "final_eval": false,
  "user_msg": "rationale",
  "img_type": "vit",
  "eval_le": null,
  "test_le": null,
  "evaluate_dir": null,
  "caption_file": "data/instruct_captions.json",
  "use_caption": true,
  "prompt_format": "QCM-E",
  "seed": 42
}
img_features size:  torch.Size([11208, 145, 1024])
number of train problems: 12726

number of val problems: 4241

number of test problems: 4241

[17:28:39] [Model]: Loading declare-lab/flan-alpaca-large... 

Then the cell stops and the experiments folder is empty. Can anyone explain what the problem is? (I am still new to the field.)

cooelf commented 11 months ago

Hi, did you try a unit test to see whether it is possible to load the pre-trained model via Hugging Face?

My guess is that there is not enough memory to load the model.

from transformers import T5ForConditionalGeneration

# You may also try changing "declare-lab/flan-alpaca-large" to
# "declare-lab/flan-alpaca-base" to see if it goes well.
model = T5ForConditionalGeneration.from_pretrained("declare-lab/flan-alpaca-large")
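
If memory is the suspect, you can also check available RAM right before and after loading. A minimal sketch using psutil (preinstalled on Colab); the helper below is illustrative and not part of the mm-cot scripts:

import psutil
from transformers import T5ForConditionalGeneration

def print_available_ram(label):
    # Report currently available system RAM in GiB (illustrative helper).
    available_gib = psutil.virtual_memory().available / 2**30
    print(f"{label}: {available_gib:.1f} GiB available")

print_available_ram("before load")
model = T5ForConditionalGeneration.from_pretrained("declare-lab/flan-alpaca-large")
print_available_ram("after load")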

Sosycs commented 11 months ago

After running a unit test that loads the model from Hugging Face:

(…)an-alpaca-large/resolve/main/config.json: 100%
787/787 [00:00<00:00, 56.3kB/s]
model.safetensors: 100%
3.13G/3.13G [00:16<00:00, 261MB/s]
(…)arge/resolve/main/generation_config.json: 100%
142/142 [00:00<00:00, 13.6kB/s]

I have changed the model from large to base, but I encounter the same behavior:

2023-10-17 16:48:58.529434: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
args Namespace(data_root='/content/mm-cot/data', output_dir='/content/mm-cot/experiments', model='declare-lab/flan-alpaca-base', options=['A', 'B', 'C', 'D', 'E'], epoch=50, lr=5e-05, bs=2, input_len=512, output_len=512, eval_bs=4, eval_acc=None, train_split='train', val_split='val', test_split='test', use_generate=True, final_eval=False, user_msg='rationale', img_type='vit', eval_le=None, test_le=None, evaluate_dir=None, caption_file='data/instruct_captions.json', use_caption=True, prompt_format='QCM-E', seed=42)
====Input Arguments====
{
  "data_root": "/content/mm-cot/data",
  "output_dir": "/content/mm-cot/experiments",
  "model": "declare-lab/flan-alpaca-base",
  "options": [
    "A",
    "B",
    "C",
    "D",
    "E"
  ],
  "epoch": 50,
  "lr": 5e-05,
  "bs": 2,
  "input_len": 512,
  "output_len": 512,
  "eval_bs": 4,
  "eval_acc": null,
  "train_split": "train",
  "val_split": "val",
  "test_split": "test",
  "use_generate": true,
  "final_eval": false,
  "user_msg": "rationale",
  "img_type": "vit",
  "eval_le": null,
  "test_le": null,
  "evaluate_dir": null,
  "caption_file": "data/instruct_captions.json",
  "use_caption": true,
  "prompt_format": "QCM-E",
  "seed": 42
}
img_features size:  torch.Size([11208, 145, 1024])
number of train problems: 12726

number of val problems: 4241

number of test problems: 4241

[16:49:05] [Model]: Loading declare-lab/flan-alpaca-base...    

I am using a Google Colab T4 runtime with high RAM.

cooelf commented 4 months ago

The hang may actually be expected: after loading the model, the main process could still be processing the data (there is no log message that signals the completion of model loading).
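
For what it's worth, a minimal way to rule out a stall in model loading itself is to time the from_pretrained call in a separate cell; if it returns quickly, the apparent hang is in the data handling that follows. A sketch (the timing wrapper is illustrative, not part of the mm-cot scripts):

import time
from transformers import T5ForConditionalGeneration

start = time.time()
model = T5ForConditionalGeneration.from_pretrained("declare-lab/flan-alpaca-base")
# If this line prints, loading completed; a hang after this point would be
# in the data preprocessing rather than in loading the model.
print(f"[Model]: loaded in {time.time() - start:.1f}s")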