FMInference / DejaVu

AttributeError: 'OPTAttention' object has no attribute 'fp_query' #4

Closed (bilgeacun closed this issue 11 months ago)

bilgeacun commented 11 months ago

I ran into the error below when collecting training data with the run_infer_opt_175b_collect_sp_data.sh script.

The last output printed is:

<inference_batch> rank-<0> enter computation!
<inference_batch> rank-<1> enter computation!
Compute prompt seq< 0 >.
Compute prompt seq< 0 >.

Here is the error:

Traceback (most recent call last):
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/dist_inference_runner.py", line 111, in <module>
    main()
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/dist_inference_runner.py", line 97, in main
    distributed_inference_mask_iter(args, pipe, device, request_processor)
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/utils/dist_inference_utils.py", line 58, in distributed_inference_mask_iter
    current_iter_time = pipeline.inference_batch(input_ids, output_ids_list, attention_mask=attention_mask)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 822, in inference_batch
    self.forward_seq_pipeline_stage(
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 652, in forward_seq_pipeline_stage
    self._forward_compute_prompt_seq(
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 455, in _forward_compute_prompt_seq
    current_emb, caches[layer_index] = self.layers[
                                       ^^^^^^^^^^^^
  File "/private/home/acun/.conda/envs/dejavu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/modules/hf_opt_module_save.py", line 508, in forward
    hidden_states, _, present = self.self_attn(
                                ^^^^^^^^^^^^^^^
  File "/private/home/acun/.conda/envs/dejavu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/modules/hf_opt_module_save.py", line 237, in forward
    if self.fp_i < self.fp_query.shape[0]:
                   ^^^^^^^^^^^^^
  File "/private/home/acun/.conda/envs/dejavu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'OPTAttention' object has no attribute 'fp_query'

Any idea what could be causing this?
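
For readers hitting the same message: the final frames show `self.fp_query` being read in `forward` on an `OPTAttention` instance that has no such attribute. The following is only a minimal sketch of that failure mode, not the repository's code. It assumes the attribute is assigned on just one setup path, which the thread does not confirm. The `Attention` class, the `query_path` argument, and the `can_record` and `probe` helpers are all hypothetical, and a plain Python class stands in for the torch module.

```python
class Attention:
    """Hypothetical stand-in for a module whose buffer is created conditionally."""

    def __init__(self, query_path=None):
        self.fp_i = 0
        # Assumption for illustration: fp_query only exists when a path is given.
        if query_path is not None:
            self.fp_query = [[0.0] * 4 for _ in range(8)]  # placeholder buffer

    def can_record(self):
        # getattr with a default avoids AttributeError when the buffer was never created.
        buf = getattr(self, "fp_query", None)
        return buf is not None and self.fp_i < len(buf)


def probe(module):
    """Return the message raised by unguarded access to fp_query, or None."""
    try:
        module.fp_query
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling `probe(Attention())` returns a message of the same form as the one in the traceback, while `Attention().can_record()` returns `False` instead of raising.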

bilgeacun commented 11 months ago

@lzcemma We're trying to get a comparison with DejaVu for a paper submission and have limited time. We would greatly appreciate it if you could take a look at this issue soon.

lzcemma commented 11 months ago

I just updated the repo. Can you check if it works for you?

bilgeacun commented 11 months ago

@lzcemma Thanks for the fix. I am now getting a different error, a shape mismatch:

Compute generate token step < 0 >.
Compute generate token step < 1 >.

IndexError: The shape of the mask [2039] at index 0 does not match the shape of the indexed tensor [1, 2048] at index 0
Traceback (most recent call last):
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/dist_inference_runner.py", line 111, in <module>
    main()
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/dist_inference_runner.py", line 84, in main
    distributed_inference_mask_iter(args, pipe, device, request_processor)
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/utils/dist_inference_utils.py", line 73, in distributed_inference_mask_iter
    current_iter_time = pipeline.inference_batch(input_ids, output_ids_list, attention_mask=attention_mask)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 830, in inference_batch
    self.forward_new_token_pipeline_stage(attention_mask=attention_mask)
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 711, in forward_new_token_pipeline_stage
    self.forward_new_token_pipeline_step(step, attention_mask=attention_mask)
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 756, in forward_new_token_pipeline_step
    self._forward_compute_generate_token(i, mask=attention_masks[i])
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/pipeline_parallel/dist_pipeline_inference_mask_greedy_token_pipe_sync.py", line 567, in _forward_compute_generate_token
    current_emb, cache = self.layers["block" + str(layer_index)](
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/acun/.conda/envs/dejavu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/home/acun/DejaVu/Decentralized_FM_alpha/modules/hf_opt_module_save.py", line 477, in forward
    _hidden_states = hidden_states.view(-1, hidden_states.size(-1))[
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: The shape of the mask [2039] at index 0 does not match the shape of the indexed tensor [1, 2048] at index 0

lzcemma commented 11 months ago

I suspect the generation length for the C4 data was not set to 0. I updated getdata.py accordingly.
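
A note on reading the IndexError above: when a tensor is indexed with a boolean mask, the mask's shape has to match the tensor's leading dimensions. Here the mask has 2039 entries along dimension 0, while the indexed tensor has only 1 row. The sketch below reproduces that rule on plain shape tuples so it runs without torch. `mask_mismatch` is a hypothetical helper, not part of DejaVu or PyTorch, and it assumes the mask has no more dimensions than the tensor.

```python
def mask_mismatch(mask_shape, tensor_shape):
    """Return the first dimension where a boolean mask and tensor disagree, or None.

    Assumes len(mask_shape) <= len(tensor_shape); only the leading
    dimensions of the tensor are compared against the mask.
    """
    for dim, (mask_size, tensor_size) in enumerate(zip(mask_shape, tensor_shape)):
        if mask_size != tensor_size:
            return dim
    return None


# Shapes taken from the error message: mask [2039] vs. tensor [1, 2048].
print(mask_mismatch((2039,), (1, 2048)))     # 0, the mismatching index in the message
# A mask with one entry per row would be accepted.
print(mask_mismatch((2039,), (2039, 2048)))  # None
```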

bilgeacun commented 11 months ago

Thanks, it seems to work now: run_infer_opt_175b_collect_sp_data.sh completed and I get the mmap outputs.