Added an `attention_type` parameter: `AttentionType.MULTI_QUERY` implements multi-query attention with minimal changes to the code, while `AttentionType.MULTI_QUERY_1` uses some reordering of dimensions from explorations with @harm-devries and `bmm` instead of `matmul`.
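For context, here is a minimal sketch of the difference between the two variants (tensor shapes and function names are illustrative assumptions, not the actual code from this change): the minimal-change variant broadcasts the single shared key head across all query heads with `matmul`, while the reordered variant folds the head dimension into the query's sequence dimension so the scores come from a single `bmm`.

```python
import torch

def mq_scores_matmul(query, key):
    # MULTI_QUERY-style: query (b, n_heads, s_q, d), shared key (b, s_k, d).
    # matmul broadcasts the single key head over all query heads.
    return torch.matmul(query, key.transpose(1, 2).unsqueeze(1))  # (b, n_heads, s_q, s_k)

def mq_scores_bmm(query, key):
    # MULTI_QUERY_1-style: fold the heads into the query's sequence
    # dimension so one batched bmm computes the scores for every head.
    b, n_heads, s_q, d = query.shape
    s_k = key.shape[1]
    q = query.transpose(1, 2).reshape(b, s_q * n_heads, d)
    scores = torch.bmm(q, key.transpose(1, 2))  # (b, s_q * n_heads, s_k)
    return scores.reshape(b, s_q, n_heads, s_k).transpose(1, 2)

# Sanity check: both variants produce the same attention scores.
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 16, 64)
assert torch.allclose(mq_scores_matmul(q, k), mq_scores_bmm(q, k), atol=1e-4)
```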
Profiling details from `print_details`:
```
-------------------- attention_type == AttentionType.MULTI_QUERY ---------------------
{'get_test_batch': 5.9604644775390625e-05, 'generate_text_batch': 18.453815460205078, 'input_batch_size': 8, 'input_batch_length': 16, 'max_gen_length': 1024, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'Tesla V100-PCIE-16GB-LS'}

-------------------- attention_type == AttentionType.MULTI_QUERY_1 ---------------------
{'get_test_batch': 4.172325134277344e-05, 'generate_text_batch': 15.190143346786499, 'input_batch_size': 8, 'input_batch_length': 16, 'max_gen_length': 1024, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'Tesla V100-PCIE-16GB-LS'}

-------------------- attention_type == AttentionType.MULTI_HEAD ---------------------
{'get_test_batch': 5.459785461425781e-05, 'generate_text_batch': 19.78107237815857, 'input_batch_size': 8, 'input_batch_length': 16, 'max_gen_length': 1024, 'num_beams': 1, 'do_sample': False, 'pad_token_id': 50256, 'dtype': torch.int64, 'device': device(type='cuda'), 'cuda_device_name': 'Tesla V100-PCIE-16GB-LS'}
```
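`MULTI_QUERY_1` comes out fastest here: 15.19 s for `generate_text_batch` vs 18.45 s for `MULTI_QUERY` and 19.78 s for `MULTI_HEAD`.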