ELS-RD / kernl

Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
http://www.kernl.ai
Apache License 2.0
1.53k stars 95 forks

I run the bert e2e example, if batch is not 1, I get an error!!! #286

Closed lichun-wang closed 1 year ago

lichun-wang commented 1 year ago

Description

I ran the BERT e2e example from the tutorial and hit a RuntimeError.

When I run the original code it works, but if I change the batch dim from 1 to 4 in this line: 'shapes = [(1, w) for w in range(8, 128 + 8, 8)]', I get the error below:

RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.

Example code below:

import time

import torch

from kernl.model_optimization import optimize_model

# `model` and `model_opt` are the baseline and to-be-optimized BERT models
# loaded earlier in the tutorial
optimize_model(model_opt)
start = time.perf_counter()
shapes = [(1, w) for w in range(8, 128 + 8, 8)]  # when I change 1 to 4, I get the error!!!
with torch.inference_mode(), torch.cuda.amp.autocast(enabled=True, dtype=torch.float16, cache_enabled=True):
    for s in shapes:
        inputs = {
            "input_ids": torch.ones(s, device="cuda", dtype=torch.long),
            "attention_mask": torch.ones(s, device="cuda", dtype=torch.long),
        }
        _ = model_opt(**inputs)  # optimized model
        _ = model(**inputs)      # baseline model

print(f"{time.perf_counter() - start:.0f}s")

Steps to reproduce

all in description

Expected Behavior

all in description

Actual Behavior

all in description

Your environment


pommedeterresautee commented 1 year ago

Thank you, I can reproduce. Seems related to the new CG pool (copy_to_pool); will look into it.

    def copy_to_pool(self, t: torch.Tensor) -> torch.Tensor:
        """
        Copy the tensor t in the pool and return a tensor that is a view of the pool.
        :param t: tensor to copy in the pool
        :return: tensor copy (that is a view of the pool)
        """

        assert t.device == self.pool.device
        assert self.can_store(t)
        # 64 bits alignment
        tensor_aligned_size = get_aligned_size(t)
        new_offset = self.offset + tensor_aligned_size
        # offset is expressed in t.dtype number of elements
        new_t = torch.as_strided(
            self.pool.view(t.dtype), size=t.size(), stride=t.stride(), storage_offset=self.offset // t.element_size()
        )
        print(f"t info: {t.size()}, {t.stride()}, {t.element_size()}, {len(t.untyped_storage())}")
        print(f"new_t info: {new_t.size()}, {new_t.stride()}, {new_t.element_size()}, {len(new_t.untyped_storage())}")
        new_t.copy_(t)
        self.offset = new_offset
        return new_t
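One way to sidestep the problem (a sketch only, not necessarily what the actual fix does) is to build the pool view with freshly computed contiguous strides instead of reusing t.stride() when it contains a 0. The contiguous_strides helper below is hypothetical, not part of kernl:

```python
import torch


def contiguous_strides(size):
    """Row-major (contiguous) strides for a given size (hypothetical helper)."""
    strides = [1] * len(size)
    for i in range(len(size) - 2, -1, -1):
        strides[i] = strides[i + 1] * size[i + 1]
    return tuple(strides)


# t has a 0 stride on its first dim because it comes from expand()
t = torch.arange(6.0).reshape(1, 6).expand(3, 6)  # stride (0, 1)

pool = torch.zeros(32)
# build the pool view with contiguous strides, not t.stride():
# the written-to view then has no overlapping elements and copy_() succeeds
safe = torch.as_strided(pool, t.size(), contiguous_strides(t.size()))
safe.copy_(t)
assert torch.equal(safe, t)
```

Reading from a tensor with overlapping elements (the expanded t) is fine; it is only writing into one that PyTorch forbids, so normalizing the destination strides is enough.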
pommedeterresautee commented 1 year ago

It happens inside the CG pool management, during the copy of the input tensors. After debugging, it appears that, for some unrelated reason, one of the strides of the input tensor is 0. It shouldn't matter, as the related shape dimension is 1, so nothing iterates over it... But we use those strides (including the 0) to create a view of the pool tensor, and PyTorch raises the error.

It seems to be related to a dimension of size 1 whose stride is 0. During the creation of the view of the CUDA pool, PyTorch doesn't like the 0 and does some crazy stuff.

The same kind of behaviour is reported here: https://github.com/pytorch/pytorch/issues/33812#issuecomment-593127474 Replacing the 0 with something else fixes it.
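The failure mode can be reproduced on CPU with plain PyTorch (a minimal sketch; the shapes are made up and stand in for the tensors kernl copies into the pool, here with a 0 stride on a dim of size > 1 so that the overlap is real):

```python
import torch

# src has a 0 stride on its first dim because it comes from expand()
src = torch.arange(8.0).reshape(1, 8).expand(4, 8)
assert src.stride() == (0, 1)

pool = torch.zeros(64)
# reusing src.stride() puts the 0 into the written-to view of the pool,
# so several elements of the view alias the same memory location
bad_view = torch.as_strided(pool, src.size(), src.stride())
try:
    bad_view.copy_(src)
    err = None
except RuntimeError as e:
    err = e
print(err)
```

copy_() refuses to write into a tensor whose elements overlap, which matches the "more than one element of the written-to tensor refers to a single memory location" error from the issue.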

@lichun-wang can you try #291 and report here? On my side, the notebook works e2e.