Thank you, I can reproduce. It seems related to the new CG pool (copy_to_pool). Will look into it.
def copy_to_pool(self, t: torch.Tensor) -> torch.Tensor:
    """
    Copy the tensor t in the pool and return a tensor that is a view of the pool.
    :param t: tensor to copy in the pool
    :return: tensor copy (that is a view of the pool)
    """
    assert t.device == self.pool.device
    assert self.can_store(t)
    # 64 bits alignment
    tensor_aligned_size = get_aligned_size(t)
    new_offset = self.offset + tensor_aligned_size
    # offset is expressed in t.dtype number of elements
    new_t = torch.as_strided(
        self.pool.view(t.dtype), size=t.size(), stride=t.stride(), storage_offset=self.offset // t.element_size()
    )
    print(f"t info: {t.size()}, {t.stride()}, {t.element_size()}, {len(t.untyped_storage())}")
    print(f"new_t info: {new_t.size()}, {new_t.stride()}, {new_t.element_size()}, {len(new_t.untyped_storage())}")
    new_t.copy_(t)
    self.offset = new_offset
    return new_t
It happens inside the CG pool management, during the copy of the input tensors. After debugging, it appears that, for some unrelated reason, one of the strides of the input tensor is 0. That by itself doesn't matter, because the corresponding shape dimension is 1, so nothing ever iterates over it... But we reuse those strides (including the 0) to create a view of the pool tensor, and PyTorch raises the error on that view.
So it seems to be related to a dimension of size 1 whose stride is 0. During the creation of the view of the CUDA pool, PyTorch doesn't like the 0 stride and rejects the subsequent copy with the overlap error above.
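For context, here is a minimal sketch of the situation outside the library (the tensor and pool are made up, the zero stride is forced with as_strided purely for illustration, and whether copy_ actually raises depends on the PyTorch version):

import torch

# stand-in for the CUDA graph memory pool (CPU here, just for illustration)
pool = torch.zeros(1024)

# input tensor with a size-1 dimension whose stride is 0;
# such tensors can come out of expand()/slicing, here forced with as_strided
t = torch.arange(32, dtype=torch.float32).as_strided((4, 1, 8), (8, 0, 1))
print(t.size(), t.stride())  # torch.Size([4, 1, 8]) (8, 0, 1)

# same pattern as copy_to_pool: a view of the pool built with the input's strides
new_t = torch.as_strided(pool, size=t.size(), stride=t.stride(), storage_offset=0)
new_t.copy_(t)  # copying into this zero-stride view is where the RuntimeError appeared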
The same kind of behaviour is reported here: https://github.com/pytorch/pytorch/issues/33812#issuecomment-593127474. Simply replacing the 0 stride with something else fixes it.
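A sketch of that workaround (the helper name is made up and this is not necessarily the exact patch in #291): since the dimension has a single element, any non-zero stride works for it.

import torch

def replace_zero_strides(size, stride):
    # rewrite only strides that are 0 on a size-1 dim; with a single element along
    # that dim the stride value is never used, so 1 is a safe placeholder
    return tuple(1 if st == 0 and sz == 1 else st for sz, st in zip(size, stride))

pool = torch.zeros(1024)
t = torch.arange(32, dtype=torch.float32).as_strided((4, 1, 8), (8, 0, 1))  # stride (8, 0, 1)

new_t = torch.as_strided(
    pool, size=t.size(), stride=replace_zero_strides(t.size(), t.stride()), storage_offset=0
)
new_t.copy_(t)  # the written-to view no longer has a zero stride, so the copy goes through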
@lichun-wang can you try #291 and report back here? On my side, the notebook works e2e.
Description
I ran the BERT e2e example from the tutorial and hit a RuntimeError.
The original code runs fine, but if I change the batch dim from 1 to 4 in this line: shapes = [(1, w) for w in range(8, 128 + 8, 8)], I get the error below:
RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.
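Concretely, the only change to the tutorial code is this line:

# original line from the tutorial (batch dim 1)
shapes = [(1, w) for w in range(8, 128 + 8, 8)]
# modified line that triggers the error (batch dim 4)
shapes = [(4, w) for w in range(8, 128 + 8, 8)]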
example code as below
Steps to reproduce
all in description
Expected Behavior
all in description
Actual Behavior
all in description