Motivation and Context

This PR fixes an issue introduced by commit 1647d7e9.

Description

After commit 1647d7e9, even basic training code no longer works: it fails with a runtime error, and every model that contains conv2d layers fails to train properly.
After diagnosing the problem, I found that the bug is caused by code generated by autogen_diopi_wrapper.py. The erroneous generated code is as follows:

at::Tensor out = nodispatch::empty(output_size, input.options(), at::MemoryFormat::Preserve);

at::MemoryFormat::Preserve means that a tensor's memory layout should stay consistent across operations or transformations. It is meaningless on this line: the caller has to choose a concrete memory layout for the empty tensor being created, and a freshly created tensor has no existing layout to preserve.
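To make this concrete, torch.preserve_format only has meaning when there is a source tensor whose layout can be copied. A minimal stock-PyTorch illustration (my own sketch, not code from this repo):

import torch

# A channels_last source tensor: its strides encode the NHWC layout.
x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)

# empty_like has a source tensor, so "preserve" can copy its layout;
# the result keeps the channels_last strides of x.
y = torch.empty_like(x, memory_format=torch.preserve_format)
print(y.is_contiguous(memory_format=torch.channels_last))  # True

# A brand-new empty tensor has no source layout to preserve, so the
# caller must pick a concrete format instead (contiguous, channels_last, ...).
z = torch.empty(2, 3, 4, 5, memory_format=torch.channels_last)
print(z.is_contiguous(memory_format=torch.channels_last))  # True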
In PyTorch's source code, there is logic that checks for exactly this error:
void empty_tensor_restride(MemoryFormat memory_format) {
  if (has_symbolic_sizes_strides_) {
    empty_tensor_restride_symint(memory_format);
    return;
  }
#ifdef DEBUG
  TORCH_INTERNAL_ASSERT(
      compute_numel() == numel_,
      "If you are seeing this error, that means empty_tensor_restride was "
      "called before setting correct numel");
#endif
  switch (memory_format) {
    case MemoryFormat::Contiguous: {
      // dim_ is a virtual call, don't repeat it
      const auto dim_ = dim();
      sizes_and_strides_.resize(dim_);
      if (dim_ > 0) {
        bool overflowed = false;
        const auto last_idx = dim_ - 1;
        sizes_and_strides_.stride_at_unchecked(last_idx) = 1;
        for (auto i = last_idx - 1; i >= 0; --i) {
          overflowed |= c10::mul_overflows(
              sizes_and_strides_.stride_at_unchecked(i + 1),
              std::max<int64_t>(
                  sizes_and_strides_.size_at_unchecked(i + 1), 1),
              std::addressof(sizes_and_strides_.stride_at_unchecked(i)));
        }
        TORCH_CHECK(!overflowed, "Stride calculation overflowed");
      }
      break;
    }
    case MemoryFormat::ChannelsLast: {
      TORCH_CHECK(
          dim() == 4, "required rank 4 tensor to use channels_last format");
      set_sizes_and_strides(sizes(), get_channels_last_strides_2d(sizes()));
      break;
    }
    case MemoryFormat::ChannelsLast3d: {
      TORCH_CHECK(
          dim() == 5,
          "required rank 5 tensor to use channels_last_3d format");
      set_sizes_and_strides(sizes(), get_channels_last_strides_3d(sizes()));
      break;
    }
    case MemoryFormat::Preserve:
      TORCH_CHECK(false, "unsupported memory format ", memory_format);
      // Cleaning warning messages, no need to break as TORCH_CHECK(false)
      // terminates flow.
      // break;
    case MemoryFormat::NumOptions:
      TORCH_INTERNAL_ASSERT(false, "invalid memory format ", memory_format);
  }
  // recompute contiguous flag, as currently NHWC/NCHW flags are not mutually
  // exclusive see #24090
  refresh_contiguous();
}
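For intuition about what the Contiguous and ChannelsLast branches above compute, here is a rough Python mirror of the resulting strides for a rank-4 size (my own sketch, not code from PyTorch). Note that the Preserve case computes no strides at all and goes straight to the TORCH_CHECK(false, ...) error:

def contiguous_strides(n, c, h, w):
    # NCHW: the last dimension is densest, as in the Contiguous branch.
    return (c * h * w, h * w, w, 1)

def channels_last_strides(n, c, h, w):
    # NHWC in memory while sizes stay (N, C, H, W), matching
    # get_channels_last_strides_2d.
    return (h * w * c, 1, w * c, c)

print(contiguous_strides(2, 3, 4, 5))     # (60, 20, 5, 1)
print(channels_last_strides(2, 3, 4, 5))  # (60, 1, 15, 3)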
Further debugging revealed that the code generation error was caused by a combination of the code generation logic and the Ascend convert config file. In op_memory_format_converter.py, there is logic that reads each op's memory format parameter from the config file:
class ConvertConfig(object):
    # This class is used to load and parse the convert_config.yaml
    def __init__(self, config_yaml):
        self.convert_dict = dict()
        self.convert_config_yaml = config_yaml
        self.default_layout = "empty"
        assert isinstance(config_yaml, list)
        for config in config_yaml:
            assert isinstance(config, dict)
            for interface in config.keys():
                if interface == "common_config":
                    detail = config[interface]
                    assert isinstance(detail, dict)
                    if "layout" in detail:
                        self.default_layout = self.layout2memoryformat(detail["layout"])
                    pass
                    # may add common behavior
            for interface in config.keys():
                if interface != "common_config":
                    self.convert_dict.setdefault(interface, dict())
                    detail = config[interface]
                    assert isinstance(detail, dict)
                    if "layout" in detail:
                        self.convert_dict[interface]["layout"] = (
                            self.layout2memoryformat(detail["layout"])
                        )

    def layout2memoryformat(self, layout):
        # used when parsing convert_config.yaml, returns the memory format based on NCHW/NHWC and other layouts.
        assert isinstance(layout, str)
        if "NCHW" in layout:
            return "contiguous"
        if "NLC" in layout:
            return "channellast"
        if "NHWC" in layout:
            return "channellast"
        if "NDHWC" in layout:
            return "channellast"
        return "preserve"

    def interface2memoryformat(self, interface):
        # return the preferred memory format based on the DIOPI interface.
        interface_stripped = interface.strip().split("(")[0]
        if (interface_stripped not in self.convert_dict) or (
            "layout" not in self.convert_dict[interface_stripped]
        ):
            return self.default_layout
        else:
            return self.convert_dict[interface_stripped]["layout"]
It reads each op's memory format parameter from ascend/convert_config.yaml; if an op is not specified there, it falls back to the default memory format from common_config in the same YAML file.
And here is the problem: after the "layout: ND" lines were added, the code generator no longer uses common_config (which maps to contiguous). Because "ND" matches none of NCHW/NLC/NHWC/NDHWC, layout2memoryformat returns "preserve", so the generated conv2d code uses at::MemoryFormat::Preserve, which causes this bug.
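As a sketch of how that happens, feed ConvertConfig (shown above) the already-parsed YAML as a Python list; the interface name diopiConvolution2d below is only illustrative:

# Parsed form of convert_config.yaml (yaml.safe_load yields a list of dicts).
config = [
    {"common_config": {"layout": "NCHW"}},      # default -> "contiguous"
    {"diopiConvolution2d": {"layout": "ND"}},   # the problematic per-op entry
]
cfg = ConvertConfig(config)

# "ND" matches none of NCHW / NLC / NHWC / NDHWC, so layout2memoryformat
# falls through to "preserve" and the per-op entry overrides common_config.
print(cfg.interface2memoryformat("diopiConvolution2d"))  # preserve

# With the "layout: ND" entry removed, the op falls back to common_config.
cfg_fixed = ConvertConfig([{"common_config": {"layout": "NCHW"}}])
print(cfg_fixed.interface2memoryformat("diopiConvolution2d"))  # contiguous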
In order to test and run the Ascend CI train-one-iter for models that contain conv2d, I need to delete these "layout: ND" lines as a preliminary fix for this bug. @jingguo-st
Use cases (Optional)
BC-breaking (Optional)
Checklist
Before PR:
[x] I have read and followed the workflow indicated in the Contributors.md to create this PR.
[x] Pre-commit or linting tools indicated in Contributors.md are used to fix the potential lint issues.
[ ] Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
[ ] New functionalities are covered by complete unit tests. If not, please add more unit tests to ensure correctness.
[ ] The documentation has been modified accordingly, including docstring or example tutorials.
After PR:
[x] CLA has been signed and all committers have signed the CLA in this PR.