The Python traceback shows that the dnn apply op `conv1(x)` failed:
In MegEngine/MegDNN, it is common to run `shape_infer` when we do something like `(output,) = apply(op, inp, weight)` and dispatch the kernel to the MegDNN computing library. From the input descriptions alone we can often (though not always) infer information about the output, such as its shape. This is convenient in some situations: for example, if we need the shape of the convolution filter in the i-th layer, we do not have to feed in data and run the computation until that filter tensor is actually produced. Instead, we try to infer the information in advance and read it directly when needed.
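As a rough illustration, here is a minimal sketch using the public Python API (the module parameters and input shape are my own assumptions, not taken from this issue): the output shape of a convolution can usually be reported from the input descriptions alone, before the kernel actually runs.

import numpy as np
import megengine as mge
import megengine.module as M

# A 3x3 convolution with padding 1 keeps the spatial size unchanged.
conv = M.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
x = mge.tensor(np.zeros((1, 1, 4, 4), dtype="float32"))
y = conv(x)

# Ideally this is answered by shape inference on the input descriptions,
# without waiting for the convolution kernel to finish.
print(y.shape)  # (1, 1, 4, 4)

On the interpreter side, the relevant getters look like this: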
TensorShape ChannelImpl::get_shape(Handle handle) {
    auto info = reinterpret_cast<TensorInfo*>(handle);
    // If the layout is already known (ndim != 0), return it directly.
    if (info->desc.layout.ndim != 0) {
        return info->desc.layout;
    }
    // Otherwise block until the tensor has actually been computed and read its layout.
    TensorShape ret = wait_tensor(info, TensorProp::Shape)->layout();
    return ret;
}
DType ChannelImpl::get_dtype(Handle handle) {
    auto info = reinterpret_cast<TensorInfo*>(handle);
    MGB_RECORD_EVENT(TensorGetPropEvent, info->id, TensorProp::DType);
    // The dtype is always present in the recorded descriptor, so this never waits.
    auto ret = info->desc.layout.dtype;
    return ret;
}
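A hedged, usage-level sketch of the difference between the two getters above (assuming the lazy execution model described in this issue; the input tensor here is made up): reading the dtype never has to wait, while reading the shape may fall back to `wait_tensor` when inference was not possible.

import numpy as np
import megengine as mge
import megengine.functional as F

x = mge.tensor(np.ones((1, 1, 2, 2), dtype="float32"))
# Resize/interpolate is the op this issue identifies as not shape-inferable in advance.
y = F.nn.interpolate(x, scale_factor=1, mode="nearest")

print(y.dtype)  # served from the recorded descriptor (ChannelImpl::get_dtype)
print(y.shape)  # may block until the kernel ran (ChannelImpl::get_shape -> wait_tensor)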
In MegDNN, `filter_ndim = 0` means the filter's `ndim` is unknown (that is, it cannot be inferred in advance). But this SHOULD NOT break our programs, because the shape information becomes available once the required computation is done. So https://github.com/MegEngine/MegEngine/commit/df5ebd3da7495b8eb8f079651dbe980c5f4d7d37 is the fix.
It seems the problem has been solved. But if you are still confused about why the workarounds mentioned in this issue are valid, try debugging the example program yourself. The RuntimeError is raised from MegDNN, so you need gdb. Let's do it!
1. Build MegEngine from source with `-DCMAKE_BUILD_TYPE=Debug`.
2. Run `gdb python3`, `catch throw`, then `run example.py` (it might take a while depending on the machine); the exception comes from `.../imperative/src/impl/ops/convolution.cpp:127`.
3. Add a breakpoint at that line and `r` two times; it stops at `desc.layout = do_shape_infer(def, src_ndim, src, filter)`. `p filter` and you will see that its `ndim` is `0`, which means `shape_infer` failed. But when did it start to fail?
4. Read the stack info and add a breakpoint at `.../imperative/src/impl/interpreter/interpreter_impl.cpp:440`.
5. Then `r` and `l`, and read the `apply_op_impl` source code:
SmallVector<Handle> ChannelImpl::apply_op_impl(
        /* op and input handles ... */) {
    // ... (input_infos and input_descs are declared above this block)
    {
        MGB_LOCK_GUARD(m_info_spin);
        for (auto i : inputs) {
            auto info = reinterpret_cast<TensorInfo*>(i);
            mgb_assert(
                    !info->invalid,
                    "an input tensor is unusable due to previous error");
            input_infos.push_back(info);
            // The descriptor recorded here is what shape inference will see later;
            // if inference failed for this tensor, info->desc.layout.ndim is still 0.
            input_descs.push_back(info->desc);
        }
    }
    // ...
Set a breakpoint condition of `info->desc.layout.ndim == 0`, then `r` until it breaks here, and check `py-bt`:
File ".../imperative/python/megengine/functional/nn.py", line 261, in conv2d
(output,) = apply(op, inp, weight)
File ".../imperative/python/megengine/module/conv.py", line 422, in calc_conv
self.compute_mode,
File ".../imperative/python/megengine/module/conv.py", line 426, in forward
return self.calc_conv(inp, self.weight, self.bias)
File ".../imperative/python/megengine/module/module.py", line 142, in __call__
outputs = self.forward(*inputs, **kwargs)
File "example.py", line 21, in forward
x = self.conv2(x)
This means that while dispatching the `conv2(x)` kernel, its input had lost its shape information. Reading the source code of `ChannelImpl::dispatch_kernel()`, we can find the profiling-related logic:
void ChannelImpl::dispatch_kernel(
std::shared_ptr<OpDef> op, const SmallVector<TensorInfo*>& input_infos,
const SmallVector<LogicalTensorDesc>& input_descs,
// ...
std::optional<StackManager::Guard> guard;
if (Profiler::is_profiling()) {
guard.emplace(op->trait()->make_name(*op), &state.stack_manager);
}
    auto [output_descs, validated] =
            OpDef::infer_output_attrs_fallible(*op, input_descs);
    // `validated` indicates whether shape inference fully succeeded;
    // the profiler records it via ShapeInferEvent.
    MGB_RECORD_EVENT(ShapeInferEvent, validated);
if (Profiler::is_profiling()) {
// ...
}
}
So using the profiler might be helpful. Following the profiler documentation, we get the following picture:
Note that `nr_shape_infer_failure` is recorded, so we need to focus on the first time it increases.
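For reference, here is a minimal sketch of how such a profile could be collected from Python; the `path` argument and the exact output format are assumptions, so check the profiler documentation for the real options.

import numpy as np
import megengine as mge
from megengine.utils.profiler import Profiler

model = SimpleModel()  # the model from example.py above
x = mge.tensor(np.random.randn(1, 1, 2, 2).astype("float32"))

# Run one forward pass under the profiler and dump a trace for later inspection.
with Profiler(path="./profile"):
    model(x)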
Now we have the answer: the `Resize` op (used by `interpolate`) cannot infer its output shape in advance. We can insert a `reshape` op between the `interpolate` and `conv2` ops as follows:
class SimpleModel(M.Module):
    # ...
    def forward(self, x):
        x = self.conv1(x)
        x = F.nn.interpolate(x, scale_factor=1, mode="nearest").reshape(1, 1, 2, 2)
        x = self.conv2(x)
        return x
Run the program again. Congratulations, everything works now.
Description
Training with a single GPU and gradient clipping in this codebase causes an error with MegEngine 1.9.x. After one iteration of autodiff and parameter update, the next forward pass of the model breaks. Error message:
Here is the simplest example to reproduce this problem:
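(The original snippet is not included here. The following is a hedged reconstruction pieced together from the details in this issue, i.e. a SimpleModel with conv1 / interpolate / conv2, gradient clipping, a `+ 0` on the loss, and a single GPU; the exact shapes, hyperparameters, and reduction are assumptions.)

import numpy as np
import megengine as mge
import megengine.functional as F
import megengine.module as M
import megengine.optimizer as optim
from megengine.autodiff import GradManager


class SimpleModel(M.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = M.Conv2d(1, 1, kernel_size=3, padding=1)
        self.conv2 = M.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.conv1(x)
        x = F.nn.interpolate(x, scale_factor=1, mode="nearest")
        x = self.conv2(x)
        return x


model = SimpleModel()
gm = GradManager().attach(model.parameters())
opt = optim.SGD(model.parameters(), lr=0.1)

x = mge.tensor(np.random.randn(1, 1, 2, 2).astype("float32"))
for step in range(2):  # on MegEngine 1.9.x the second forward pass raises the RuntimeError
    with gm:
        loss = (model(x) + 0).sum()  # the "+ 0" mirrors Solution 3 below
        gm.backward(loss)
    optim.clip_grad_norm(model.parameters(), max_norm=1.0)  # the gradient clipping from Solution 1
    opt.step().clear_grad()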
Workaround
Solution 1: Comment out this line in megdiffusion.scripts.train:

Then we can train the model without clipping gradients. (But that is not what we want...)
Solution 2: This situation did not happen when using distributed training.
Solution 3: Try changing `loss = model(x) + 0` to `loss = model(x)`
Solution 4: Try deleting `x = F.nn.interpolate(x, scale_factor=1, mode="nearest")`
Issue Track
This problem was fixed in https://github.com/MegEngine/MegEngine/commit/df5ebd3da7495b8eb8f079651dbe980c5f4d7d37 so you can wait for the MegEngine v1.10 release, or build MegEngine from source at a commit newer than this one.