Open athitten opened 4 months ago
The image was from a somewhat old environment, pjnl-20240417, which has
@xwang233 To build NeMo on top of pjnl, I used the pjnl container (gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:pjnl-latest) from last Friday. Is the error fixed in the newest version of the pjnl container? I can try with that if that's the case. Also, just to make sure: the pjnl container name stays the same and it gets updated with the latest versions of the software/packages each time, right?
I don't know how to build or import NeMo so I can't verify that.
The image pjnl-latest is always the latest build. You can use pjnl-YYYYMMDD to pin a dated image in your build.
Triage review: @athitten, can you help us understand whether this happens on more recent versions of nvFuser?
@kevinstephano Maybe we should update the nvFuser error message to print the Python type, so that we have a better chance of reproducing this?
This is potentially a little more weird. The error is actually suggesting that the reshape shape is composed of something other than Python integers or nvFuser Scalars, suggesting that the FusionDefinition was malformed for reshape.
I need to add something like the following to report the type in pybind11:
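#include <pybind11/pybind11.h>
#include <iostream>
#include <string>

namespace py = pybind11;

// Print the Python type of obj, e.g. "<class 'int'>".
void check_type(py::handle obj) {
  if (obj) {
    // Borrowed handle to type(obj); str() of it gives the printable name.
    py::handle type = py::type::handle_of(obj);
    std::string type_name = static_cast<std::string>(py::str(type));
    std::cout << "Object type: " << type_name << std::endl;
  } else {
    // A null handle has no type to query.
    std::cout << "Error: Failed to get object type" << std::endl;
  }
}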
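The same idea, putting the offending type into the error itself rather than printing it separately, can be sketched in plain C++ without pybind11. This is only an illustration of the suggestion in this thread, not nvFuser code: `Scalar`, `Unsupported`, `ShapeEntry`, and `validate_reshape_shape` are hypothetical names, and the message wording is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not nvFuser types.
struct Scalar { int index; };                     // plays the role of an nvFuser Scalar
struct Unsupported { std::string python_type; };  // any other Python object, by type name

using ShapeEntry = std::variant<std::int64_t, Scalar, Unsupported>;

// Throws std::invalid_argument naming the position and type of the first
// shape entry that is neither an integer nor a Scalar.
void validate_reshape_shape(const std::vector<ShapeEntry>& shape) {
  for (std::size_t i = 0; i < shape.size(); ++i) {
    if (const auto* bad = std::get_if<Unsupported>(&shape[i])) {
      std::ostringstream msg;
      msg << "reshape: shape[" << i << "] has unsupported type "
          << bad->python_type << "; expected int or Scalar";
      throw std::invalid_argument(msg.str());
    }
  }
}
```

With a message of this kind, a report like the one in this issue would already name which entry of the shape was malformed and what Python type it had.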
I tried with the latest nvFuser and was not able to reproduce this error. We can perhaps close this issue. I will open a new one if something similar comes up in the future.
🐛 Bug
Applying thunder.jit to the conv operation in the UNet model of NeMo Stable Diffusion gives an error:
To Reproduce
Steps to reproduce the behavior:
Full stack trace:
cc: @tfogal