I found the bug when sync the latest TVM commits into our own repo, after that lots of our test cases are failed.
The bug will cause the case that all of the operators are annotated to executed on a custom device fail, because the "target" is set to "llvm" incorrectly instead of the custom target like "xxpu".
Even through the root cause of the failure is difficult to describe and understand, but the bug is very obvious.
Before changes of #7518.
Target GetTargetFromInteger(int64_t dev_type) {
if (targets_.size() == 1) {
// homogeneous execution.
const auto& it = targets_.begin();
return (*it).second;
} else {
// heterogeneous execution.
std::string call_dev_name;
if (dev_type == 0) {
--> call_dev_name = "llvm"; <--- Here haven't set "dev_type", "dev_type" is still equals to 0.
} else {
call_dev_name = runtime::DeviceName(dev_type);
}
if (targets_.count(dev_type) == 0) {
LOG(FATAL) << "No target is provided for device " << call_dev_name;
}
--> return targets_[dev_type]; <--- Use 0 get the correct "target".
}
}
...
target = GetTargetFromInteger(call_dev_type);
After changes of #7518.
std::string call_dev_name;
if (call_dev_type == 0) {
call_dev_name = "llvm";
---> call_dev_type = kDLCPU; <--- Here shouldn't change the value of "call_dev_type".
} else {
call_dev_name = ::tvm::runtime::DeviceName(call_dev_type);
}
if (targets_.count(call_dev_type) == 0) {
std::stringstream msg;
msg << "No target is specified for provided device name: `" << call_dev_name << "`\n\n";
....
}
LOG(FATAL) << msg.str();
}
---> target = targets_[call_dev_type]; <--- The 0th item of "targets_" is the correct target, but the "call_dev_type" is set to kDLCPU(i.e., 1), so the target is wrong.
I think the authors of #7518 is mislead by the code call_dev_name = "llvm";, if we set the target dictionary to {"cpu": "llvm", "xxpu": "xxpu"} when call "relay.build", then the size of targets_ at above code is 3 instead of 2, the 0th item of targets_ maybe is "xxpu", so here the call_dev_type can't be changed to kDLCPU.
I found the bug when sync the latest TVM commits into our own repo, after that lots of our test cases are failed.
The bug will cause the case that all of the operators are annotated to executed on a custom device fail, because the "target" is set to "llvm" incorrectly instead of the custom target like "xxpu".
Even through the root cause of the failure is difficult to describe and understand, but the bug is very obvious.
Before changes of #7518.
After changes of #7518.
I think the authors of #7518 is mislead by the code
call_dev_name = "llvm";
, if we set the target dictionary to{"cpu": "llvm", "xxpu": "xxpu"}
when call "relay.build", then the size oftargets_
at above code is 3 instead of 2, the 0th item oftargets_
maybe is "xxpu", so here the call_dev_type can't be changed to kDLCPU.@csullivan @jroesch @tqchen