apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0
11.6k stars 3.44k forks source link

[Bug] Heterogeneous Execution Get "target" Bug Introduced by #7518 #8536

Closed Johnson9009 closed 3 years ago

Johnson9009 commented 3 years ago

I found the bug when sync the latest TVM commits into our own repo, after that lots of our test cases are failed.

The bug will cause the case that all of the operators are annotated to executed on a custom device fail, because the "target" is set to "llvm" incorrectly instead of the custom target like "xxpu".

Even through the root cause of the failure is difficult to describe and understand, but the bug is very obvious.

Before changes of #7518.

  Target GetTargetFromInteger(int64_t dev_type) {
    if (targets_.size() == 1) {
      // homogeneous execution.
      const auto& it = targets_.begin();
      return (*it).second;
    } else {
      // heterogeneous execution.
      std::string call_dev_name;
      if (dev_type == 0) {
-->     call_dev_name = "llvm";   <--- Here haven't set "dev_type", "dev_type" is still equals to 0.
      } else {
        call_dev_name = runtime::DeviceName(dev_type);
      }
      if (targets_.count(dev_type) == 0) {
        LOG(FATAL) << "No target is provided for device " << call_dev_name;
      }
 -->  return targets_[dev_type];  <--- Use 0 get the correct "target".
    }
  }
...
target = GetTargetFromInteger(call_dev_type);

After changes of #7518.

      std::string call_dev_name;
      if (call_dev_type == 0) {
        call_dev_name = "llvm";
--->    call_dev_type = kDLCPU;   <--- Here shouldn't change the value of "call_dev_type".
      } else {
        call_dev_name = ::tvm::runtime::DeviceName(call_dev_type);
      }

      if (targets_.count(call_dev_type) == 0) {
        std::stringstream msg;
        msg << "No target is specified for provided device name: `" << call_dev_name << "`\n\n";
....
        }
        LOG(FATAL) << msg.str();
      }

--->  target = targets_[call_dev_type];     <--- The 0th item of "targets_" is the correct target, but the "call_dev_type" is set to kDLCPU(i.e., 1), so the target is wrong.

I think the authors of #7518 is mislead by the code call_dev_name = "llvm";, if we set the target dictionary to {"cpu": "llvm", "xxpu": "xxpu"} when call "relay.build", then the size of targets_ at above code is 3 instead of 2, the 0th item of targets_ maybe is "xxpu", so here the call_dev_type can't be changed to kDLCPU.

@csullivan @jroesch @tqchen

comaniac commented 3 years ago

@jroesch this might be the root cause of the issue I mentioned in the forum days ago.