deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

Warning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::runCudaFusionGroup. #2240

Open adepase opened 1 year ago

adepase commented 1 year ago

Description

Running the code specified below, I get a number of warnings at the beginning, apparently harmless.

```
Training: 0% |= | Accuracy: , SoftmaxCrossEntropyLoss:
Training: 0% |= | Accuracy: , SoftmaxCrossEntropyLoss:
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\codegen\cuda\manager.cpp:336] Warning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::runCudaFusionGroup. This is an indication that codegen Failed for some reason. To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback (function runCudaFusionGroup)
[... the same warning repeated several more times ...]
```

```
Training: 0% |= | Accuracy: , SoftmaxCrossEntropyLoss: , speed: 129,14 items/sec
Training: 0% |= | Accuracy: , SoftmaxCrossEntropyLoss: , speed: 24,89 items/sec
Training: 1% |= | Accuracy: 0,01, SoftmaxCrossEntropyLoss: 5,06, speed: 154,95 items/sec
Training: 2% |= | Accuracy: 0,01, SoftmaxCrossEntropyLoss: 5,06, speed: 145,35 items/sec
```

I set the env variable as requested, and I report the full error message in the corresponding section. Are these warnings really harmless, or should I worry?
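As a side note on how the variable can be set: an environment variable cannot be changed for an already-running JVM, so when launching from an IDE one option is to start the trainer in a child process with `PYTORCH_NVFUSER_DISABLE=fallback` in its environment. A minimal sketch, assuming the main class from the reproduction steps (`NvFuserLauncher` is a hypothetical helper, not part of DJL):

```java
// Hypothetical launcher (not part of DJL): runs a main class in a child JVM
// with PYTORCH_NVFUSER_DISABLE=fallback set, so nvfuser codegen failures
// surface as hard errors instead of silent FALLBACK warnings.
public class NvFuserLauncher {
    public static ProcessBuilder buildTrainerProcess(String mainClass) {
        ProcessBuilder pb = new ProcessBuilder("java", "-cp",
                System.getProperty("java.class.path"), mainClass);
        // Disable the nvfuser codegen fallback path, as the warning suggests.
        pb.environment().put("PYTORCH_NVFUSER_DISABLE", "fallback");
        pb.inheritIO(); // forward the child's stdout/stderr to this console
        return pb;
    }

    public static void main(String[] args) throws Exception {
        // Class name taken from the stack trace in this issue.
        Process p = buildTrainerProcess("it.algaware.mrjvs.djl.Test2").start();
        p.waitFor();
    }
}
```

From a terminal the same effect is a plain `set PYTORCH_NVFUSER_DISABLE=fallback` (Windows) before running.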

Expected Behavior

No warning if harmless (or at least a clearer warning); otherwise, a fix.

Error Message

```
Training: 0% |= | Accuracy: , SoftmaxCrossEntropyLoss:
Training: 0% |= | Accuracy: , SoftmaxCrossEntropyLoss:
ai.djl.engine.EngineException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: false INTERNAL ASSERT FAILED at "C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\codegen\cuda\executor_utils.cpp":1181, please report a bug to PyTorch.
```

The generated CUDA source dumped with the assertion:

```cpp
namespace CudaCodeGen {

typedef signed char int8_t;
typedef unsigned char uint8_t;
typedef short int int16_t;
typedef unsigned short int uint16_t;
typedef int int32_t;
typedef unsigned int uint32_t;
typedef long long int int64_t;
typedef unsigned long long int uint64_t;
typedef int nvfuser_index_t;

#define POS_INFINITY __int_as_float(0x7f800000)
#define INFINITY POS_INFINITY
#define NEG_INFINITY __int_as_float(0xff800000)
#define NAN __int_as_float(0x7fffffff)

namespace std {

template <class _Tp> _Tp&& __declval(int);
template <class _Tp> _Tp __declval(long);
template <class _Tp> decltype(__declval<_Tp>(0)) declval() noexcept;

template <class _Tp, _Tp v> struct integral_constant {
  static const _Tp value = v;
  typedef _Tp value_type;
  typedef integral_constant type;
};

typedef integral_constant<bool, true> true_type;
typedef integral_constant<bool, false> false_type;

// is_same, functional
template <class _Tp, class _Up> struct is_same : public false_type {};
template <class _Tp> struct is_same<_Tp, _Tp> : public true_type {};

// is_integral, for some types.
template <class _Tp> struct is_integral : public integral_constant<bool, false> {};
```

[**** OMISSIS: I received an error: "Comment is too long (maximum is 65536 characters)", so I cut many lines ****]

```cpp
  NVFUSER_UPDATE_MAGIC_ZERO
  if ((((((nvfuser_index_t)threadIdx.x) * 4) + 3) < T0.size[0])) {
    loadLocalToGlobal<float, 4, false>(
        &T18[((((nvfuser_index_t)blockIdx.x) * T0.size[0]) + i256], &T23[0]);
  }
}
}
}
```

```
CUDA NVRTC compile error: nvrtc: error: failed to open nvrtc-builtins64_117.dll.
  Make sure that nvrtc-builtins64_117.dll is installed correctly.

	at ai.djl.pytorch.jni.PyTorchLibrary.moduleForward(Native Method)
	at ai.djl.pytorch.jni.IValueUtils.forward(IValueUtils.java:47)
	at ai.djl.pytorch.engine.PtSymbolBlock.forwardInternal(PtSymbolBlock.java:154)
	at ai.djl.nn.AbstractBaseBlock.forwardInternal(AbstractBaseBlock.java:128)
	at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:93)
	at ai.djl.nn.SequentialBlock.forwardInternal(SequentialBlock.java:209)
	at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:93)
	at ai.djl.training.Trainer.forward(Trainer.java:189)
	at ai.djl.training.EasyTrain.trainSplit(EasyTrain.java:122)
	at ai.djl.training.EasyTrain.trainBatch(EasyTrain.java:110)
	at ai.djl.training.EasyTrain.fit(EasyTrain.java:58)
	at it.algaware.mrjvs.djl.Test2.getIntentAll(Test2.java:239)
	at it.algaware.mrjvs.djl.Test2.main(Test2.java:82)
```
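The decisive line in the dump is NVRTC failing to open `nvrtc-builtins64_117.dll` (the CUDA 11.7 runtime-compilation builtins): when that library is not visible to the process, nvfuser's compiled kernels fail and the FALLBACK path is taken. A dependency-free way to check whether the DLL is reachable from the JVM (`NativeLibLocator` is a hypothetical diagnostic helper, not part of DJL or PyTorch):

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical diagnostic helper: scans the PATH and java.library.path
// entries for a native library file such as nvrtc-builtins64_117.dll,
// which NVRTC must be able to load at runtime.
public class NativeLibLocator {
    public static String find(String libFileName) {
        String envPath = System.getenv("PATH");
        String libPath = System.getProperty("java.library.path", "");
        String joined = (envPath == null ? "" : envPath)
                + File.pathSeparator + libPath;
        return Arrays.stream(joined.split(File.pathSeparator))
                .filter(dir -> !dir.isEmpty())
                .map(dir -> new File(dir, libFileName))
                .filter(File::isFile)
                .map(File::getAbsolutePath)
                .findFirst()
                .orElse(null); // null -> the DLL is not visible to this process
    }

    public static void main(String[] args) {
        String hit = find("nvrtc-builtins64_117.dll");
        System.out.println(hit != null
                ? "Found: " + hit
                : "nvrtc-builtins64_117.dll not found; nvfuser will fall back");
    }
}
```

If the DLL is missing, reinstalling the matching CUDA toolkit (or adding its `bin` directory to `PATH`) would be the obvious thing to try.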

How to Reproduce?

I run the code I already posted in https://github.com/deepjavalibrary/djl/issues/2144#issuecomment-1356405023, but with the following change when loading the modelPath (with distilbert it seems to work; the issue above occurs only with bert, and it seems to me there is no relation between the two):

```java
.optModelPath(Paths.get("build/pytorch/traced_distilbert_wikipedia_uncased"))
//.optModelPath(Paths.get("build/pytorch/bert/bertBase"))
```

Steps to reproduce

Create class, run main

What have you tried to solve it?

Nothing yet; it seems harmless. I'm just reporting it and asking.

Environment Info

Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:

I receive an error from the terminal, but I'm working from Eclipse. If you need other information, please let me know.
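When `./gradlew debugEnv` cannot be run from a terminal, a minimal stand-in that runs inside Eclipse can still capture the basic facts the maintainers usually ask for (this is a sketch, not the real `debugEnv` task, which reports much more, e.g. CUDA and engine details):

```java
// Minimal, hypothetical stand-in for `./gradlew debugEnv`: prints the basic
// JVM/OS properties relevant to native-library problems like a missing DLL.
public class DebugEnvLite {
    public static String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("os.name=").append(System.getProperty("os.name")).append('\n');
        sb.append("os.arch=").append(System.getProperty("os.arch")).append('\n');
        sb.append("java.version=").append(System.getProperty("java.version")).append('\n');
        sb.append("java.library.path=").append(System.getProperty("java.library.path"));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```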

KexinFeng commented 1 year ago

So, is this just a flush of warnings rather than an error?

It seems that these warnings are thrown from the PyTorch library. One reason could be that the bert model *.pt file is not loaded in the proper way. As mentioned in https://github.com/deepjavalibrary/djl/issues/2144#issuecomment-1360144067, it is possible that the model is not exactly the same if you do the switch:

```java
.optModelPath(Paths.get("build/pytorch/traced_distilbert_wikipedia_uncased"))
//.optModelPath(Paths.get("build/pytorch/bert/bertBase"))
```

This is model-level debugging. Could you narrow down the issue?

adepase commented 1 year ago

It is a flush of warnings, apparently harmless. I just opened this issue to ask you if you understand it better and it can be harmful in other contexts, but the overall training and use of the trained model seems to be ok.

The switch above referenced previously posted code, which comes from an example (if you reread issue #2144 from the beginning, you'll find there are really very few changes). The switch was a switch back to the previous code: #2144 began with a distilbert example, then I switched to a bert example to understand the differences in modelling and in the blocks, then I switched back to distilbert. The distilbert model is loaded the same way as in the original example.

I have no idea how to narrow down the issue in this case. But cleaning up the code should be a good start, along with comparing it against the original example. I'll do that and come back with a new comment in this issue.

Thank you

JamRoronoa commented 1 year ago

Hi, do you have any new ideas about the warning? The same warning appeared in https://github.com/ultralytics/yolov5/issues/10333#issue-1467648446.