Jittor / jittor

Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.
https://cg.cs.tsinghua.edu.cn/jittor/
Apache License 2.0

could not create a descriptor for a dilated convolution forward propagation primitive #499

Closed Yellowfish666 closed 5 months ago

Yellowfish666 commented 6 months ago

Describe the bug

While training a CNN, reading the loss value with `loss.data.mean()` raises the error above. I have checked the input and output dimensions and debugged the computation flow inside the network without finding any problem, but the `loss.data` attribute cannot be read.
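A note on why the failure surfaces at `loss.data` rather than at the convolution call itself: Jittor builds its operator graph lazily and only executes it when a value is actually fetched, so a dtype problem in the conv input is reported at the first fetch. Below is a minimal sketch of that behavior (a hypothetical standalone example using the same uint8 input dtype as in the log; whether it reproduces the exact oneDNN message depends on the backend):

```python
import numpy as np
import jittor as jt
from jittor import nn

x = jt.array(np.zeros((1, 3, 32, 32), dtype=np.uint8))  # uint8 pixels, as in the report
conv = nn.Conv(3, 16, 3, padding=1)
y = conv(x)           # no error here: Jittor only records the op in its graph
print(y.data.mean())  # the graph executes on .data, so any conv failure appears here
```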

Full Log

```
[i 0319 22:15:38.294000 08 compiler.py:956] Jittor(1.3.8.5) src: d:\anaconda\envs\jittor\lib\site-packages\jittor
[i 0319 22:15:38.311000 08 compiler.py:957] cl at C:\Users\86199\.cache\jittor\msvc\VC_____\bin\cl.exe(19.29.30133)
[i 0319 22:15:38.311000 08 compiler.py:958] cache_path: C:\Users\86199\.cache\jittor\jt1.3.8\cl\py3.10.13\Windows-10-10.x30\12thGenIntelRCx65\default
[i 0319 22:15:38.427000 08 __init__.py:411] Found gdb(8.1) at D:\Mingw64\mingw64\bin\gdb.EXE.
[i 0319 22:15:38.461000 08 __init__.py:411] Found addr2line(2.30) at D:\Mingw64\mingw64\bin\addr2line.EXE.
[i 0319 22:15:38.461000 08 __init__.py:227] Total mem: 15.73GB, using 5 procs for compiling.
[i 0319 22:15:39.327000 08 jit_compiler.cc:28] Load ccpath: C:\Users\86199\.cache\jittor\msvc\VC____\bin\cl.exe
Traceback (most recent call last):
  File "d:\Deep_Learning\task2.py", line 96, in <module>
    train(model, train_data, train_label, test_data, test_label, criterion, optimizer)
  File "d:\Deep_Learning\task2.py", line 62, in train
    print(loss)
  File "D:\Anaconda\envs\jittor\lib\site-packages\jittor\__init__.py", line 2003, in vtos
    data_str = f"jt.Var({v.data}, dtype={v.dtype})"
RuntimeError: [f 0319 22:15:40.927000 08 executor.cc:686] Execute fused operator(58/75) failed.

[OP TYPE]: fused_op:( broadcast_to, reindex, binary.multiply, reduce.add,)
[Input]: float32[16,3,3,3,]conv1.weight, uint8[50000,3,32,32,],

[Async Backtrace]: not found, please set env JT_SYNC=1, trace_py_var=3
[Reason]: could not create a descriptor for a dilated convolution forward propagation primitive
```
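As the log itself suggests, a synchronous Python-level backtrace can be obtained by setting `JT_SYNC=1` and `trace_py_var=3`. One way to do that (a sketch; the variables can equally be exported in the shell before launching the script) is to set them before importing jittor:

```python
import os

# Make Jittor execute ops synchronously and record Python-level traces,
# so the failing operator points at the offending line. These must be set
# before jittor is imported.
os.environ['JT_SYNC'] = '1'
os.environ['trace_py_var'] = '3'

import jittor as jt
```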

Minimal Reproduce

```python
import jittor as jt
import matplotlib.pyplot as plt
import sys
from jittor import nn
import numpy as np

folder = 'cifar-10-batches-py'

def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv(3, 16, 3, padding=1)
        self.conv2 = nn.Conv(16, 32, 3, padding=1)
        self.pool = nn.Pool(2, op='maximum')
        self.fc1 = nn.Linear(32 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 32)
        self.fc3 = nn.Linear(32, 1)
        self.relu = nn.ReLU()

    def execute(self, x):
        x = self.pool(self.relu(self.conv1(x)))  # Conv1 -> ReLU -> Pool
        x = self.pool(self.relu(self.conv2(x)))
        x = x.reshape(-1, 32 * 8 * 8)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class RNN(nn.Module):
    def __init__(self):
        super(RNN, self).__init__()
        self.rnn = nn.RNN(96, 128, 2, True, True)
        self.fc1 = nn.Linear(128, 32)
        self.fc2 = nn.Linear(32, 1)
        self.relu = nn.ReLU()

    def execute(self, x):
        x, _ = self.rnn(x)
        x = self.relu(self.fc1(x[:, -1, :]))
        x = self.fc2(x)
        return x

def load_data(file):
    data = unpickle(file)
    return jt.array(data[b'data'].reshape(-1, 3, 32, 32)), jt.array(data[b'labels'])

def load_test_data():
    data = unpickle(folder + '/test_batch')
    return jt.array(data[b'data'].reshape(-1, 3, 32, 32)), jt.array(data[b'labels'])

def train(model, train_data, train_label, test_data, test_label, criterion, optimizer, num_epochs=100):
    losses = []
    for epoch in range(num_epochs):
        outputs = model(train_data)
        loss = criterion(outputs, train_label)
        print(loss)
        optimizer.step(loss)
        losses.append(loss.data.mean())
        if (epoch+1) % 10 == 0:
            print('Epoch[{}/{}], loss: {:.6f}'.format(epoch+1, num_epochs, loss.data.mean()))
    plt.plot(losses)
    plt.show()
    print('Finished Training')

    # Test the model
    model.eval()  # switch the model to evaluation mode
    with jt.no_grad():
        correct = 0
        total = 0
        outputs = model(test_data)
        _, predicted = jt.max(outputs.data, 1)
        total += test_label.size(0)
        correct += (predicted == test_label).sum().item()
        print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))

if __name__ == '__main__':
    train_data, train_label = load_data(folder + '/data_batch_1')
    for i in range(2, 6):
        data, label = load_data(folder + '/data_batch_' + str(i))
        train_data = jt.contrib.concat([train_data, data], dim=0)
        train_label = jt.contrib.concat([train_label, label], dim=0)
    test_data, test_label = load_test_data()
    train_label = train_label.reshape(-1, 1)
    test_label = test_label.reshape(-1, 1)

    print(train_data.shape, train_label.shape, test_data.shape, test_label.shape)

    # sys.exit()
    model = CNN()
    # model = RNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = jt.optim.SGD(model.parameters(), lr=0.01)
    train(model, train_data, train_label, test_data, test_label, criterion, optimizer)
```

Expected behavior

The program should be able to read the data in `loss` normally and continue training.

Yellowfish666 commented 5 months ago

Problem solved: converting the training and test data to `jt.float32` lets training run normally.
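For reference, one way to apply that fix is to cast the raw CIFAR-10 pixels to float32 before wrapping them in a `jt.array`. A minimal sketch, adapting the `load_data` helper from the reproduction above (the exact place the author performed the cast may differ):

```python
import numpy as np
import jittor as jt

def load_data(file):
    data = unpickle(file)  # unpickle as defined in the reproduction above
    # The pickled CIFAR-10 pixels are uint8; the conv backend cannot build a
    # forward-propagation descriptor for uint8 input, so cast to float32 first.
    images = jt.array(data[b'data'].reshape(-1, 3, 32, 32).astype(np.float32))
    labels = jt.array(data[b'labels'])
    return images, labels
```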