Open ZhangXiaoXuan2019 opened 2 years ago
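There may be a problem with the paths in the compile command. Could you post a more complete screenshot?
Hello, and thank you for your prompt reply. Below is a more complete version of the error message from the screenshot above. We added line breaks where needed in the information after [Reason].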
File "/home/xiaoxuan/PythonWorks/JNeRF_Fed_KD/FedML/fedml_api/standalone/fedavg/my_model_trainer.py", line 127, in local_model_render img, img_tar = self.render_img(client_s_idx) # in dataset, the image index is the client index
File "/home/xiaoxuan/PythonWorks/JNeRF_Fed_KD/FedML/fedml_api/standalone/fedavg/my_model_trainer.py", line 167, in render_img pos, dir = self.sampler.sample(img_ids, rays_o, rays_d)
File "/home/xiaoxuan/PythonWorks/JNeRF_Fed_KD/python/jnerf/models/samplers/density_grid_sampler/density_grid_sampler.py", line 137, in sample coords, rays_index, rays_numsteps, rays_numsteps_counter = self.rays_sampler.execute(
File "/home/xiaoxuan/PythonWorks/JNeRF_Fed_KD/python/jnerf/models/samplers/density_grid_sampler/ray_sampler.py", line 34, in execute coords_out, rays_index, rays_numsteps,self.ray_numstep_counter = jt.code(
RuntimeError: [f 0702 11:29:38.938348 28 executor.cc:665]
Execute fused operator(1/2) failed.
[Input]: float32[1024,3,], float32[1024,3,1,], uint8[1310720,], float32[150,11,], int32[640000,], float32[150,4,3,], float32[1048576,7,], int32[1024,1,], int32[1024,2,], int32[2,],
[Output]: float32[1048576,7,], int32[1024,1,], int32[1024,2,], int32[2,],
tools/run_fednerf.py:48 <<module>>
tools/run_fednerf.py:41 <main>
/home/xiaoxuan/PythonWorks/JNeRF_Fed_KD/FedML/fedml_api/standalone/fedavg/fedavg_api.py:116 <train>
/home/xiaoxuan/PythonWorks/JNeRF_Fed_KD/FedML/fedml_api/standalone/fedavg/client.py:28 <train>
/home/xiaoxuan/PythonWorks/JNeRF_Fed_KD/FedML/fedml_api/standalone/fedavg/my_model_trainer.py:127 <local_model_render>
/home/xiaoxuan/PythonWorks/JNeRF_Fed_KD/FedML/fedml_api/standalone/fedavg/my_model_trainer.py:167 <render_img>
/home/xiaoxuan/PythonWorks/JNeRF_Fed_KD/python/jnerf/models/samplers/density_grid_sampler/density_grid_sampler.py:137 <sample>
/home/xiaoxuan/PythonWorks/JNeRF_Fed_KD/python/jnerf/models/samplers/density_grid_sampler/ray_sampler.py:34 <execute>
[Reason]: [f 0702 11:29:38.938163 28 cache_compile.cc:295] Check failed: found Something wrong... Could you please report this issue?
Include file pcg32.h not found in [
/home/xiaoxuan/miniconda3/envs/JNeRF/lib/python3.8/site-packages/jittor/src,/home/xiaoxuan/miniconda3/envs/JNeRF/include/python3.8,
/home/xiaoxuan/miniconda3/envs/JNeRF/include/python3.8,/home/xiaoxuan/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/include,
/home/xiaoxuan/miniconda3/envs/JNeRF/lib/python3.8/site-packages/jittor/extern/cuda/inc,/home/xiaoxuan/.cache/jittor/jt1.3.4/g++9.4.0/py3.8.13/Linux-5.8.0-50x17/IntelRXeonRGolxda/default/cu11.2.152_sm_70,
/home/xiaoxuan/miniconda3/envs/JNeRF/lib/python3.8/site-packages/jittor/extern/cuda/inc,]
Commands:
"/home/xiaoxuan/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc" "/home/xiaoxuan/.cache/jittor/jt1.3.4/g++9.4.0/py3.8.13/Linux-5.8.0-50x17/IntelRXeonRGolxda/default/cu11.2.152_sm_70/jit/codeIN_SIZE_6in0_dim_2in0_type_float32in1_dim_3in1_type_float32in2_dim_1in2__hash_a7c7342d82088594_op.cc"
-std=c++14
-Xcompiler
-fPIC
-Xcompiler -march=native
-Xcompiler -fdiagnostics-color=always
-lstdc++ -ldl -shared
-I"/home/xiaoxuan/miniconda3/envs/JNeRF/lib/python3.8/site-packages/jittor/src"
-I/home/xiaoxuan/miniconda3/envs/JNeRF/include/python3.8
-I/home/xiaoxuan/miniconda3/envs/JNeRF/include/python3.8 -DHAS_CUDA -DIS_CUDA
-I"/home/xiaoxuan/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/include"
-I"/home/xiaoxuan/miniconda3/envs/JNeRF/lib/python3.8/site-packages/jittor/extern/cuda/inc"
-lcudart -L"/home/xiaoxuan/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/lib64"
-Xlinker -rpath="/home/xiaoxuan/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/lib64"
-I"/home/xiaoxuan/.cache/jittor/jt1.3.4/g++9.4.0/py3.8.13/Linux-5.8.0-50x17/IntelRXeonRGolxda/default/cu11.2.152_sm_70"
-L"/home/xiaoxuan/.cache/jittor/jt1.3.4/g++9.4.0/py3.8.13/Linux-5.8.0-50x17/IntelRXeonRGolxda/default/cu11.2.152_sm_70"
-Xlinker -rpath="/home/xiaoxuan/.cache/jittor/jt1.3.4/g++9.4.0/py3.8.13/Linux-5.8.0-50x17/IntelRXeonRGolxda/default/cu11.2.152_sm_70"
-L"/home/xiaoxuan/.cache/jittor/jt1.3.4/g++9.4.0/py3.8.13/Linux-5.8.0-50x17/IntelRXeonRGolxda/default"
-Xlinker -rpath="/home/xiaoxuan/.cache/jittor/jt1.3.4/g++9.4.0/py3.8.13/Linux-5.8.0-50x17/IntelRXeonRGolxda/default"
-l:"jit_utils_core.cpython-38-x86_64-linux-gnu".so
-l:"jittor_core.cpython-38-x86_64-linux-gnu".so
-x cu --cudart=shared -ccbin="/usr/bin/g++" --use_fast_math -w
-I"/home/xiaoxuan/miniconda3/envs/JNeRF/lib/python3.8/site-packages/jittor/extern/cuda/inc"
-arch=compute_70 -code=sm_70 -o "/home/xiaoxuan/.cache/jittor/jt1.3.4/g++9.4.0/py3.8.13/Linux-5.8.0-50x17/IntelRXeonRGolxda/default/cu11.2.152_sm_70/jit/codeIN_SIZE_6in0_dim_2in0_type_float32in1_dim_3in1_type_float32in2_dim_1in2__hash_a7c7342d82088594_op.so"
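Looking at the compile command, it indeed does not contain the path to pcg32.h. That compile option is set by the statement coords_out.compile_options = proj_options. You can print coords_out.compile_options right after that statement to check whether it contains the pcg32 path.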
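The check suggested above (does any include directory on the compile command actually contain the header?) can also be done outside Jittor. The following is a minimal sketch, not part of JNeRF or Jittor: the helper names include_dirs and find_header are hypothetical, and it assumes the flags are available as a single string, for example the command printed under "Commands:" in the log or a key of the options dict.

```python
import os
import shlex


def include_dirs(flags):
    """Return the directories passed via -I in a compiler flag string.

    Handles -Ipath, -I"path" and "-I path" (hypothetical helper,
    not a Jittor API).
    """
    dirs = []
    tokens = shlex.split(flags)  # shlex removes the surrounding quotes
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-I" and i + 1 < len(tokens):
            # Form: -I <dir>
            dirs.append(tokens[i + 1])
            i += 2
            continue
        if tok.startswith("-I") and len(tok) > 2:
            # Form: -I<dir>
            dirs.append(tok[2:])
        i += 1
    return dirs


def find_header(header, flags):
    """Return the first existing path to `header` among the -I dirs, else None."""
    for d in include_dirs(flags):
        candidate = os.path.join(d, header)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Calling find_header("pcg32.h", command_string) on the nvcc command from the log would be expected to return None if none of the listed directories contains the header, which is consistent with the "Include file pcg32.h not found" message.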
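Hello! Thank you for the suggestion.
The runtime error we reported occurs only when the render_test function is called. Calling the sampler during model training raises no runtime error.
The failing statement
coords_out, rays_index, rays_numsteps,self.ray_numstep_counter = jt.code(
comes before the statement coords_out.compile_options = proj_options, so when the error is raised that assignment has not yet executed and coords_out.compile_options is naturally an empty dict.
We also tried executing coords_out.compile_options = proj_options both before and after the failing statement, but the error persisted.
Do you have any other suggestions? :)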
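Jittor executes lazily, so in general nothing is compiled immediately after jt.code is called. Something you changed may have caused it to stop executing lazily. Could you paste the code of ray_sampler.py so I can take a look?
You could also try adding rays_o.compile_options = proj_options before the statement coords_out, rays_index, rays_numsteps,self.ray_numstep_counter = jt.code, which sets the compile options through an input.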
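To make the reasoning above concrete, here is a toy model in plain Python. It is only an illustration of why the order of "create the op, attach options, compile" matters under lazy versus eager execution. It is not Jittor's implementation, and the class LazyOp, the function run, and the include path are all hypothetical.

```python
class LazyOp:
    """Toy stand-in for a deferred operator (NOT Jittor's real mechanism)."""

    def __init__(self, name, eager=False):
        self.name = name
        self.compile_options = {}
        self.compiled_with = None  # options seen at "compile" time
        if eager:
            # Eager mode: the op is "compiled" at construction time,
            # before the caller has had a chance to attach any options.
            self.sync()

    def sync(self):
        """Compile once, snapshotting whatever options exist right now."""
        if self.compiled_with is None:
            self.compiled_with = dict(self.compile_options)
        return self.compiled_with


def run(eager):
    """Create the op, attach options afterwards, then force evaluation."""
    op = LazyOp("ray_sampler", eager=eager)
    # Options are attached after creation, as in ray_sampler.py.
    op.compile_options = {"FLAGS: -I/hypothetical/pcg32": 1}
    return op.sync()
```

In this model, run(eager=False) compiles with the include flag, while run(eager=True) compiles with an empty option set. That mirrors the symptom in the log, where the compile command lacks the pcg32 include path.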
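Hello, and thank you for your patient answers. Adding rays_o.compile_options = proj_options before the "jt.code" statement does not solve the problem.
The ray_sampler.py file is shown below. In fact, we have not yet modified ray_sampler.py or any other file under python/jnerf/. The sampler also runs without error when sampling during training. The error occurs only in the render_test function.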
import os
import jittor as jt
from jittor import Function, exp, log
import numpy as np
import sys
from jnerf.ops.code_ops.global_vars import global_headers, proj_options
jt.flags.use_cuda = 1
class RaySampler(Function):
    def __init__(self, density_grad_header, near_distance, cone_angle_constant, aabb_range=(-1.5, 2.5), n_rays_per_batch=4096, n_rays_step=1024):
        self.density_grad_header = density_grad_header
        self.aabb_range = aabb_range
        self.near_distance = near_distance
        self.n_rays_per_batch = n_rays_per_batch
        self.num_elements = n_rays_per_batch*n_rays_step
        self.cone_angle_constant = cone_angle_constant
        self.path = os.path.join(os.path.dirname(__file__), '..', 'op_include')
        self.ray_numstep_counter = jt.zeros([2], 'int32')

    def execute(self, rays_o, rays_d, density_grid_bitfield, metadata, imgs_id, xforms):
        # input
        # rays_o n_rays_per_batch x 3
        # rays_d n_rays_per_batch x 3
        # bitfield 128 x 128 x 128 x 5 / 8
        # return
        # coords_out=[self.num_elements,7]
        # rays index : store rays is used ( not for -1)
        # rays_numsteps [0:step,1:base]
        jt.init.zero_(self.ray_numstep_counter)
        coords_out = jt.empty((self.num_elements, 7), 'float32')
        self.n_rays_per_batch=rays_o.shape[0]
        rays_index = jt.empty((self.n_rays_per_batch, 1), 'int32')
        rays_numsteps = jt.empty((self.n_rays_per_batch, 2), 'int32')
        coords_out, rays_index, rays_numsteps,self.ray_numstep_counter = jt.code(
            inputs=[rays_o, rays_d, density_grid_bitfield, metadata, imgs_id, xforms], outputs=[coords_out,rays_index,rays_numsteps,self.ray_numstep_counter],
            cuda_header=global_headers+self.density_grad_header+'#include "ray_sampler.h"', cuda_src=f"""
        @alias(rays_o, in0)
        @alias(rays_d, in1)
        @alias(density_grid_bitfield,in2)
        @alias(metadata,in3)
        @alias(imgs_index,in4)
        @alias(xforms_input,in5)
        @alias(ray_numstep_counter,out3)
        @alias(coords_out,out0)
        @alias(rays_index,out1)
        @alias(rays_numsteps,out2)
        cudaStream_t stream=0;
        cudaMemsetAsync(coords_out_p, 0, coords_out->size);
        const unsigned int num_elements=coords_out_shape0;
        const uint32_t n_rays=rays_o_shape0;
        BoundingBox m_aabb = BoundingBox(Eigen::Vector3f::Constant({self.aabb_range[0]}), Eigen::Vector3f::Constant({self.aabb_range[1]}));
        float near_distance = {self.near_distance};
        float cone_angle_constant={self.cone_angle_constant};
        linear_kernel(rays_sampler,0,stream,
            n_rays, m_aabb, num_elements,(Vector3f*)rays_o_p,(Vector3f*)rays_d_p, (uint8_t*)density_grid_bitfield_p,cone_angle_constant,(TrainingImageMetadata *)metadata_p,(uint32_t*)imgs_index_p,
            (uint32_t*)ray_numstep_counter_p,((uint32_t*)ray_numstep_counter_p)+1,(uint32_t*)rays_index_p,(uint32_t*)rays_numsteps_p,PitchedPtr<NerfCoordinate>((NerfCoordinate*)coords_out_p, 1, 0, 0),(Eigen::Matrix<float, 3, 4>*) xforms_input_p,near_distance,rng);
        rng.advance();
        """)
        coords_out.compile_options = proj_options
        # print(coords_out.compile_options)
        coords_out.sync()
        coords_out = coords_out.detach()
        rays_index = rays_index.detach()
        rays_numsteps = rays_numsteps.detach()
        self.ray_numstep_counter = self.ray_numstep_counter.detach()
        samples=self.ray_numstep_counter[1].item()
        coords_out=coords_out[:samples]
        return coords_out, rays_index, rays_numsteps, self.ray_numstep_counter

    def grad(self, grad_x):
        ##should not reach here
        assert(grad_x == None)
        assert(False)
        return None
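In addition, is there a way to trace what causes the jt.code statement not to execute lazily, as you mentioned? If so, we would like to learn it, because we may make further changes on top of the JNeRF implementation in the future, and it is not realistic to trouble the JNeRF team for advice every time a problem comes up.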
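First of all, thank you for your work on JNeRF! @Gword We have recently been doing further development on top of your JNeRF implementation, and our code hits an "Execute fused operator failed" runtime error. The key error message is as follows:
More specifically, we found that the statement self.sampler.sample fails only when JNeRF's render_test function is called to render test images. Sampling elsewhere (for example during NeRF training) works without any problem.
The error message is very vague, saying only "found something wrong", which left us with no idea where to start when trying to fix it. The error message also says that "pcg32.h" was not found. We tried adding that header file to the relevant path, but the error persisted, and the message then reported that "ray_sampler.h" was not included.
Could the JNeRF team assess where the problem might lie and give us some advice? Thank you!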