CosmoStat / autometacal

Metacalibration and shape measurement by automatic differentiation
MIT License

galflow.shear fails on GPU #5

Closed andrevitorelli closed 3 years ago

andrevitorelli commented 3 years ago

I know this is either a galflow-specific issue or something even further upstream, but I'll post it here for tracking. gf.shear fails with a dead kernel because perspective_transform from tfg-nightly fails. That, in turn, happens because _resampler_ops.so fails with the message:

F tensorflow_addons/custom_ops/image/cc/kernels/resampler_ops_gpu.cu.cc:126] Non-OK-status: GpuLaunchKernel( Resampler2DKernel<T>, config.block_count, config.thread_per_block, 0, d.stream(), data, warp, output, batch_size, data_height, data_width, data_channels, num_sampling_points) status: Internal: no kernel image is available for execution on the device

This was not happening before. It seems to have started this week, after I changed my tensorflow-datasets installation, although I cannot see why that would matter.

hardware: GM108 cc: 5.0 driver: 460.32.03 cuda version: 11.0 tf: 2.4.1

andrevitorelli commented 3 years ago

Example code 1:

import tensorflow as tf
from tensorflow_graphics.image.transformer import perspective_transform
import numpy as np

image=np.random.uniform(0,1,size=64*64)
transfmatrix = np.identity(3)
image.shape = (1,64,64,1)
transfmatrix.shape = (1,3,3)

imgtf = tf.convert_to_tensor(image)
tmtf  = tf.convert_to_tensor(transfmatrix)
perspective_transform(imgtf,tmtf)

Example code 2:

import tensorflow as tf
import numpy as np
from tensorflow_addons.utils.resource_loader import LazySO

image=np.random.uniform(0,1,size=64*64)
transfmatrix = np.identity(2)
image.shape = (1,64,64,1)
transfmatrix.shape = (1,2,2)

imgtf = tf.convert_to_tensor(image)
tmtf  = tf.convert_to_tensor(transfmatrix)

_resampler_so = LazySO("custom_ops/image/_resampler_ops.so")

_resampler_so.ops.addons_resampler(imgtf,tmtf)

andrevitorelli commented 3 years ago

In more detail, example code 2 fails with:

[...]_resampler_ops.so: undefined symbol: _ZN10tensorflow15shape_inference16InferenceContext8SubshapeENS0_11ShapeHandleExxPS2_
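
For readers unfamiliar with mangled C++ names, the sketch below recovers the qualified function name from the symbol above. It is a deliberately minimal parser (the helper `nested_name` is hypothetical, not part of any library) that only reads the length-prefixed identifiers after `_ZN`. It ignores argument types and substitutions, so a real demangler such as `c++filt` is the better tool when one is available.

```python
def nested_name(mangled: str) -> str:
    """Return the qualified name at the start of an Itanium-ABI mangled symbol.

    Only the simple `_ZN<len><id>...E` prefix is handled. Argument types,
    templates and substitutions are not decoded.
    """
    prefix = "_ZN"
    if not mangled.startswith(prefix):
        raise ValueError("not a nested Itanium-mangled name")
    pos, parts = len(prefix), []
    # Each component is written as <decimal length><identifier>.
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    return "::".join(parts)


symbol = ("_ZN10tensorflow15shape_inference16InferenceContext"
          "8SubshapeENS0_11ShapeHandleExxPS2_")
print(nested_name(symbol))
```

The missing function is therefore `InferenceContext::Subshape` from TensorFlow's shape-inference code, i.e. a symbol that the custom-op library expects the loaded TensorFlow build to export.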

EiffL commented 3 years ago

oh boy >.< I don't know what would be going on..... maybe try to uninstall and reinstall... you seem to have the right version of CUDA for your TF version....

EiffL commented 3 years ago

to make sure all your TF and extension libraries are compiled against each other

andrevitorelli commented 3 years ago

I did that so many times I lost count. I've even tried installing older versions of tfa and tfg-nightly. It works in Colab, and it works here on the CPU. It's bizarre that it stopped working between last week and this one. In any case, I'll keep using the CPU to continue development. Thanks!
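
One common way to stay on the CPU backend without uninstalling anything is to hide the CUDA devices from the process. This is a minimal sketch, assuming the standard `CUDA_VISIBLE_DEVICES` environment variable is honored by the installed CUDA runtime. The TensorFlow lines are left as comments so the snippet runs even where TensorFlow is not installed.

```python
import os

# Hide all CUDA devices from this process. Set this before TensorFlow
# initializes its devices; the simplest way is to set it before the import.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# import tensorflow as tf                      # import only after the line above
# print(tf.config.list_physical_devices("GPU"))  # expected to list no GPUs
```

An alternative inside an already-imported TensorFlow session is `tf.config.set_visible_devices([], "GPU")`, which has to be called before any op has run on the GPU.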

EiffL commented 3 years ago

hummmmm.... I'm gonna try to update all libraries on my machine and see, maybe they broke the nightlies

EiffL commented 3 years ago

Hummm I have upgraded tfa and tfg, and I see no issues on my GPU, a Titan X.

It looks to me like there may be cross-talk between two different versions of TensorFlow on your system, which might explain why it doesn't find that symbol. But I've never seen that.
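
A quick way to test the cross-talk hypothesis is to list which TensorFlow-related distributions are installed in the active environment. The sketch below uses only the standard library. The list of distribution names is an assumption for illustration and may be incomplete, and the helper `installed_versions` is hypothetical.

```python
from importlib import metadata

# Distributions that all provide the same top-level `tensorflow` package.
# Assumed list for illustration; it may not cover every variant.
TF_VARIANTS = ("tensorflow", "tensorflow-gpu", "tensorflow-cpu",
               "tf-nightly", "tf-nightly-gpu", "tf-nightly-cpu")
# Extension packages mentioned in this thread, plus their stable counterparts.
EXTENSIONS = ("tensorflow-addons", "tfa-nightly",
              "tensorflow-graphics", "tfg-nightly")


def installed_versions(names, lookup=metadata.version):
    """Map each installed distribution in `names` to its version string."""
    found = {}
    for name in names:
        try:
            found[name] = lookup(name)
        except metadata.PackageNotFoundError:
            continue  # not installed in this environment
    return found


tf_found = installed_versions(TF_VARIANTS)
print("TensorFlow distributions:", tf_found)
print("Extension distributions:", installed_versions(EXTENSIONS))
if len(tf_found) > 1:
    print("More than one TensorFlow distribution is installed; "
          "they overwrite each other's files.")
```

If more than one entry shows up in the first dictionary, uninstalling all of them and reinstalling a single one is the usual remedy.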

andrevitorelli commented 3 years ago

I have scrapped my entire Python installation, cleaned it up, and rebuilt it from scratch, but it still doesn't work. With tf-nightly-gpu it gives the undefined-symbol error; with tensorflow-gpu the IPython kernel dies because GpuLaunchKernel fails to find a CUDA kernel. I'll keep investigating.

EiffL commented 3 years ago

oooh wow.... ok. so I wouldn't use tf-nightly-gpu, just tensorflow 2.4.1, installed with

$ pip install tensorflow==2.4.1
$ pip install tfg-nightly tfa-nightly

tensorflow-gpu is deprecated.

But actually, coming back to your initial error, I think it only meant that your GPU is too old for the available kernel. Which is ok, we'll get you access to some sweet Nvidia V100s, and for now you can use the CPU backend locally.
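
To illustrate what "too old for the available kernel" means: a CUDA binary embeds machine code (`sm_XY`) for specific architectures and, optionally, PTX (`compute_XY`) that the driver can JIT-compile for newer devices. The sketch below encodes that compatibility rule. The architecture list passed in is hypothetical, since the thread does not say which architectures the tensorflow_addons binary was built for, and `can_run` is an illustrative helper, not a library function.

```python
def can_run(device_cc, built_archs):
    """Return True if a binary built for `built_archs` has a usable kernel.

    device_cc   : (major, minor) compute capability of the GPU, e.g. (5, 0)
    built_archs : strings such as "sm_60" (machine code) or "compute_70" (PTX)
    """
    device_cc = tuple(device_cc)
    for arch in built_archs:
        kind, _, digits = arch.partition("_")
        major, minor = int(digits[:-1]), int(digits[-1])
        # Machine code runs on the same major version with an equal or
        # higher minor version.
        if kind == "sm" and major == device_cc[0] and minor <= device_cc[1]:
            return True
        # PTX can be JIT-compiled for any device of equal or higher capability.
        if kind == "compute" and (major, minor) <= device_cc:
            return True
    return False


# Hypothetical build list that omits compute capability 5.x:
print(can_run((5, 0), ["sm_60", "sm_70", "compute_75"]))
# The same device with a build that includes sm_50:
print(can_run((5, 0), ["sm_50", "sm_60"]))
```

When no entry matches, the CUDA runtime reports "no kernel image is available for execution on the device", which is the message in the original report.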

I don't know why your follow up errors with missing symbols arose though....

EiffL commented 3 years ago

@andrevitorelli did you resolve this in the end ^^'? Or is it really a problem specific to your local GPU?

andrevitorelli commented 3 years ago

No, I just left it as it is. For the time being, I'll test on the CPU, but I hope to be using a better computer soon.

EiffL commented 3 years ago

Ok, I'm going to close this issue for now. Hopefully it won't be a problem when we can run on better GPUs.