NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

DALIDataset reduce function not working #1548

Open viotemp1 opened 4 years ago

viotemp1 commented 4 years ago

Hello, I'm trying to apply `reduce` over a TF dataset created from a DALIDataset, but it does not work (on either GPU or CPU). What could be wrong?

tensorflow 2.0.0
tensorflow-addons 0.6.0
tensorflow-datasets 1.3.0
tensorflow-estimator 2.0.1
tensorflow-gpu 2.0.0
tensorflow-metadata 0.15.0
tensorflow-model-optimization 0.1.3
Keras 2.3.1
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.0
nvidia-dali 0.15.0
nvidia-dali-nightly 0.17.0.dev20191202
nvidia-dali-tf-plugin-nightly 0.17.0.dev20191202

Here is part of the code:

    import numpy as np
    import tensorflow as tf

    from nvidia.dali.pipeline import Pipeline
    import nvidia.dali.ops as ops
    import nvidia.dali.types as types
    import nvidia.dali.tfrecord as tfrec
    import nvidia.dali.plugin.tf as dali_tf

    # tfrecord, tfrecord_idx and BATCH_SIZE are defined earlier in the script

    class TFRecordPipeline(Pipeline):
        def __init__(self, batch_size=1, device='gpu', num_threads=4, device_id=0, seed=0):
            super(TFRecordPipeline, self).__init__(batch_size, num_threads, device_id, seed)
            self.device = device
            self.input = ops.TFRecordReader(
                path=tfrecord, index_path=tfrecord_idx,
                features={
                    'image_raw': tfrec.FixedLenFeature((), tfrec.string, ""),
                    'label': tfrec.FixedLenFeature([1], tfrec.int64, -1),
                    'height': tfrec.FixedLenFeature([1], tfrec.int64, -1),
                    'width': tfrec.FixedLenFeature([1], tfrec.int64, -1),
                    'depth': tfrec.FixedLenFeature([1], tfrec.int64, -1)
                })
            self.decode = ops.ImageDecoder(device='mixed' if device == 'gpu' else 'cpu',
                                           output_type=types.GRAY)
            #self.resize = ops.Resize(device="gpu", resize_shorter=28.)
            self.iter = 0

        def define_graph(self):
            inputs = self.input()
            images, labels = self.decode(inputs["image_raw"]), inputs["label"]
            if self.device == 'gpu':
                labels = labels.gpu()
            return (images, labels)

        def iter_setup(self):
            #print(self.iter)
            #self.iter += 1
            pass

    shapes = [
        (BATCH_SIZE, 28, 28, 1),
        (BATCH_SIZE, 1)]
    dtypes = [
        tf.uint8,  # float32
        tf.int64]

    def train_data_fn(batch_size=1, device='gpu', num_threads=4, device_id=0):
        pipeline = TFRecordPipeline(BATCH_SIZE, device=device,
                                    num_threads=num_threads, device_id=device_id)
        tf_dali_set = dali_tf.DALIDataset(
            pipeline=pipeline,
            batch_size=BATCH_SIZE,
            shapes=shapes,
            dtypes=dtypes,
            device_id=device_id)

        # leftover from an MNIST example (mnist_set is not defined in this snippet):
        #mnist_set = mnist_set.map(lambda features, labels: ({'images': features}, labels))

        #tf_dali_set = tf_dali_set.map(lambda features, labels: (features, labels))
        return tf_dali_set

    train_ds = train_data_fn(batch_size=BATCH_SIZE, device='cpu', num_threads=1, device_id=0)

    train_ds.reduce(np.int64(0), lambda x, _: x + 1)

Errors:

On CPU:

    ---------------------------------------------------------------------------
    InternalError                             Traceback (most recent call last)
    <ipython-input> in <module>
    ----> 1 train_ds.reduce(np.int64(0), lambda x, _: x + 1)

    ~/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py in reduce(self, initial_state, reduce_func)
       1535                 f=reduce_func,
       1536                 output_shapes=structure.get_flat_tensor_shapes(state_structure),
    -> 1537                 output_types=structure.get_flat_tensor_types(state_structure)))
       1538
       1539   def unbatch(self):

    ~/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_dataset_ops.py in reduce_dataset(input_dataset, initial_state, other_arguments, f, output_types, output_shapes, use_inter_op_parallelism, name)
       5049     else:
       5050       message = e.message
    -> 5051     _six.raise_from(_core._status_to_exception(e.code, message), None)
       5052   # Add nodes to the TensorFlow graph.
       5053   if not isinstance(output_types, (list, tuple)):

    ~/anaconda3/envs/tf2/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

    InternalError: DALI daliCopyTensorNTo( &pipeline_handle_, dst, out_id, dataset()->device_type_, dataset()->stream_, false) failed: [/opt/dali/dali/plugin/copy.cu:43] Coping from CPUBackend to device type 1
    Stacktrace (11 entries):
    [frame 0]: /home/viorelublea/anaconda3/envs/tf2/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x693be) [0x7fbccd3f23be]
    [frame 1]: /home/viorelublea/anaconda3/envs/tf2/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x17ea55) [0x7fbccd507a55]
    [frame 2]: /home/viorelublea/anaconda3/envs/tf2/lib/python3.7/site-packages/nvidia/dali/libdali.so(dali::CopyToExternalTensor(dali::Tensor const&, void*, dali::device_type_t, CUstream_st*, bool)+0xd6) [0x7fbccd5080b6]
    [frame 3]: /home/viorelublea/anaconda3/envs/tf2/lib/python3.7/site-packages/nvidia/dali/libdali.so(daliCopyTensorNTo+0x38c) [0x7fbccd4fc9cc]
    [frame 4]: /home/viorelublea/anaconda3/envs/tf2/lib/python3.7/site-packages/nvidia/dali/plugin/libdali_tf_current.so(+0x12c81) [0x7fbccb58ec81]
    [frame 5]: /home/viorelublea/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/../libtensorflow_framework.so.2(tensorflow::data::DatasetBaseIterator::GetNext(tensorflow::data::IteratorContext*, std::vector<tensorflow::Tensor>*, bool*)+0xae) [0x7fbce21e1f4e]
    [frame 6]: /home/viorelublea/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x5e7ff5d) [0x7fbce910bf5d]
    [frame 7]: /home/viorelublea/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/../libtensorflow_framework.so.2(tensorflow::data::BackgroundWorker::WorkerLoop()+0x191) [0x7fbce21dda81]
    [frame 8]: /home/viorelublea/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/../libtensorflow_framework.so.2(+0x167b5cf) [0x7fbce2c095cf]
    [frame 9]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9669) [0x7fbd4e04d669]
    [frame 10]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fbd4df75323]
    [Op:ReduceDataset]

On GPU:

    ---------------------------------------------------------------------------
    InternalError                             Traceback (most recent call last)
    <ipython-input> in <module>
    ----> 1 train_ds.reduce(np.int64(0), lambda x, _: x + 1)

    ~/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py in reduce(self, initial_state, reduce_func)
       1535                 f=reduce_func,
       1536                 output_shapes=structure.get_flat_tensor_shapes(state_structure),
    -> 1537                 output_types=structure.get_flat_tensor_types(state_structure)))
       1538
       1539   def unbatch(self):

    ~/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_dataset_ops.py in reduce_dataset(input_dataset, initial_state, other_arguments, f, output_types, output_shapes, use_inter_op_parallelism, name)
       5049     else:
       5050       message = e.message
    -> 5051     _six.raise_from(_core._status_to_exception(e.code, message), None)
       5052   # Add nodes to the TensorFlow graph.
       5053   if not isinstance(output_types, (list, tuple)):

    ~/anaconda3/envs/tf2/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

    InternalError: DALI daliCopyTensorNTo( &pipeline_handle_, dst, out_id, dataset()->device_type_, dataset()->stream_, false) failed: CUDA runtime API error cudaErrorInvalidValue (11): invalid argument [Op:ReduceDataset]
awolant commented 4 years ago

Hi, thanks for the question.

This API was not really tested in eager mode. It looks like TF does something different there, which is why there are problems when it tries to copy data between devices. Additionally, a dataset based on a DALI pipeline is infinite, so you need something like `take(10)` to be able to run `reduce`.
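
For example, something like this (an untested sketch, using `train_ds` from your snippet):

    import numpy as np

    # A DALI-backed dataset has no end, so bound it explicitly first;
    # otherwise reduce would try to consume it forever.
    train_ds_bounded = train_ds.take(10)
    n_batches = train_ds_bounded.reduce(np.int64(0), lambda x, _: x + 1)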

I was able to run something similar with tf.Session with eager mode disabled. I will look more into this and let you know.

viotemp1 commented 4 years ago

Hi, thanks for the answer. I did not know that a dataset based on a DALI pipeline is infinite; I also found this out today by looping with `take(1)`. Anyhow, I tried disabling eager mode in TF 2.0 (with `tf.compat.v1.disable_eager_execution()`), but this breaks so many other things for me.

    train_ds10 = train_ds.take(10)
    train_ds10
    # <TakeDataset shapes: ((128, 28, 28, 1), (128, 1)), types: (tf.uint8, tf.int64)>

    train_ds_len = train_ds10.reduce(np.int64(0), lambda x, _: x + 1)
    train_ds_len
    # <tf.Tensor 'ReduceDataset_8:0' shape=() dtype=int64>

No error indeed, but no result either (on GPU or CPU). Thanks

awolant commented 4 years ago

To get the results in this mode you need something like:

    with tf.compat.v1.Session() as sess:  # tf.Session is only available via compat.v1 in TF 2.0
        print(sess.run(train_ds_len))

but I agree that this is more of a workaround than a solution. I'll look into this further and update this thread when I know more.
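
For completeness, the whole workaround might look like this (an untested sketch; `train_data_fn` and `BATCH_SIZE` as in the original report):

    import numpy as np
    import tensorflow as tf

    # Eager execution has to be disabled before the dataset is created.
    tf.compat.v1.disable_eager_execution()

    train_ds = train_data_fn(batch_size=BATCH_SIZE, device='cpu',
                             num_threads=1, device_id=0)
    # In graph mode, reduce only builds an op; running it in a session
    # produces the actual value.
    train_ds_len = train_ds.take(10).reduce(np.int64(0), lambda x, _: x + 1)

    with tf.compat.v1.Session() as sess:
        print(sess.run(train_ds_len))  # expected to print 10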

ben0it8 commented 4 years ago

@awolant Is there any update about using DALI with TF 2.0/2.1 eager mode? :-)

JanuszL commented 4 years ago

@ben0it8 - we are currently pursuing other project goals, so there is no update on using DALI in eager mode. As soon as we can get back to it, we will update you.