NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Compressed TFRecord and indexes stored on s3 #5513

Closed AlanRace closed 3 months ago

AlanRace commented 3 months ago

Version

1.38

Describe the bug.

Hello, I have been trying to use DALI to parse our large existing dataset, which is stored as ZLIB-compressed TFRecords on S3. I have encountered a few issues with this setup:

Minimum reproducible example

from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn
import nvidia.dali.tfrecord as tfrec
import nvidia.dali.types as types

batch_size = 10

pipe = Pipeline(batch_size=batch_size, num_threads=4, device_id=0)
with pipe:
    # Both the TFRecord file and its index file are read directly from S3.
    inputs = fn.readers.tfrecord(
        path="s3://path/to/data/data1.tfrecord",
        index_path="s3://path/to/data/data1.idx",
        features={
            "a": tfrec.FixedLenFeature((), tfrec.string, ""),
            "b": tfrec.FixedLenFeature((), tfrec.string, ""),
        },
        random_shuffle=True,
    )

    pipe.set_outputs(inputs["a"], inputs["b"])

pipe.build()

Relevant log output

No response

Other/Misc.

No response

jantonguirao commented 3 months ago

Thank you for reporting this, @AlanRace. I can confirm it is indeed a bug, and we'll push a fix shortly. ZLIB-compressed TFRecords, unfortunately, are not supported at the moment.
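
Until compressed files are supported, one possible workaround is to decompress the TFRecords offline and regenerate their index files before pointing the reader at them. The following is only a minimal sketch using the Python standard library, not an official DALI tool. The helper names `decompress_tfrecord` and `write_index` are hypothetical, and the index layout (one "offset size" line per record) is an assumption based on what DALI's `tfrecord2idx` script produces. It also does not validate the record CRCs.

```python
import struct
import zlib


def decompress_tfrecord(src_path, dst_path, chunk_size=1 << 20):
    """Stream-decompress a ZLIB- or GZIP-compressed TFRecord file (hypothetical helper)."""
    # wbits = MAX_WBITS | 32 lets zlib auto-detect a zlib or gzip header.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 32)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(decomp.decompress(chunk))
        dst.write(decomp.flush())


def write_index(tfrecord_path, index_path):
    """Write one "offset size" line per record and return the record count.

    Assumes the standard TFRecord framing: uint64 length, uint32 length CRC,
    payload, uint32 payload CRC. The "offset size" layout is an assumption.
    """
    count = 0
    with open(tfrecord_path, "rb") as rec, open(index_path, "w") as idx:
        while True:
            offset = rec.tell()
            header = rec.read(8)
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated record header at offset %d" % offset)
            (length,) = struct.unpack("<Q", header)
            # Skip the length CRC (4 bytes), the payload, and the payload CRC (4 bytes).
            rec.seek(4 + length + 4, 1)
            idx.write(f"{offset} {rec.tell() - offset}\n")
            count += 1
    return count
```

The decompressed `.tfrecord` file and the new `.idx` file could then be uploaded next to each other and passed as `path` and `index_path` to `fn.readers.tfrecord`, as in the example above.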

jantonguirao commented 3 months ago

https://github.com/NVIDIA/DALI/pull/5515 should fix this issue. It should be available through nightly builds a day or two after it gets merged.