NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

How can I make DALI load tf.SequenceExample records? #3824

Open davors72 opened 2 years ago

davors72 commented 2 years ago

Hi,

Running into an issue with DALI. I'm working with a dataset stored in the format of https://www.tensorflow.org/api_docs/python/tf/train/SequenceExample. Other tfrecord readers handle it fine, such as https://github.com/vahidk/tfrecord.

The error occurs when reading the index file: Assert on "p != nullptr" failed: Error reading from a file {FILE}. The file is valid; however, the index produced is odd in that it is a single line, 0 152207822, even though the file contains many records. The indexer in the above tool produces the same result, but that tool can still load the file fine.
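One way to check whether a single-line index is plausible is to walk the record framing directly. The following is a minimal sketch, not DALI's indexer: it assumes the standard TFRecord layout (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload) and assumes the index holds one "offset size" line per record, as the single line above suggests. The names build_index and write_index are hypothetical, and CRCs are not verified.

```python
import struct


def build_index(path):
    """Walk the TFRecord framing and return one (offset, size) pair per record."""
    entries = []
    with open(path, "rb") as f:
        f.seek(0, 2)
        file_size = f.tell()
        f.seek(0)
        while True:
            offset = f.tell()
            header = f.read(8)
            if len(header) < 8:
                break  # end of file (or a truncated header)
            (length,) = struct.unpack("<Q", header)
            # 8-byte length + 4-byte length CRC + payload + 4-byte payload CRC
            size = 8 + 4 + length + 4
            if offset + size > file_size:
                raise ValueError(f"record at offset {offset} runs past end of file")
            f.seek(offset + size)  # skip to the next record without reading the payload
            entries.append((offset, size))
    return entries


def write_index(entries, index_path):
    """Write the entries as one 'offset size' line per record (assumed index format)."""
    with open(index_path, "w") as out:
        for offset, size in entries:
            out.write(f"{offset} {size}\n")
```

If this reports a single entry whose size equals the file size, the file really does hold one very large record; if it reports many entries, the indexer is the part to look at.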

The failing part:

import nvidia.dali.fn as fn
import nvidia.dali.tfrecord as tfrec

inputs = fn.readers.tfrecord(
    path=all_files,
    index_path=indices,
    features={
        # Doesn't actually matter which feature I try to pull out.
        "image/encoded": tfrec.VarLenFeature(tfrec.string, ""),
    },
    num_shards=1, name="Reader")

Error location: dali/dali/operators/reader/loader/indexed_file_loader.h:76

DALI version comes from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch using tag 22.03-py3

Any advice? Are tf.SequenceExample records just not supported in DALI? Are they on the roadmap?

FYI: The PyTorch loader above succeeds but takes in a separate argument for sequential features.

tfrecord.tfrecord_loader(path, None, sequence_description={"image/encoded": "byte"})

Thanks!

JanuszL commented 2 years ago

Hi @davors72,

Can you provide an example file in this format, along with a self-contained repro script, so we can run this on our end? Most likely this format is not supported, because it uses a different schema from the one DALI's TFRecord reader currently handles. In the meantime, you can try the external_source operator in parallel mode and use https://github.com/vahidk/tfrecord to read the records.
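A minimal sketch of that workaround might look like the following. It is an illustration under stated assumptions, not a tested recipe for this dataset: FrameSource and build_pipeline are hypothetical names, frames is assumed to be a list of already-extracted encoded images (bytes), for example pulled out of the records with the tfrecord package beforehand, and the frames are assumed to be in a format that fn.decoders.image can decode. DALI is imported lazily so the per-sample callable can be exercised without a GPU.

```python
import numpy as np


class FrameSource:
    """Per-sample callable for fn.external_source(batch=False, parallel=True)."""

    def __init__(self, frames, batch_size):
        self.frames = frames  # assumed: list of bytes, one encoded image each
        # Only serve full batches; the trailing partial batch is dropped.
        self.full_iterations = len(frames) // batch_size

    def __call__(self, sample_info):
        if sample_info.iteration >= self.full_iterations:
            raise StopIteration  # signals the end of the epoch to DALI
        data = self.frames[sample_info.idx_in_epoch]
        return np.frombuffer(data, dtype=np.uint8)


def build_pipeline(frames, batch_size=4):
    # Imported here so FrameSource stays usable where DALI is not installed.
    from nvidia.dali import pipeline_def, fn, types

    @pipeline_def(batch_size=batch_size, num_threads=2, device_id=0,
                  py_num_workers=2, py_start_method="spawn")
    def pipe():
        encoded = fn.external_source(
            source=FrameSource(frames, batch_size),
            batch=False, parallel=True, dtype=types.UINT8)
        # Decoding happens on the GPU; only the record parsing stays in Python.
        return fn.decoders.image(encoded, device="mixed")

    return pipe()
```

Holding all frames in memory keeps the sketch short; for large files it would likely be better to have the callable open the file lazily inside each worker process.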

davors72 commented 2 years ago

Hi @JanuszL, if you have gsutil (the Google Cloud Storage command-line tool) installed, you can download a sample from the dataset I was looking at with this command: gsutil cp gs://objectron/v1/sequences/book/book_train-01200-of-01324 .

Trying to load that file with the reader call above should reproduce the error.

davors72 commented 2 years ago

Or the instructions here for the sequential version https://github.com/google-research-datasets/Objectron

JanuszL commented 2 years ago

Hi @davors72,

The schema that this dataset implements is not supported by DALI. As I mentioned, it would be best to use the external_source operator. We would also be more than happy to accept a PR that extends the TFRecord reader to support this schema.