NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Decoding a sequence of images stored with TFRecords #5621

Closed mausset closed 1 month ago

mausset commented 1 month ago

I'm trying to use DALI's TFRecords functionality to load the video and segmentation portions of the MOVi-E dataset. According to the documentation for MOVi-E, the "video" and "segmentations" features have the following signatures:

Each sample contains the following video-format data: (s: sequence length, h: height, w: width)

- "video": (s, h, w, 3) [uint8]
- "segmentations": (s, h, w, 1) [uint8]

Each video and its corresponding segmentations should have 24 frames, i.e. s=24. I seem to be able to load the features just fine, but when I decode them to images with decoders.image I only receive one frame (presumably the first frame) for both the video and the segmentations. Is there any other decoder / method in DALI to deal with this data? Below is my current pipeline:

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.tfrecord as tfrec
import nvidia.dali.types as types


@pipeline_def # type: ignore
def movi_pipe(
    path="",
    index_path="",
    resolution=(224, 224),
    shuffle=True,
    shard_id=0,
    num_shards=1,
):
    features = { # type: ignore
        "video": tfrec.FixedLenFeature((), tfrec.string, ""), # type: ignore
        "segmentations": tfrec.FixedLenFeature((), tfrec.string, ""), # type: ignore
    }

    inputs = fn.readers.tfrecord( # type: ignore
        path=path,
        index_path=index_path,
        features=features,
        shard_id=shard_id,
        num_shards=num_shards,
        random_shuffle=shuffle,
        initial_fill=8,
        name="Reader",
    )

    video_bytes = inputs["video"] # type: ignore
    segmentations_bytes = inputs["segmentations"] # type: ignore

    video = fn.decoders.image(video_bytes, device="cpu") # type: ignore
    segmentations = fn.decoders.image(segmentations_bytes, device="cpu", output_type=types.DALIImageType.GRAY) # type: ignore

    video = fn.resize(video, size=resolution) # type: ignore
    segmentations = fn.resize(segmentations, size=resolution, antialias=False, interp_type=types.DALIInterpType.INTERP_NN) # type: ignore

    coin = fn.random.coin_flip(probability=0.5)
    mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
    std = [0.229 * 255, 0.224 * 255, 0.225 * 255]
    video = fn.crop_mirror_normalize( # type: ignore
        video, # type: ignore
        dtype=types.FLOAT, # type: ignore
        output_layout="CHW",
        mean=mean,
        std=std,
        mirror=coin,
    )

    return video, segmentations

Thanks in advance!

JanuszL commented 1 month ago

Hi @mausset,

Thank you for reaching out. Can you tell us (if you know, of course) how the data is stored in the TFRecord? Is each record a set of frame arrays? Do you have any sample data you can share, or any reference code we can refer to that handles this data?

mausset commented 1 month ago

Thanks for the quick response!

The dataset comes with two .json files, dataset_info.json and features.json, that give some description of the structure. I've tried to pull out the most pertinent information from them about how the data is laid out and should be parsed. I'm not sure what frame arrays you are referring to otherwise. It seems like the videos are PNG-encoded.

dataset_info.json

The scene is simulated for 2 seconds, with the physical properties of the
objects kept at the default of friction=0.5, restitution=0.5 and density=1.0.

The dataset contains approx 10k videos rendered at 256x256 pixels and 12fps.

Each sample contains the following video-format data:
(s: sequence length, h: height, w: width)

- "video": (s, h, w, 3) [uint8]
  The RGB frames.
- "segmentations": (s, h, w, 1) [uint8]
  Instance segmentation as per-pixel object-id with background=0.
  Note: because of this the instance IDs used here are one higher than their
  corresponding index in sample["instances"].
...

features.json

Key: type
Value: tensorflow_datasets.core.features.features_dict.FeaturesDict
Key: content
Value: {
  "features": {
    "segmentations": {
      "pythonClassName": "tensorflow_datasets.core.features.sequence_feature.Sequence",
      "jsonFeature": {
        "json": "{\"feature\": {\"type\": \"tensorflow_datasets.core.features.image_feature.Image\", \"content\": \"{\\n  \\\"shape\\\": {\\n    \\\"dimensions\\\": [\\n      \\\"256\\\",\\n      \\\"256\\\",\\n      \\\"1\\\"\\n    ]\\n  },\\n  \\\"dtype\\\": \\\"uint8\\\"\\n}\", \"proto_cls\": \"tensorflow_datasets.ImageFeature\"}, \"length\": 24}"
      }
    },
    "background": {
      "pythonClassName": "tensorflow_datasets.core.features.text_feature.Text",
      "text": {}
    },
    "normal": {
      "pythonClassName": "tensorflow_datasets.core.features.video_feature.Video",
      "video": {
        "shape": {
          "dimensions": [
            "24",
            "256",
            "256",
            "3"
          ]
        },
        "dtype": "uint16",
        "encodingFormat": "png"
      }
...

I don't really have any reference code, other than tensorflow_datasets.load being able to parse out the structure, e.g. the following code from this repo:

import argparse

import tensorflow_datasets as tfds
import torchvision.utils as vutils

from torchvision import transforms
from tqdm import tqdm

parser = argparse.ArgumentParser()

parser.add_argument('--out_path', default='MOVi/')
parser.add_argument('--level', default='e', help='c or e')
parser.add_argument('--split', default='train', help='train, validation or test')

parser.add_argument('--version', default='1.0.0')
parser.add_argument('--image_size', type=int, default=128)
parser.add_argument('--max_num_objs', type=int, default=25)

args = parser.parse_args()

ds, ds_info = tfds.load(f"movi_{args.level}/{args.image_size}x{args.image_size}:{args.version}", data_dir="gs://kubric-public/tfds", with_info=True)
train_iter = iter(tfds.as_numpy(ds[args.split]))

to_tensor = transforms.ToTensor()

b = 0
print('Please be patient; it is usually very slow.')
for record in tqdm(train_iter):
    video = record['video']

mausset commented 1 month ago

Perhaps I should just use tensorflow_datasets if this is outside the purview of NVIDIA DALI.

JanuszL commented 1 month ago

Hi @mausset,

I see it uses "tensorflow_datasets.core.features.sequence_feature.Sequence", which makes the loader treat consecutive records as parts of the same sample (which then need to be stacked). DALI, by contrast, treats each record as a separate sample. What you can do is use the tfrecord library with DALI's external_source operator to read the data, and then use DALI to decode it and process it further.
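
For example, something along these lines could work. This is only an untested sketch: it assumes each record stores its 24 "video" and "segmentations" frames as repeated bytes features of individually PNG-encoded images (the way tensorflow_datasets typically serializes Sequence(Image)/Video features), it reads the raw records with TensorFlow's Example parser rather than the tfrecord package (an equivalent reader would do), and movi_e.tfrecord is just a placeholder path. Each encoded frame is fed to DALI as its own sample through external_source, so fn.decoders.image does the actual decoding:

import numpy as np
import tensorflow as tf

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

SEQ_LEN = 24  # frames per clip in MOVi-E


def encoded_frames(tfrecord_path):
    # Yield one (video_png, segmentations_png) pair per frame; a clip is
    # SEQ_LEN consecutive samples. The generator makes a single pass over
    # the file, so pipe.run() raises StopIteration once the data runs out.
    for raw in tf.data.TFRecordDataset(tfrecord_path).as_numpy_iterator():
        example = tf.train.Example()
        example.ParseFromString(raw)
        feats = example.features.feature
        video_frames = feats["video"].bytes_list.value        # 24 encoded PNGs (assumed layout)
        seg_frames = feats["segmentations"].bytes_list.value  # 24 encoded PNGs (assumed layout)
        for v, s in zip(video_frames, seg_frames):
            yield np.frombuffer(v, dtype=np.uint8), np.frombuffer(s, dtype=np.uint8)


@pipeline_def
def movi_external_pipe(tfrecord_path="", resolution=(224, 224)):
    video_bytes, seg_bytes = fn.external_source(
        source=encoded_frames(tfrecord_path),
        num_outputs=2,
        batch=False,  # the source yields single samples; DALI assembles the batches
    )
    video = fn.decoders.image(video_bytes, device="cpu")
    seg = fn.decoders.image(seg_bytes, device="cpu",
                            output_type=types.DALIImageType.GRAY)
    video = fn.resize(video, size=resolution)
    seg = fn.resize(seg, size=resolution, antialias=False,
                    interp_type=types.DALIInterpType.INTERP_NN)
    return video, seg


# With batch_size=SEQ_LEN and no shuffling, each batch holds exactly one clip,
# so regrouping the decoded frames into (s, h, w, c) is a simple reshape.
pipe = movi_external_pipe(tfrecord_path="movi_e.tfrecord",  # placeholder path
                          batch_size=SEQ_LEN, num_threads=2, device_id=0)
pipe.build()
video_frames, seg_frames = pipe.run()

Alternatively, the frames could be decoded inside the external source itself (e.g. with PIL) and fed to DALI as ready (24, h, w, c) arrays, which keeps the whole clip as a single sample at the cost of doing the PNG decoding in Python.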

mausset commented 1 month ago

Ah! So it's time to use external_source then. Thank you very much for the guidance! Have a great day : )