Hi @mausset,
Thank you for reaching out. Can you tell us (if you know, of course) how the data is stored in the TFRecord? Is each record a set of frame arrays? Do you have any sample data you can share, or any reference code we can refer to that handles this data?
Thanks for the quick response!
The dataset comes with two .json files, dataset_info.json and features.json, that describe how the data is structured. I've pulled out the parts that seem most pertinent to how the data is laid out and should be parsed. I'm not sure what frame arrays you are referring to otherwise; it looks like the individual video frames are PNG-encoded.
dataset_info.json
The scene is simulated for 2 seconds, with the physical properties of the
objects kept at the default of friction=0.5, restitution=0.5 and density=1.0.
The dataset contains approx 10k videos rendered at 256x256 pixels and 12fps.
Each sample contains the following video-format data:
(s: sequence length, h: height, w: width)
- "video": (s, h, w, 3) [uint8]
The RGB frames.
- "segmentations": (s, h, w, 1) [uint8]
Instance segmentation as per-pixel object-id with background=0.
Note: because of this the instance IDs used here are one higher than their
corresponding index in sample["instances"].
...
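If I read that note correctly, mapping a segmentation pixel back to sample["instances"] just means subtracting one from its ID; a small sketch, where sample is assumed to be one decoded record:

import numpy as np

t = 0                                      # any frame index
seg = sample["segmentations"][t, :, :, 0]  # (h, w) uint8 per-pixel object IDs
foreground = seg > 0                       # ID 0 is the background
instance_idx = seg[foreground].astype(np.int64) - 1  # indices into sample["instances"]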
features.json
Key: type
Value: tensorflow_datasets.core.features.features_dict.FeaturesDict
Key: content
Value: {
  "features": {
    "segmentations": {
      "pythonClassName": "tensorflow_datasets.core.features.sequence_feature.Sequence",
      "jsonFeature": {
        "json": "{\"feature\": {\"type\": \"tensorflow_datasets.core.features.image_feature.Image\", \"content\": \"{\\n \\\"shape\\\": {\\n \\\"dimensions\\\": [\\n \\\"256\\\",\\n \\\"256\\\",\\n \\\"1\\\"\\n ]\\n },\\n \\\"dtype\\\": \\\"uint8\\\"\\n}\", \"proto_cls\": \"tensorflow_datasets.ImageFeature\"}, \"length\": 24}"
      }
    },
    "background": {
      "pythonClassName": "tensorflow_datasets.core.features.text_feature.Text",
      "text": {}
    },
    "normal": {
      "pythonClassName": "tensorflow_datasets.core.features.video_feature.Video",
      "video": {
        "shape": {
          "dimensions": [
            "24",
            "256",
            "256",
            "3"
          ]
        },
        "dtype": "uint16",
        "encodingFormat": "png"
      }
    },
    ...
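For what it's worth, tfds parses this spec into a FeaturesDict that can be printed directly (a small sketch, using the same bucket path as the loading code below and the 256x256 config from dataset_info.json):

import tensorflow_datasets as tfds

# Read the dataset info (including the parsed feature spec) from the public
# Kubric bucket; this mirrors what features.json describes.
ds, ds_info = tfds.load("movi_e/256x256:1.0.0",
                        data_dir="gs://kubric-public/tfds", with_info=True)
print(ds_info.features)            # the full FeaturesDict
print(ds_info.features["video"])   # shape/dtype of the PNG-encoded frames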
I don't really have any reference code, other than the fact that tensorflow_datasets.load is able to parse out the structure, e.g. the following code from this repo:
import argparse

import tensorflow_datasets as tfds
import torchvision.utils as vutils
from torchvision import transforms
from tqdm import tqdm

parser = argparse.ArgumentParser()
parser.add_argument('--out_path', default='MOVi/')
parser.add_argument('--level', default='e', help='c or e')
parser.add_argument('--split', default='train', help='train, validation or test')
parser.add_argument('--version', default='1.0.0')
parser.add_argument('--image_size', type=int, default=128)
parser.add_argument('--max_num_objs', type=int, default=25)
args = parser.parse_args()

# tfds handles the Sequence(Image) features: each record comes back with its
# PNG frames already decoded and stacked into arrays.
ds, ds_info = tfds.load(f"movi_{args.level}/{args.image_size}x{args.image_size}:{args.version}",
                        data_dir="gs://kubric-public/tfds", with_info=True)
train_iter = iter(tfds.as_numpy(ds[args.split]))
to_tensor = transforms.ToTensor()

b = 0
print('Please be patient; it is usually very slow.')
for record in tqdm(train_iter):
    video = record['video']  # (24, h, w, 3) uint8
Perhaps I should just use tensorflow_datasets if this is outside the purview of NVIDIA DALI.
Hi @mausset,
I see it uses "tensorflow_datasets.core.features.sequence_feature.Sequence", which makes the loader treat consecutive records as part of the same sample (which then needs to be stacked). DALI, in contrast, treats each record as a separate sample. What you can do is use the tfrecord library together with DALI's external_source operator to read the data, and then use DALI to decode and process it further.
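Roughly, something like this (an untested sketch; here I use tf.data just to read the raw records, though the standalone tfrecord library can serve the same role, and the "video" key and 24-frame length come from the features.json you posted):

import numpy as np
import tensorflow as tf
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

SEQ_LEN = 24  # frames per video, per the features.json above

def frame_batches(tfrecord_files):
    # Sequence(Image) is serialized as a repeated bytes feature:
    # one PNG string per frame under the "video" key.
    desc = {"video": tf.io.FixedLenFeature([SEQ_LEN], tf.string)}
    for raw in tf.data.TFRecordDataset(tfrecord_files):
        example = tf.io.parse_single_example(raw, desc)
        # One uint8 buffer per encoded frame; DALI decodes each as a sample.
        yield [np.frombuffer(png, dtype=np.uint8)
               for png in example["video"].numpy()]

@pipeline_def(batch_size=SEQ_LEN, num_threads=4, device_id=0)
def movi_pipeline(files):
    encoded = fn.external_source(source=frame_batches(files), batch=True)
    # With batch_size == SEQ_LEN, one pipeline run decodes one full video.
    return fn.decoders.image(encoded, device="mixed")

pipe = movi_pipeline(["movi_e-train.tfrecord-00000-of-01024"])  # placeholder path
pipe.build()
(frames,) = pipe.run()  # 24 decoded RGB frames, i.e. one video

The segmentations can be read the same way under the "segmentations" key, and per-frame processing (resize, normalization, etc.) can be added in the pipeline before the frames are stacked back into a (24, h, w, c) sequence on the framework side.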
Ah! So it's time to use external_source then. Thank you very much for the guidance! Have a great day : )
I'm trying to use DALI's TFRecord functionality to load the video and segmentation portions of the MOVi-E dataset. According to the documentation for MOVi-E, the "video" and "segmentations" features have the following signatures:
Each sample contains the following video-format data:
(s: sequence length, h: height, w: width)
- "video": (s, h, w, 3) [uint8]
- "segmentations": (s, h, w, 1) [uint8]
Each video and its corresponding segmentations should have 24 frames, i.e. s=24. I seem to be able to load the features just fine, but when I decode them to images with decoders.image I only receive one frame (presumably the first) for both the video and the segmentation. Is there another decoder or method to deal with this data in DALI? My current pipeline looks roughly like this (placeholder paths; the feature spec follows the features.json above):
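import nvidia.dali.fn as fn
import nvidia.dali.tfrecord as tfrec
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def movi_pipeline():
    inputs = fn.readers.tfrecord(
        path="movi_e-train.tfrecord-00000-of-01024",   # placeholder path
        index_path="movi_e-train.idx-00000-of-01024",  # built with tfrecord2idx
        features={
            "video": tfrec.VarLenFeature(tfrec.string, ""),
            "segmentations": tfrec.VarLenFeature(tfrec.string, ""),
        })
    # This is where I only get a single frame back per sample:
    video = fn.decoders.image(inputs["video"], device="mixed")
    segmentations = fn.decoders.image(inputs["segmentations"], device="mixed")
    return video, segmentations

pipe = movi_pipeline()
pipe.build()
video, segmentations = pipe.run()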
Thanks in advance!