NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

source_info tensor not guaranteed to contain correct data #5377

Open · treasan opened this issue 8 months ago

treasan commented 8 months ago

Version

1.35

Describe the bug.

I am using a video reader pipeline as follows:

import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def
def read_decode_pipe(filenames, device='cpu'):
    video = fn.readers.video(
        sequence_length=384,
        filenames=filenames,
        pad_sequences=True,
        device=device
    )
    source_info = fn.get_property(video, key="source_info")
    return video, source_info

And I retrieve data from it using a DALIRaggedIterator:

from nvidia.dali.plugin.pytorch import DALIRaggedIterator, LastBatchPolicy

pipe = read_decode_pipe(
    files,
    batch_size=batch_size,
    device=device,
    device_id=device_id,
    num_threads=n_threads,
)
pipe.build()
it = DALIRaggedIterator(
    pipe,
    output_map=['snippets', 'paths'],
    output_types=[DALIRaggedIterator.SPARSE_LIST_TAG, DALIRaggedIterator.SPARSE_LIST_TAG],
    auto_reset=False,
    last_batch_policy=LastBatchPolicy.PARTIAL
)

for data in it:
    snippets = data[0]['snippets']
    bytes_paths = data[0]['paths']  # <--- might not yet be filled with data
    str_paths = [path.cpu().numpy().tobytes().decode() for path in bytes_paths]

Occasionally, these encoded paths hold no data at the time of decoding: the tensors contain only zeros, and the decoded path strings are useless. Interestingly, when I set a breakpoint at that location and apply the exact same decoding operation in the debug console, the strings are suddenly decoded correctly, presumably because enough time has passed for the tensors to be filled with the actual data. This suggests that the source_info tensors are filled asynchronously, which is unexpected behavior. The pipeline should wait for the data before it gets forwarded to the for loop.
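
For illustration, this is roughly the check that exposes the problem (a sketch only; looks_empty is a hypothetical debugging helper, not part of my actual pipeline):

def looks_empty(path_tensor):
    # Hypothetical debugging helper: the affected source_info tensors come back
    # as byte buffers that are entirely zero, so the decoded string is empty or garbage.
    raw = path_tensor.cpu().numpy().tobytes()
    return len(raw) == 0 or all(b == 0 for b in raw)

for data in it:
    bytes_paths = data[0]['paths']
    broken = [looks_empty(p) for p in bytes_paths]
    if any(broken):
        print(f"{sum(broken)}/{len(broken)} source_info tensors still contain only zeros")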

Minimum reproducible example

No response

Relevant log output

No response

Other/Misc.

No response


JanuszL commented 8 months ago

Hi @treasan,

Can you provide a standalone, self-contained repro? Something like this works for me:

docker run --rm -ti --gpus 'all,"capabilities=compute,utility,video"' ubuntu:22.04

apt update && apt install -y vim wget python3-pip
pip install --extra-index-url https://pypi.nvidia.com/ --upgrade nvidia-dali-cuda120 torch numpy
wget https://github.com/NVIDIA/DALI_extra/raw/main/db/video/sintel/sintel_trailer-720p.mp4
python3 test.py

import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def
from nvidia.dali.plugin.pytorch import DALIRaggedIterator, LastBatchPolicy

files = ["sintel_trailer-720p.mp4"]
batch_size = 3
device = "gpu"
device_id = 0
n_threads = 4

@pipeline_def
def read_decode_pipe(filenames, device="cpu"):
    video = fn.readers.video(
        sequence_length=3, filenames=filenames, pad_sequences=True, device=device
    )
    source_info = fn.get_property(video, key="source_info")
    return video, source_info

pipe = read_decode_pipe(
    files,
    batch_size=batch_size,
    device=device,
    device_id=device_id,
    num_threads=n_threads,
)
pipe.build()
it = DALIRaggedIterator(
    pipe,
    output_map=["snippets", "paths"],
    output_types=[DALIRaggedIterator.SPARSE_LIST_TAG, DALIRaggedIterator.SPARSE_LIST_TAG],
    auto_reset=False,
    last_batch_policy=LastBatchPolicy.PARTIAL,
)

for data in it:
    snippets = data[0]["snippets"]
    bytes_paths = data[0]["paths"]  # <--- might not yet be filled with data
    str_paths = [path.cpu().numpy().tobytes().decode() for path in bytes_paths]
    print(str_paths)
treasan commented 2 months ago

Hello,

sorry for the really long delay. I fixed the bug back then by setting the exec_async parameter to False in the pipeline. So there seems (or seemed) to be a synchronization issue under the hood with the pipeline described in my original post.
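
For reference, this is roughly how the workaround looks (a sketch based on my original pipeline; exec_async is a standard Pipeline constructor argument that @pipeline_def forwards, and disabling it may cost some throughput):

pipe = read_decode_pipe(
    files,
    batch_size=batch_size,
    device=device,
    device_id=device_id,
    num_threads=n_threads,
    exec_async=False,  # disable asynchronous execution; works around the zero-filled source_info tensors
)
pipe.build()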

JanuszL commented 2 months ago

Hi @treasan, thank you for following up on this. I'm not saying that there is no issue, just that without a reliable repro we cannot debug and fix it. If you happen to extract a code snippet that illustrates the problem, that would help us a lot.