NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Error in trying out DALI 3D Augmentation code from NVIDIA #4773

Closed dyhan316 closed 1 year ago

dyhan316 commented 1 year ago

Hello, I am trying to use this code from NVIDIA on our data to accelerate augmentation on the GPU ((link)).

I am using the dataloader here: url

(get_dataloader_fn was slightly modified to fit our dataset, as shown in the code below.)

import os  # needed by load_data (os.path.join)
import numpy as np
#from data_loading.dali_loader import fetch_dali_loader
import dali_loader  # imported from a local .py file
from sklearn.model_selection import KFold
#from utils.utils import get_split, load_data  # defined directly in this code instead
import glob

def get_split(data, idx):
    return list(np.array(data)[idx])

def load_data(path, files_pattern):
    return sorted(glob.glob(os.path.join(path, files_pattern)))

def get_dataloader_fn(*, data_dir: str, batch_size: int, precision: str):
    kwargs = {
        "dim": 3,
        "gpus": 1,
        "seed": 0,
        "num_workers": 8,
        "meta": None,
        "oversampling": 0,
        "benchmark": False,
        "patch_size": [128, 128, 128],
    }

    #======modified=======#
    #imgs, lbls = load_data(data_dir, "*_x.npy"), load_data(data_dir, "*_y.npy")
    imgs = load_data(data_dir, "*.npy")
    lbls = np.zeros(len(imgs)) #just set all labels to zero 
    #=======================#

    kfold = KFold(n_splits=5, shuffle=True, random_state=12345)
    _, val_idx = list(kfold.split(imgs))[2]
    imgs, lbls = get_split(imgs, val_idx), get_split(lbls, val_idx)
    dataloader = fetch_dali_loader(imgs, lbls, batch_size, "bermuda", **kwargs)  # kwargs: already defined at the top

    def _dataloader_fn():
        for i, batch in enumerate(dataloader):
            fname = [f"{i}_{j}" for j in range(batch_size)]
            img = batch["image"].numpy()
            if precision and "fp16" in precision:  # guard: precision may be None
                img = img.astype(np.half)
            img = {"INPUT__0": img}
            lbl = {"OUTPUT__0": batch["label"].squeeze(1).numpy().astype(int)}
            yield fname, img, lbl

    return _dataloader_fn

With this modified code, I tried to load the data as follows:

get_dataloader_fn(data_dir = "/hpcgpfs01/scratch/dyhan316/testing_4_dali", 
                 batch_size = 32, precision=None)

However, when I ran it, I got the following error:

/sdcc/u/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/ops.py:653: DeprecationWarning: WARNING: `numpy_reader` is now deprecated. Use `readers.numpy` instead.
In DALI 1.0 all readers were moved into a dedicated :mod:`~nvidia.dali.fn.readers`
submodule and renamed to follow a common pattern. This is a placeholder operator with identical
functionality to allow for backward compatibility.
  op_instances.append(_OperatorInstance(input_set, self, **kwargs))
/sdcc/u/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/ops.py:653: DeprecationWarning: WARNING: `numpy_reader` is now deprecated. Use `readers.numpy` instead.
In DALI 1.0 all readers were moved into a dedicated :mod:`~nvidia.dali.fn.readers`
submodule and renamed to follow a common pattern. This is a placeholder operator with identical
functionality to allow for backward compatibility.
  op_instances.append(_OperatorInstance(input_set, self, **kwargs))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 get_dataloader_fn(data_dir = "/hpcgpfs01/scratch/dyhan316/testing_4_dali", 
      2                  batch_size = 32, precision=None)

Input In [45], in get_dataloader_fn(data_dir, batch_size, precision)
     35 imgs, lbls = get_split(imgs, val_idx), get_split(lbls, val_idx)
     37 import pdb ; pdb.set_trace()
---> 38 dataloader = fetch_dali_loader(imgs, lbls, batch_size, "bermuda", **kwargs)  # kwargs: already defined at the top
     40 def _dataloader_fn():
     41     for i, batch in enumerate(dataloader):

Input In [6], in fetch_dali_loader(imgs, lbls, batch_size, mode, **kwargs)
    322 device_id = int(os.getenv("LOCAL_RANK", "0"))
    323 pipe = pipeline(batch_size, kwargs["num_workers"], device_id, **pipe_kwargs)
--> 324 return LightningWrapper(
    325     pipe,
    326     auto_reset=True,
    327     reader_name="ReaderX",
    328     output_map=output_map,
    329     dynamic_shape=dynamic_shape,
    330 )

Input In [6], in LightningWrapper.__init__(self, pipe, **kwargs)
    263 def __init__(self, pipe, **kwargs):
--> 264     super().__init__(pipe, **kwargs)

File ~/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/plugin/pytorch.py:194, in DALIGenericIterator.__init__(self, pipelines, output_map, size, reader_name, auto_reset, fill_last_batch, dynamic_shape, last_batch_padded, last_batch_policy, prepare_first_batch)
    192 if self._prepare_first_batch:
    193     try:
--> 194         self._first_batch = DALIGenericIterator.__next__(self)
    195         # call to `next` sets _ever_consumed to True but if we are just calling it from
    196         # here we should set if to False again
    197         self._ever_consumed = False

File ~/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/plugin/pytorch.py:211, in DALIGenericIterator.__next__(self)
    208     return batch
    210 # Gather outputs
--> 211 outputs = self._get_outputs()
    213 data_batches = [None for i in range(self._num_gpus)]
    214 for i in range(self._num_gpus):

File ~/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/plugin/base_iterator.py:297, in _DaliBaseIterator._get_outputs(self)
    295     for p in self._pipes:
    296         with p._check_api_type_scope(types.PipelineAPIType.ITERATOR):
--> 297             outputs.append(p.share_outputs())
    298 except StopIteration as e:
    299     # in case ExternalSource returns StopIteration
    300     if self._size < 0 and self._auto_reset == "yes":

File ~/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/pipeline.py:1000, in Pipeline.share_outputs(self)
    998 self._batches_to_consume -= 1
    999 self._gpu_batches_to_consume -= 1
-> 1000 return self._pipe.ShareOutputs()

RuntimeError: Critical error in pipeline:
Error when executing CPU operator NumpyReader, instance name: "ReaderY", encountered:
[/opt/dali/dali/util/std_file.cc:29] Assert on "fp_ != nullptr" failed: Could not open file 0.0: No such file or directory
Stacktrace (10 entries):
[frame 0]: /sdcc/u/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/libdali.so(+0xc151b) [0x2ae98288d51b]
[frame 1]: /sdcc/u/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/libdali.so(+0x1e7fad) [0x2ae9829b3fad]
[frame 2]: /sdcc/u/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/libdali.so(dali::FileStream::Open(std::string const&, bool, bool)+0x13d) [0x2ae9829a44ad]
[frame 3]: /sdcc/u/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/libdali_operators.so(+0x3bd65f6) [0x2ae9a1de45f6]
[frame 4]: /sdcc/u/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/libdali_operators.so(+0x3d5d11a) [0x2ae9a1f6b11a]
[frame 5]: /sdcc/u/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/libdali_operators.so(+0x3d5e0f2) [0x2ae9a1f6c0f2]
[frame 6]: /sdcc/u/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/libdali_operators.so(+0x3d5c115) [0x2ae9a1f6a115]
[frame 7]: /sdcc/u/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/nvidia/dali/libdali_operators.so(+0x4644fa0) [0x2ae9a2852fa0]
[frame 8]: /lib64/libpthread.so.0(+0x7dc5) [0x2ae9438e5dc5]
[frame 9]: /lib64/libc.so.6(clone+0x6d) [0x2ae943bf176d]

Current pipeline object is no longer valid.

Is this because the original code uses legacy APIs (for example, inheriting from the pipeline class instead of using the @pipeline_def decorator)?

JanuszL commented 1 year ago

Hi @dyhan316,

Thank you for reaching out.

Is this because the original code uses legacy APIs (for example, inheriting from the pipeline class instead of using the @pipeline_def decorator)?

The deprecation message is just a warning; DALI still supports this functionality/API. The error you see occurs because imgs and lbls are both expected to be lists of file paths, while you replaced lbls with a NumPy array of zeros, so the label reader ("ReaderY") tries to open a file named 0.0. What you can do is save a test, zero-valued label to disk and return a list of the same length as the images in which every entry is the path to that one label file.
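
The workaround described above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name make_zero_label_files, the uint8 dtype, and the choice to give the label the same shape as the first image are all assumptions, and the label shape the pipeline actually expects may differ.

```python
import os

import numpy as np


def make_zero_label_files(img_paths, label_path, dtype=np.uint8):
    """Write one zero-filled label file and return its path once per image.

    Hypothetical helper. Assumption: the label has the same shape as the
    first image; adjust if the pipeline expects something else.
    """
    if not img_paths:
        raise ValueError("img_paths is empty")
    # np.save appends ".npy" when it is missing, so normalize the path first
    # to make sure the returned paths match the file written to disk.
    if not label_path.endswith(".npy"):
        label_path += ".npy"
    os.makedirs(os.path.dirname(label_path) or ".", exist_ok=True)
    # mmap_mode="r" avoids loading the whole image just to read its shape.
    shape = np.load(img_paths[0], mmap_mode="r").shape
    np.save(label_path, np.zeros(shape, dtype=dtype))
    # The same file path is repeated so that len(lbls) == len(imgs).
    return [label_path] * len(img_paths)
```

In get_dataloader_fn, the line lbls = np.zeros(len(imgs)) would then become something like lbls = make_zero_label_files(imgs, label_path). It is probably safest to keep label_path outside data_dir, since load_data(data_dir, "*.npy") would otherwise pick the label file up as an image on the next run.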

dyhan316 commented 1 year ago

Thank you so much for your fast and detailed response!