NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

read 0 files from 0 directories #1226

Closed. pawopawo closed this issue 5 years ago.

pawopawo commented 5 years ago

File "train_test.py", line 255, in main pipe.build() File "/usr/local/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 231, in build self._pipe.Build(self._names_and_devices) RuntimeError: [/opt/dali/dali/pipeline/operators/reader/loader/file_loader.h:108] Assert on "Size() > 0" failed: No files found. Stacktrace (28 entries): [frame 0]: /usr/local/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0xafd5e) [0x7f71e5d7dd5e] [frame 1]: /usr/local/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0x15768d) [0x7f71e5e2568d] [frame 2]: /usr/local/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0x17d65f) [0x7f71e5e4b65f] [frame 3]: /usr/local/lib/python3.6/site-packages/nvidia/dali/libdali.so(std::_Function_handler<std::unique_ptr<dali::OperatorBase, std::default_delete > (dali::OpSpec const&), std::unique_ptr<dali::OperatorBase, std::default_delete > (*)(dali::OpSpec const&)>::_M_invoke(std::_Any_data const&, dali::OpSpec const&)+0xc) [0x7f71e5dd87fc] [frame 4]: /usr/local/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0x1c98c4) [0x7f71e5e978c4] [frame 5]: /usr/local/lib/python3.6/site-packages/nvidia/dali/libdali.so(dali::InstantiateOperator(dali::OpSpec const&)+0x34e) [0x7f71e5e96cfe] [frame 6]: /usr/local/lib/python3.6/site-packages/nvidia/dali/libdali.so(dali::OpGraph::InstantiateOperators()+0x8f) [0x7f71e5da2c6f] [frame 7]: /usr/local/lib/python3.6/site-packages/nvidia/dali/libdali.so(dali::Pipeline::Build(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > >)+0xd58) [0x7f71e5ef21b8] [frame 8]: /usr/local/lib/python3.6/site-packages/nvidia/dali/backend_impl.cpython-36m-x86_64-linux-gnu.so(+0x37ecf) [0x7f71ea7eeecf] [frame 9]: /usr/local/lib/python3.6/site-packages/nvidia/dali/backend_impl.cpython-36m-x86_64-linux-gnu.so(+0x21af3) [0x7f71ea7d8af3] [frame 10]: /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x16c) [0x7f71fa43898c] [frame 11]: /usr/local/lib/libpython3.6m.so.1.0(+0x178b52) [0x7f71fa490b52] [frame 12]: /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x337) [0x7f71fa488ea7] [frame 13]: /usr/local/lib/libpython3.6m.so.1.0(+0x178d5a) [0x7f71fa490d5a] [frame 14]: /usr/local/lib/libpython3.6m.so.1.0(+0x178ad7) [0x7f71fa490ad7] [frame 15]: /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x337) [0x7f71fa488ea7] [frame 16]: /usr/local/lib/libpython3.6m.so.1.0(+0x178d5a) [0x7f71fa490d5a] [frame 17]: /usr/local/lib/libpython3.6m.so.1.0(+0x178ad7) [0x7f71fa490ad7] [frame 18]: /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x337) [0x7f71fa488ea7] [frame 19]: /usr/local/lib/libpython3.6m.so.1.0(PyEval_EvalCodeEx+0x21b) [0x7f71fa4872fb] [frame 20]: /usr/local/lib/libpython3.6m.so.1.0(PyEval_EvalCode+0x1b) [0x7f71fa4870db] [frame 21]: /usr/local/lib/libpython3.6m.so.1.0(+0x1ee1c2) [0x7f71fa5061c2] [frame 22]: /usr/local/lib/libpython3.6m.so.1.0(PyRun_FileExFlags+0x9a) [0x7f71fa50663a] [frame 23]: /usr/local/lib/libpython3.6m.so.1.0(PyRun_SimpleFileExFlags+0x1b7) [0x7f71fa5063f7] [frame 24]: /usr/local/lib/libpython3.6m.so.1.0(Py_Main+0x6c7) [0x7f71fa50d1e7] [frame 25]: python3(main+0x105) [0x400b05] [frame 26]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f71f9669b45] [frame 27]: python3() [0x400c51]

pawopawo commented 5 years ago

I was able to read the images before, and I don't know what the reason is. Is there a problem with my converted DALI data? The code I use is from https://github.com/NVIDIA/DALI/blob/master/docs/examples/pytorch/resnet50/main.py

awolant commented 5 years ago

Hi, thanks for the question. How did you modify the data? Maybe you ran into an issue similar to this known problem with the folder structure. I've updated the docs for FileReader in #1222 to make things clearer in this regard.

Quote from the updated docs:

FileReader supports a flat directory structure. The file_root directory should contain directories with images in them. To obtain labels, FileReader sorts the directories in file_root in alphabetical order and takes the index in this order as the class label.
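
To make the labeling rule concrete, here is a plain-Python illustration (not DALI code) of how the class indices follow from sorting the subdirectories of file_root; the path is just an example:

    import os

    file_root = "/data/imagenet/train"   # example path
    # FileReader assigns label i to the i-th subdirectory of file_root in alphabetical order.
    class_dirs = sorted(d for d in os.listdir(file_root)
                        if os.path.isdir(os.path.join(file_root, d)))
    label_map = {name: idx for idx, name in enumerate(class_dirs)}
    # e.g. {'n01440764': 0, 'n01443537': 1, ...} for ImageNet synset folders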

pawopawo commented 5 years ago

[Warning]: File _train.lst has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _train.idx has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _train.rec has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _train.lst has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _train.idx has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _train.rec has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _train.lst has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _train.idx has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _train.rec has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _val.lst has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _val.rec has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File _val.idx has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.

So the ‘traindir’ in the HybridTrainPipe function is not supposed to be converted DALI data? Is ‘traindir’ supposed to be the path to the original ImageNet dataset (JPEG images)?

pawopawo commented 5 years ago

[screenshot] The image above shows the path of my traindir.

awolant commented 5 years ago

What do you mean by converted DALI data? Yes, in this example traindir is ultimately passed to FileReader as the value of the file_root parameter. FileReader reads JPEG files.
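
For converted data in MXNet RecordIO form (the _train.rec / _train.idx files in the warnings above), DALI provides the MXNetReader operator instead of FileReader. A minimal sketch in the same ops-style API used elsewhere in this thread; the pipeline name and paths are placeholders:

    from nvidia.dali.pipeline import Pipeline
    import nvidia.dali.ops as ops
    import nvidia.dali.types as types

    class RecordIOPipe(Pipeline):  # placeholder name
        def __init__(self, batch_size, num_threads, device_id, rec_path, idx_path):
            super(RecordIOPipe, self).__init__(batch_size, num_threads, device_id, seed=12)
            # MXNetReader takes the .rec file(s) plus the matching .idx index file(s).
            self.input = ops.MXNetReader(path=[rec_path], index_path=[idx_path],
                                         random_shuffle=True)
            self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)

        def define_graph(self):
            jpegs, labels = self.input(name="Reader")
            images = self.decode(jpegs)
            return [images, labels]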

pawopawo commented 5 years ago


Thank you!

awolant commented 5 years ago

You're welcome. I'm closing the issue for now. If you have more questions or comments please do not hesitate to reopen or post another issue.

michaelklachko commented 5 years ago

When I use DALI hybrid pipelines to train on ImageNet:

        # Imports needed by this snippet (DALI 0.13-era API); `args` comes from the training script's argparse setup.
        from nvidia.dali.pipeline import Pipeline
        import nvidia.dali.ops as ops
        import nvidia.dali.types as types

        class HybridValPipe(Pipeline):
            def __init__(self, batch_size, num_threads, device_id, data_dir, crop, size):
                super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id)
                self.input = ops.FileReader(file_root=data_dir, shard_id=args.local_rank, num_shards=args.world_size, random_shuffle=False)
                self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
                self.res = ops.Resize(device="gpu", resize_shorter=size, interp_type=types.INTERP_TRIANGULAR)
                #self.cmnp = ops.CropMirrorNormalize(device="gpu", output_dtype=types.FLOAT, output_layout=types.NCHW, crop=(crop, crop),
                #image_type=types.RGB, mean=[0.485 * 255, 0.456 * 255, 0.406 * 255], std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
                self.cmnp = ops.CropMirrorNormalize(device="gpu", output_dtype=types.FLOAT, output_layout=types.NCHW, crop=(crop, crop),
                                                    image_type=types.RGB, mean=[0, 0, 0], std=[255, 255, 255])

            def define_graph(self):
                self.jpegs, self.labels = self.input(name="Reader")
                images = self.decode(self.jpegs)
                images = self.res(images)
                output = self.cmnp(images)
                return [output, self.labels]

I'm seeing a bunch of warnings (for both the train and val pipelines):

[Warning]: File . has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File .. has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File . has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.
[Warning]: File .. has extension that is not supproted by the decoder. Supported extensions: .jpg, .jpeg, .png, .gif, .bmp, .tif, .tiff, .pnm, .ppm, .pgm, .pbm.

These don't happen when I use:

import torch
from torchvision import datasets, transforms

train_dataset = datasets.ImageFolder(traindir, transforms.Compose([
    transforms.RandomResizedCrop(224), transforms.RandomHorizontalFlip(), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])]))
val_dataset = datasets.ImageFolder(valdir, transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])]))

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size,
                                           shuffle=True, num_workers=args.workers, pin_memory=False)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=args.batch_size,
                                         shuffle=False, num_workers=args.workers, pin_memory=False)

(pointing to the same train and val directories). I'm using DALI 0.13. There are hundreds, if not thousands, of these warnings, but it seems like DALI still reads the images: at the end it says "read 50000 files from 1000 directories" for the validation subset (which is the correct size of that subset), and despite the warnings the model trains successfully.
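
For reference, a pipeline like this is typically consumed on the PyTorch side through DALI's PyTorch plugin. A minimal single-GPU sketch assuming the HybridValPipe above (and that `args` is in scope, since the pipeline uses it); the batch size and valdir path are placeholders:

    from nvidia.dali.plugin.pytorch import DALIClassificationIterator

    pipe = HybridValPipe(batch_size=64, num_threads=4, device_id=0,
                         data_dir=valdir, crop=224, size=256)  # valdir: placeholder path
    pipe.build()
    val_loader = DALIClassificationIterator(pipe, size=pipe.epoch_size("Reader"))
    for data in val_loader:
        images = data[0]["data"]              # NCHW float tensor, already on the GPU
        labels = data[0]["label"].squeeze()   # class indices assigned by FileReader
    val_loader.reset()                        # reset the iterator between epochs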

JanuszL commented 5 years ago

Hi, DALI FileReader assumes the following folder structure:

root_dir
├── class_name_1
│   ├── file_1
│   ├── ...
│   └── file_N
├── class_name_2
│   ├── file_1
│   ├── ...
│   └── file_N
└── ...

This warning says that, besides the training/validation images, DALI detected files that are not supported. Usually, dot and dot-dot entries are not returned by the readdir calls, but they are in your case. We will add a special case for that. If an unrecognized file is encountered, it is simply skipped, so don't worry about the final result.
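
If you want to see up front which entries under file_root would trigger these warnings, a small plain-Python check along these lines can help (the helper name and path are just examples; the extension list mirrors the one printed in the warnings):

    import os

    SUPPORTED = {".jpg", ".jpeg", ".png", ".gif", ".bmp",
                 ".tif", ".tiff", ".pnm", ".ppm", ".pgm", ".pbm"}

    def report_unsupported(file_root):
        """Print entries that FileReader would warn about and then skip."""
        for class_dir in sorted(os.listdir(file_root)):
            full = os.path.join(file_root, class_dir)
            if not os.path.isdir(full):
                print("not a class directory:", class_dir)
                continue
            for name in os.listdir(full):
                if os.path.splitext(name)[1].lower() not in SUPPORTED:
                    print("would be skipped:", os.path.join(class_dir, name))

    report_unsupported("/data/imagenet/train")  # example path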

michaelklachko commented 5 years ago

BTW, this only happens when executed on a node in a cluster (on a mounted partition).

JanuszL commented 5 years ago

https://github.com/NVIDIA/DALI/pull/1318 should fix this problem