NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

invalid TFRecord file when using pytorch and DALI tfrecord pipeline #1611

Closed knsong closed 4 years ago

knsong commented 4 years ago

hi, when using pytorch and DALI tfrecord pipeline for training, I meet such error:

Traceback (most recent call last):
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/apsara/TempRoot/Odps/ai_engine_dev_20191226021039184gblm8vzt2_36f92011_41c9_4cde_91be_baa8f36bed52_AlgoTask_0_0/PyTorchWorker@g34a01226.nt12#0/workspace/main.py", line 337, in main_worker
    rs['train'] = run_epoch(args, model, trainloader, criterion, optimizer, desc_default='train', epoch=epoch, scheduler=scheduler)
  File "/apsara/TempRoot/Odps/ai_engine_dev_20191226021039184gblm8vzt2_36f92011_41c9_4cde_91be_baa8f36bed52_AlgoTask_0_0/PyTorchWorker@g34a01226.nt12#0/workspace/main.py", line 164, in run_epoch
    for step, packed_data in enumerate(loader):
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/tqdm/std.py", line 1081, in __iter__
    for obj in iterable:
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 200, in __next__
    outputs.append(p.share_outputs())
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 402, in share_outputs
    return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline: Error in thread 0: [/opt/dali/dali/pipeline/operators/reader/parser/tfrecord_parser.h:66] Error while parsing TFRecord: [/opt/dali/dali/pipeline/operators/reader/parser/tfrecord_parser.h:63] Assert on "example.ParseFromArray(raw_data, length)" failed: Error in parsing - invalid TFRecord file!

EDIT I find error occurs on the TFRecord file which is > 800MB.

jantonguirao commented 4 years ago

Hi @knsong. A few questions:

JanuszL commented 4 years ago

One reason that comes to mind is a wrong index file generated for this TFRecord. This error is raised only when a record read from the TFRecord fails to be parsed by the protobuf library, so either DALI is reading the wrong chunk of data or the entry itself is corrupted. I think we can add a more verbose error message there to help people narrow down the problem.
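For reference, DALI's index files are plain text with one `offset size` pair per record, and DALI ships a tfrecord2idx script to generate them. A minimal Python sketch of the same idea (hypothetical helper name `build_index`, no CRC validation), based on the TFRecord framing of an 8-byte little-endian length, a 4-byte length CRC, the payload, and a 4-byte payload CRC:

```python
import struct

def build_index(tfrecord_path, idx_path):
    # TFRecord framing per record:
    #   8-byte little-endian payload length
    #   4-byte CRC of the length field
    #   payload
    #   4-byte CRC of the payload
    with open(tfrecord_path, "rb") as f, open(idx_path, "w") as idx:
        while True:
            start = f.tell()
            header = f.read(8)
            if len(header) < 8:
                break  # end of file
            (length,) = struct.unpack("<Q", header)
            # skip length CRC, payload, and payload CRC
            f.seek(4 + length + 4, 1)
            idx.write("{} {}\n".format(start, f.tell() - start))
```

If an index written this way differs from the one DALI is using, the index file is the likely culprit.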

knsong commented 4 years ago

Hi @knsong. A few questions:

  • Is this problem occurring with only one TFRecord file?

No. I find the error occurs with TFRecord files that are > 800MB, so there are multiple affected files.

  • Can you open and parse this file outside of DALI?

Yes, I can parse all the tf.train.Examples successfully from these files with TensorFlow, using the same idx files DALI uses.

  • Is this file part of some public dataset that we can access? This would help us investigate this problem

Sorry, the dataset belongs to the company and can't be made public. But: 1. I can parse all the tf.train.Examples successfully from the suspicious tfrecord files; 2. after I regenerated the tfrecord dataset so that no file is larger than 500MB, the problem never occurs. So maybe it is a bug related to the tfrecord file size?

Below is the sample code I use to parse tf.train.Example successfully from the tfrecord file.

    import tensorflow as tf

    offsets = open("./the_suspicious.tfrecord.idx", 'r').readlines()
    trf_handle = open("./the_suspicious.tfrecord", 'rb')
    for idx, offset in enumerate(offsets):
        print("idx: {}, offset: {}".format(idx, offset))
        start, size = map(int, offset.strip('\n').split(' '))
        trf_handle.seek(start)
        trf_handle.read(8)   # 8-byte little-endian record length
        trf_handle.read(4)   # 4-byte CRC of the length field
        print('f tell start:', trf_handle.tell())
        # payload size = record size minus length (8), length CRC (4), payload CRC (4)
        serialized_example_string = trf_handle.read(size - 8 - 4 - 4)
        print('f tell end:', trf_handle.tell())
        example = tf.train.Example.FromString(serialized_example_string)
        print('example:', example.features.feature['image/class/label'])

JanuszL commented 4 years ago

@knsong - I don't see anything obvious in the code that could lead to this problem. We will try to reproduce it with some artificial TFRecord that is ~800MB and get back to you when we have any results.
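In the meantime, one way to rule out a stale or mismatched index file is to cross-check each `offset size` entry against the record length actually stored at that offset. A minimal sketch (hypothetical helper name `check_index`, assumes the standard TFRecord framing):

```python
import struct

def check_index(tfrecord_path, idx_path):
    """Return index entries whose size disagrees with the record header
    found at their offset. Expected record layout: 8-byte length +
    4-byte length CRC + payload + 4-byte payload CRC."""
    mismatches = []
    with open(tfrecord_path, "rb") as f:
        for lineno, line in enumerate(open(idx_path)):
            start, size = map(int, line.split())
            f.seek(start)
            (length,) = struct.unpack("<Q", f.read(8))
            if 8 + 4 + length + 4 != size:
                mismatches.append((lineno, start, size, length))
    return mismatches
```

An empty result means the index agrees with the file's framing; any mismatch points at a bad index rather than a corrupted record.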

JanuszL commented 4 years ago

@knsong - I just created some TFRecord with size > 900MB and it works. Again, can you try to provide some minimal and self-contained reproduction that we can run and debug?

knsong commented 4 years ago

@knsong - I just created some TFRecord with size > 900MB and it works. Again, can you try to provide some minimal and self-contained reproduction that we can run and debug?

Sorry for the late feedback. It seems this is a bug related to a modification we made ourselves. We will check it on our side and report back.

JanuszL commented 4 years ago

Please reopen when you can share a repro.