Closed knsong closed 4 years ago
Hi @knsong. Few questions:
One reason that comes to mind is that a wrong index file was generated for this TFRecord. This error is raised only when a record read from the TFRecord fails to be parsed by the protobuf library. So either DALI reads the wrong chunk of data or the entry is corrupted. I think we can add a more verbose error message there to help people narrow down the problem.
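The index-mismatch hypothesis can be checked without DALI or TensorFlow. The sketch below is only an illustration, not part of DALI: the helper name `find_index_mismatches` is hypothetical, and it assumes each index line holds `<offset> <size>`, where `size` covers the whole framed record (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). It compares the length stored in the file with the size recorded in the index and does not verify the CRCs.

```python
import struct

# TFRecord framing around each payload:
# 8-byte length + 4-byte length CRC before it, 4-byte payload CRC after it.
HEADER_BYTES = 8 + 4
FOOTER_BYTES = 4


def find_index_mismatches(tfrecord_path, idx_path):
    """Return (line_no, reason) pairs for index entries that disagree with the file.

    Hypothetical helper: assumes each index line is "<offset> <size>".
    """
    mismatches = []
    with open(tfrecord_path, "rb") as rec, open(idx_path, "r") as idx:
        for line_no, line in enumerate(idx):
            if not line.strip():
                continue  # ignore blank lines
            offset, size = map(int, line.split())
            rec.seek(offset)
            raw = rec.read(8)
            if len(raw) < 8:
                mismatches.append((line_no, "offset past end of file"))
                continue
            # The first 8 bytes of a record are the little-endian payload length.
            (payload_len,) = struct.unpack("<Q", raw)
            expected = HEADER_BYTES + payload_len + FOOTER_BYTES
            if expected != size:
                mismatches.append(
                    (line_no, "index size {} != framed size {}".format(size, expected))
                )
    return mismatches
```

An empty result only means the index and the record lengths agree; a corrupted payload would still need a CRC or protobuf check.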
Hi @knsong. Few questions:
- Is this problem occurring with only one TFRecord file?
No, I find the error occurs whenever a TFRecord file is larger than 800MB, so multiple files are affected.
- Can you open and parse this file outside of DALI?
Yes, I can parse all the tf.train.Examples successfully from these files with tensorflow and the idx file used by DALI.
- Is this file part of some public dataset that we can access? This would help us investigate this problem
Sorry, the dataset belongs to the company and can't be made public. But because: 1. I can parse all the tf.train.Examples successfully from the suspicious tfrecord files; 2. after I regenerated the tfrecord dataset so that no tfrecord file is larger than 500MB, the problem never occurs, so maybe it is a bug related to the tfrecord file size?
Below is the sample code I can use to parse tf.train.Example successfully from the tfrecord file.
import tensorflow as tf

# Each idx line is "<record start offset> <record size>".
with open("./the_suspicious.tfrecord.idx", 'r') as idx_file:
    offsets = idx_file.readlines()

with open("./the_suspicious.tfrecord", 'rb') as trf_handle:
    for idx, offset in enumerate(offsets):
        print("idx: {}, offset: {}".format(idx, offset))
        start, size = map(int, offset.strip('\n').split(' '))
        trf_handle.seek(start)
        trf_handle.read(8)  # skip the 8-byte payload length
        trf_handle.read(4)  # skip the 4-byte CRC of the length
        print('f tell start:', trf_handle.tell())
        # Payload = record size minus length (8), length CRC (4) and payload CRC (4).
        serialized_example_string = trf_handle.read(size - 8 - 4 - 4)
        print('f tell end:', trf_handle.tell())
        example = tf.train.Example.FromString(serialized_example_string)
        print('example:', example.features.feature['image/class/label'])
@knsong - I don't see anything obvious that can lead to this problem in the code. We will try to reproduce it with some artificial TFRecord that is ~800MB and get back to you when we have any result.
@knsong - I just created some TFRecord with size > 900MB and it works. Again, can you try to provide some minimal and self-contained reproduction that we can run and debug?
Sorry for the late reply. It seems this bug is related to modifications we made ourselves. We will look into it and report back.
Please reopen when you can share a repro.
Hi, when using PyTorch with a DALI TFRecord pipeline for training, I hit the following error:
EDIT: I find the error occurs on TFRecord files that are > 800MB.