dvschultz / stylegan2

StyleGAN2 fork with some bonus content
http://arxiv.org/abs/1912.04958

Input / Output Error tf.records after 3-4 ticks #5

Closed Andreas-Atanasiu closed 4 years ago

Andreas-Atanasiu commented 4 years ago

Hello,

I'm using a custom dataset of ~6,000 images and converted it to tf.records. Training ran fine for 3–4 ticks; after that it failed with an error like the following:

(This one happened when I tried resuming training without rerunning the conversion.)

Local submit - run_dir: results/00003-stylegan2-birdaus-1gpu-config-f
dnnlib: Running training.training_loop.training_loop() on localhost...
Streaming data using training.dataset.TFRecordDataset...
Traceback (most recent call last):
  File "run_training.py", line 198, in <module>
    main()
  File "run_training.py", line 193, in main
    run(**vars(args))
  File "run_training.py", line 126, in run
    dnnlib.submit_run(**kwargs)
  File "/content/drive/My Drive/stylegan2-colab-d/stylegan2/dnnlib/submission/submit.py", line 343, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "/content/drive/My Drive/stylegan2-colab-d/stylegan2/dnnlib/submission/internal/local.py", line 22, in submit
    return run_wrapper(submit_config)
  File "/content/drive/My Drive/stylegan2-colab-d/stylegan2/dnnlib/submission/submit.py", line 280, in run_wrapper
    run_func_obj(**submit_config.run_func_kwargs)
  File "/content/drive/My Drive/stylegan2-colab-d/stylegan2/training/training_loop.py", line 142, in training_loop
    training_set = dataset.load_dataset(data_dir=dnnlib.convert_path(data_dir), verbose=True, **dataset_args)
  File "/content/drive/My Drive/stylegan2-colab-d/stylegan2/training/dataset.py", line 192, in load_dataset
    dataset = dnnlib.util.get_obj_by_name(class_name)(**kwargs)
  File "/content/drive/My Drive/stylegan2-colab-d/stylegan2/training/dataset.py", line 59, in __init__
    for record in tf.python_io.tf_record_iterator(tfr_file, tfr_opt):
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/pywrap_tensorflow_internal.py", line 1034, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.UnknownError: datasets/birdaus/birdaus-r10.tfrecords; Input/output error

birdaus-r10.tfrecords is present in the correct folder; I just checked.
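In case it helps anyone debugging this: a shard can be present on disk but still fail partway through a read. A quick, TensorFlow-free sanity check (a sketch of my own, not from the codebase) is to walk the TFRecord framing — 8-byte little-endian payload length, 4-byte masked CRC of the length, the payload, then a 4-byte masked CRC of the payload — and confirm the file parses to the end:

```python
import struct
from pathlib import Path

def count_records(path):
    """Count records in a TFRecord file by walking its framing:
    8-byte little-endian payload length, 4-byte masked CRC of the
    length, the payload, then a 4-byte masked CRC of the payload.
    Raises ValueError if the file is truncated mid-record."""
    data = Path(path).read_bytes()
    pos, n = 0, 0
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError(f"truncated length header at byte {pos}")
        (length,) = struct.unpack("<Q", data[pos:pos + 8])
        end = pos + 8 + 4 + length + 4  # header + length CRC + payload + payload CRC
        if end > len(data):
            raise ValueError(f"truncated record at byte {pos}")
        pos = end
        n += 1
    return n

# Hypothetical path -- substitute your own shard:
# count_records("datasets/birdaus/birdaus-r10.tfrecords")
```

Note this only validates the framing, not the CRC values themselves, so it catches truncation but not bit rot.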

I can't find any evidence that my runtime disconnected; I was only 2–3 hours in on Colab Pro. Maybe it's a storage issue?

Should I make the dataset smaller?

Thank you, Andreas

Andreas-Atanasiu commented 4 years ago

Actually, the problem is in how Google treats access to Drive files: reading a tf.records file from Drive during training on Colab counts toward that file's "download quota".

I didn't hit this with a smaller dataset (~1,000 images), but with ~6,000 images access to specific files was blocked after a couple of ticks.

See https://github.com/googlecolab/colabtools/issues/1020.

Sadly there's no workaround at the moment.

dvschultz commented 4 years ago

Yeah, Google Drive and Colab haven't played nice with each other for a few months now.

megrimm commented 3 years ago

Hello. I have been experiencing the exact same error:

r10.tfrecords; Input/output error

I started with ~5,000 images, which gave me the error almost immediately. I have since been working with ~2,000 images, which fails after a couple of hours of training.

Any solutions for this?
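One mitigation that has reportedly helped with the Drive download quota (my own suggestion, not something confirmed in this thread): copy the .tfrecords shards onto the Colab VM's local disk once, up front, so training reads from local storage instead of streaming repeatedly from Drive. A minimal sketch, with paths borrowed from the traceback above (adjust to your own layout):

```python
import shutil
from pathlib import Path

# Paths assumed from the traceback above -- adjust to your own layout.
DRIVE_DATA = Path("/content/drive/My Drive/stylegan2-colab-d/stylegan2/datasets/birdaus")
LOCAL_DATA = Path("/content/datasets/birdaus")

def stage_locally(src: Path, dst: Path) -> int:
    """Copy every .tfrecords shard from src to dst once; training can
    then read from the VM's local disk instead of streaming from Drive.
    Returns the number of shards copied."""
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for shard in sorted(src.glob("*.tfrecords")):
        shutil.copy2(shard, dst / shard.name)
        copied += 1
    return copied

if DRIVE_DATA.exists():  # guard so this is a no-op outside Colab
    print(f"staged {stage_locally(DRIVE_DATA, LOCAL_DATA)} shards to {LOCAL_DATA}")
```

After staging, point run_training.py's --data-dir at the local copy (e.g. /content/datasets). The trade-off is that the copy is lost when the runtime resets, so you pay the transfer cost once per session, but Drive is only read once per file instead of on every tick.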