LoSealL / VideoSuperResolution

A collection of state-of-the-art video and single-image super-resolution architectures, reimplemented in TensorFlow.
MIT License

Issues loading dataset #40

Closed · SandLeg closed this issue 5 years ago

SandLeg commented 5 years ago

I'm having some issues loading my own training dataset into the various models. I added the directories of the training data to dataset.yaml with the following configuration:

DATA:
  train: /home/users/user/dataset/dataHR/Train/
  test: /home/users/user/dataset/dataHR/Val/
  train_pair: /home/users/user/dataset/dataLR/Train/
  test_pair: /home/users/user/dataset/dataLR/Val/
  param:
    parser: custom_pairs

Inside the dataset, the folders have the structure:

0001/
  0001_00.tif
  0001_01.tif
  ...
0002/
  0002_00.tif
  0002_01.tif
  ...
...

When I run VESPCN in PyTorch, the model is able to train, but the testing images are not loaded:

| 2019-05-02 14:18:59 | Epoch: 5/5 | LR: 0.0001 |
100%|#######################################| 200/200 [01:03<00:00, 3.53batch/s, image=00.00097, flow=00.00035, tv=01.90621]
| Epoch average image = 0.001059 |
| Epoch average flow = 0.000375 |
| Epoch average tv = 2.140414 |
2019-05-02 14:20:02,334 WARNING: frames is empty. [size=10]
Test: 0it [00:00, ?it/s]

When I run VESPCN in TensorFlow, the same thing happens:

INFO:tensorflow:Fitting: VESPCN
| 2019-05-02 14:26:59 | Epoch: 1/10 | LR: 0.0001 |
100%|###################################| 200/200 [00:42<00:00, 4.79batch/s, l2=23615.77148, loss=24443.99414, me=828.22260]
| Epoch average l2 = 35973.230469 |
| Epoch average loss = 36285.953125 |
| Epoch average me = 312.723450 |
frames is empty. [size=10]
Test: 0it [00:00, ?it/s]

When I run FRVSR, I get the following errors:

2019-05-02 14:21:51,380 WARNING: frames is empty. [size=3200]
| 2019-05-02 14:21:51 | Epoch: 8/10 | LR: 0.0001 |
0batch [00:00, ?batch/s]
2019-05-02 14:21:51,387 WARNING: frames is empty. [size=10]
Test: 0it [00:00, ?it/s]

Then when I run FRVSR without custom pairs, it seems to train, but the testing data does not load:

| 2019-05-02 14:44:21 | Epoch: 1/11 | LR: 0.0001 |
100%|#####################| 200/200 [01:23<00:00, 3.14batch/s, total_loss=00.00239, image_loss=00.00236, flow_loss=00.00006]
| Epoch average total_loss = 0.004055 |
| Epoch average image_loss = 0.003931 |
| Epoch average flow_loss = 0.000247 |
2019-05-02 14:45:44,950 WARNING: frames is empty. [size=10]

Any advice or fix would be appreciated.

LoSealL commented 5 years ago

@SandLeg Validation during training should be:

val: ...
val_pair: ...
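
Putting the pieces together, a minimal dataset.yaml sketch combining the paths from the original post with the val keys above (the layout is an assumption based on this thread, reusing the Val folders for both validation and testing):

DATA:
  train: /home/users/user/dataset/dataHR/Train/
  val: /home/users/user/dataset/dataHR/Val/
  test: /home/users/user/dataset/dataHR/Val/
  train_pair: /home/users/user/dataset/dataLR/Train/
  val_pair: /home/users/user/dataset/dataLR/Val/
  test_pair: /home/users/user/dataset/dataLR/Val/
  param:
    parser: custom_pairs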
SandLeg commented 5 years ago

> @SandLeg Validation during training should be:
>
> val: ...
> val_pair: ...

Thanks for the reply. That is the way I have it in my file; I just did not list it that way in the post. Just to be sure, I ran it again with the same format (i.e. train, train_pair, val, val_pair, test, test_pair) and I am still getting "WARNING: frames is empty."

LoSealL commented 5 years ago

@SandLeg Since I'm not able to reproduce your issue, could you fetch the latest 'dev' branch and do the following tests?

# run from the VSRTorch directory
python train.py vespcn --cuda --dataset videopair --data_config ../UTest/data/fake_datasets.yml -v

# run from the Train directory
python run.py --model vespcn --data_config ../UTest/data/fake_datasets.yml --dataset videopair -v
SandLeg commented 5 years ago

> It seems your test is OK. So, coming back to the original issue: could it be related to a mistaken data URL?

I went back into datasets.yaml and edited my URLs to make them similar to what you had in fake_datasets.yml. I used quotes, used the root directory tag (which I wasn't using previously), and removed the ends of the URLs. Nothing I did stopped the "WARNING: frames is empty. [size=10]" message from showing up.

At the end, the unit test also says "frames is empty." Is that supposed to be there?

LoSealL commented 5 years ago

I recall now: I assumed that the validation data is no more than 1GB. If the total size of the validation data exceeds 1GB, that may cause this problem.

The 1GB limit is hardcoded in Trainer.py. You can change it as you wish.
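
As a quick way to check whether a validation set exceeds that cap before training, here is a small stand-alone sketch (the path is a placeholder):

import os

def dir_size_gb(root):
    # Sum the sizes of all files under `root`, in gigabytes.
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / 1024 ** 3

size = dir_size_gb('/home/users/user/dataset/dataHR/Val/')  # placeholder path
print('validation set: %.2f GB (default cap is 1 GB)' % size)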

The last line of the TensorFlow run is expected.

LoSealL commented 5 years ago

Deleted some out-of-order replies.

SandLeg commented 5 years ago

> I recall now: I assumed that the validation data is no more than 1GB. If the total size of the validation data exceeds 1GB, that may cause this problem.
>
> The 1GB limit is hardcoded in Trainer.py. You can change it as you wish.
>
> The last line of the TensorFlow run is expected.

Yeah, my training and validation sets are fairly large. The validation set has around 200 images and comes to about 4.5 GB; the training set has over 400 images, so it is probably much larger than that. I just changed the limit to 8GB and it seems like nothing changed. I'll test it some more.

LoSealL commented 5 years ago

For training, you can use --memory_limit=???GB to constrain memory usage during training. For validation, I thought too many files could slow down the overall speed; if you have enough runtime memory, you could raise that value as I mentioned.
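
For example, reusing the run.py invocation from the tests above (the 4GB value here is an arbitrary illustration):

python run.py --model vespcn --dataset videopair --data_config ../UTest/data/fake_datasets.yml --memory_limit=4GB -v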

SandLeg commented 5 years ago

> For training, you can use --memory_limit=???GB to constrain memory usage during training. For validation, I thought too many files could slow down the overall speed; if you have enough runtime memory, you could raise that value as I mentioned.

Yeah, that still did not do anything. So I tried running CARN and got a fatal file error pointing at a specific line in the code. Maybe this will help you diagnose the problem:

2019-05-08 12:35:18,164 INFO: Fitting: [CARN]
2019-05-08 12:35:18,175 WARNING: frames is empty. [size=3200]
| 2019-05-08 12:35:18 | Epoch: 1/13 | LR: 0.0001 |
0batch [00:00, ?batch/s]
Exception in thread fetch_thread_7:
Traceback (most recent call last):
  File "/user/miniconda3/envs/py36/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/user/miniconda3/envs/py36/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/user/VideoSuperResolution/VSR/DataLoader/Loader.py", line 261, in _prefetch
    frames = self.parser[index * interval:]
  File "/user/VideoSuperResolution/VSR/DataLoader/Parser/custom_pairs.py", line 43, in __getitem__
    ret += self.gen_frames(copy.deepcopy(vf), vf.frames)
  File "/user/VideoSuperResolution/VSR/DataLoader/Parser/custom_pairs.py", line 72, in gen_frames
    lr = [img for img in vf.pair.read_frame(depth)]
  File "/user/VideoSuperResolution/VSR/DataLoader/VirtualFile.py", line 352, in read_frame
    image_bytes = [BytesIO(self.read()) for _ in range(frames)]
  File "/user/VideoSuperResolution/VSR/DataLoader/VirtualFile.py", line 352, in <listcomp>
    image_bytes = [BytesIO(self.read()) for _ in range(frames)]
  File "/user/VideoSuperResolution/VSR/DataLoader/VirtualFile.py", line 126, in read
    raise FileNotFoundError('No frames in File')
FileNotFoundError: No frames in File

Test: 100%|###############################################################| 10/10 [00:00<00:00, 30.97it/s]
psnr: 11.185015,

LoSealL commented 5 years ago

I'm still not able to reproduce the problem. Have you pulled the latest master or dev branch? What is your data exactly? I notice the files are TIFF; can you send me a sample image?

LoSealL commented 5 years ago

The custom_pairs parser needs better support for video. In development...

SandLeg commented 5 years ago

> I'm still not able to reproduce the problem. Have you pulled the latest master or dev branch? What is your data exactly? I notice the files are TIFF; can you send me a sample image?

I started on the master branch, but I have been on the dev branch since you told me to run the tests. I'm not sure I am able to share the images directly, but they come from here: http://xviewdataset.org/. The frames come from a single image that is downsampled x number of times using a convolutional filter, so the video would look like a sliding window moving across the original image if you viewed every frame in order. The data I am trying to test has only four frames, but I need it to work with up to 64 frames for other data I have yet to create.

The images in the HR dataset are also large: on average they are probably 1000x1000, and since the scale I am working with is 2, the LR data is half that size on average.
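
As an aside, generating an x2 LR frame from an HR TIFF can be sketched like this (bicubic resampling stands in for the convolutional filter described above, so this is only an approximation; file names are placeholders):

from PIL import Image

hr = Image.open('0001_00.tif')  # one HR frame, e.g. roughly 1000x1000
lr = hr.resize((hr.width // 2, hr.height // 2), Image.BICUBIC)
lr.save('lr_0001_00.tif')  # placeholder output name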

LoSealL commented 5 years ago

@SandLeg I fixed the related issue. Try my latest dev branch and tell me if anything goes wrong. It was caused by a memory leak. You will need to use --memory_limit to avoid consuming too much memory. As for the hardcoded 1GB for validation, you don't need to change that.

SandLeg commented 5 years ago

> @SandLeg I fixed the related issue. Try my latest dev branch and tell me if anything goes wrong. It was caused by a memory leak. You will need to use --memory_limit to avoid consuming too much memory. As for the hardcoded 1GB for validation, you don't need to change that.

I figured out part of the issue, but I still cannot get it to run. The last folder in both my validation and training sets was empty, which is probably why I was getting the "No frames in File" error. Now all the models hang at the same point: I get the "WARNING: frames is empty. [size=3200]" message, then it hangs for a couple of seconds and runs the test, which never happened before. Here's what's happening now:

2019-05-10 09:55:36,456 INFO: Fitting: [VESPCN]
2019-05-10 09:55:36,476 WARNING: frames is empty. [size=3200]
| 2019-05-10 09:55:36 | Epoch: 2/13 | LR: 0.0001 |
0batch [00:00, ?batch/s]
Test: 100%|###############################################################| 10/10 [00:00<00:00, 17.85it/s]
psnr: 14.660599,
2019-05-10 09:55:38,422 WARNING: frames is empty. [size=3200]

Do you have any suggestions for dealing with the memory issue? Should I try decreasing the size of my dataset to see if that gets it to run?
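
Given that an empty folder turned out to trigger the "No frames in File" error above, a quick sanity check over each dataset root is worth running (the path is a placeholder; repeat for the LR and Val roots):

import os

root = '/home/users/user/dataset/dataHR/Train/'  # placeholder path
for entry in sorted(os.listdir(root)):
    folder = os.path.join(root, entry)
    if os.path.isdir(folder) and not os.listdir(folder):
        print('empty folder:', folder)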

LoSealL commented 5 years ago

That's odd. You could try starting with a subset of your dataset, for example 10 videos with 10 frames each.

SandLeg commented 5 years ago

I tried it with 10 folders of 4 frames each and it's still doing the same thing. Something must not be loading right; I really don't know. I'll investigate more on Monday.

LoSealL commented 5 years ago

That is well beyond my expectation. What about using another dataset? For example, let's train on VID4.

SandLeg commented 5 years ago

> That is well beyond my expectation. What about using another dataset? For example, let's train on VID4.

Sorry for the late reply; I have been busy and am just getting back to this today. I fixed the issue: my path in dataset.yaml was wrong, which, combined with the previous issue of the missing data, was probably the main source of the problem. Thanks for all of your help.

And while you're reading this: do you know how to configure training for multiple GPUs?

LoSealL commented 5 years ago

The existing models don't support multiple GPUs. But if you're familiar with multi-GPU programming, you can write that code as usual.

The entire training process is encapsulated in the Train function, so you can manipulate your training procedure inside it. Moreover, you can override Trainer to control the data path as you like.
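
For the PyTorch side, the usual single-process data-parallel pattern looks like the sketch below; how it would hook into this repo's Trainer is left open, so the model here is just a stand-in:

import torch
import torch.nn as nn

# Toy network standing in for an actual SR model.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1),
                      nn.ReLU(),
                      nn.Conv2d(64, 3, 3, padding=1))

if torch.cuda.device_count() > 1:
    # Replicate the model across all visible GPUs; each batch is
    # split along dim 0 and the outputs are gathered automatically.
    model = nn.DataParallel(model)
model = model.to('cuda' if torch.cuda.is_available() else 'cpu')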

Closing this thread since the problem is solved.

dtlloyd commented 5 years ago

Tacking on to this: .tif files support 32-bit float values. In the case that the training set is composed of float arrays rather than image-like integer arrays, .tif seems like a good choice. Would your code need substantial modifications to run with 32-bit float .tif files? Changing the .png output, mentioned in #57, would be one issue, I suspect.
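
For context, loading a 32-bit float TIFF and rescaling it to 8-bit could be sketched as below (the filename and the min-max scaling rule are placeholders; a real pipeline would pick a fixed value range):

import numpy as np
from PIL import Image

img = Image.open('sample_float32.tif')  # Pillow opens 32-bit float TIFFs as mode 'F'
arr = np.asarray(img, dtype=np.float32)

# Min-max rescale to 0..255; guard against a constant image.
lo, hi = float(arr.min()), float(arr.max())
u8 = ((arr - lo) / max(hi - lo, 1e-8) * 255.0).astype(np.uint8)
Image.fromarray(u8).save('sample_uint8.png')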