Closed SandLeg closed 5 years ago
@SandLeg Validation during training should be:
val: ...
val_pair: ...
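For reference, a complete `custom_pairs` entry might look like the following (the key names follow this thread; the dataset name and paths are placeholders):

```yaml
# Hypothetical dataset.yaml entry -- dataset name and paths are placeholders.
MYDATA:
  train: /path/to/dataHR/Train/        # HR training frames
  train_pair: /path/to/dataLR/Train/   # matching LR training frames
  val: /path/to/dataHR/Val/            # HR validation frames
  val_pair: /path/to/dataLR/Val/       # matching LR validation frames
  test: /path/to/dataHR/Test/
  test_pair: /path/to/dataLR/Test/
  param:
    parser: custom_pairs
```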
Thanks for the reply. That is the way I have it in my file; I just did not list it that way in the post. Just to be sure, I ran it again with the same format (i.e. train, train_pair, val, val_pair, test, test_pair) and I am still getting "WARNING: frames is empty."
@SandLeg Since I'm not able to reproduce your issue, could you fetch the latest 'dev' branch and do the following tests?
# run from the VSRTorch directory
python train.py vespcn --cuda --dataset videopair --data_config ../UTest/data/fake_datasets.yml -v
# run from the Train directory
python run.py --model vespcn --data_config ../UTest/data/fake_datasets.yml --dataset videopair -v
It seems your tests are OK. So, coming back to the original issue: could it be related to a mistaken data URL?
I went back into datasets.yaml and edited my URLs to match what you had in fake_datasets.yaml: I used quotes, added the root directory tag (which I wasn't using previously), and trimmed the ends of the URLs. Nothing I did stopped the "WARNING: frames is empty. [size=10]" error from showing up.
At the end it also says frames is empty in the unit test. Is that supposed to be there?
I recall that I assumed the validation data is no more than 1 GB. If the total size of all the validation data exceeds 1 GB, that may cause this problem.
The 1 GB limit is hardcoded in Trainer.py; you can change it as you wish.
The last line of the tensorflow run is expected.
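As an illustration only (not the actual Trainer.py code), the cap amounts to a check like this:

```python
import os

# Illustrative sketch -- names and structure are assumptions,
# not the actual Trainer.py implementation.
DEFAULT_VAL_LIMIT = 1 * 2 ** 30  # the hardcoded 1 GB cap

def total_size(paths):
    """Sum the on-disk size of all files in `paths`."""
    return sum(os.path.getsize(p) for p in paths)

def within_limit(paths, limit=DEFAULT_VAL_LIMIT):
    """Return True if the validation files fit under the cap."""
    return total_size(paths) <= limit
```

Raising the constant (as discussed below) simply moves that threshold.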
(Deleted some out-of-order replies.)
Yeah, my training and validation sets are fairly large. The validation set has around 200 images and comes to about 4.5 GB; the training set has over 400 images, so it is probably much larger than that. I just changed the limit to 8 GB and it seems like nothing changed. I'll test it some more.
For training, you can use --memory_limit=???GB to constrain memory usage during training.
For validation, I thought too many files could slow down the overall speed. If you have enough runtime memory, you can raise that value as I mentioned.
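As a sketch of what such a flag has to do internally (this is just the idea, not the repo's actual flag handling), a value like "8GB" gets parsed into a byte count:

```python
import re

# Sketch: convert a human-readable size such as "8GB" into bytes.
# Mirrors the idea behind --memory_limit, but is not the repo's code.
_UNITS = {"B": 1, "KB": 2 ** 10, "MB": 2 ** 20, "GB": 2 ** 30}

def parse_size(text):
    """Convert strings like '8GB' or '512MB' to a byte count."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMG]?B)", text.strip().upper())
    if not m:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = m.groups()
    return int(float(number) * _UNITS[unit])
```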
Yeah, that still did not do anything. So I tried running CARN and got a fatal file error pointing at a specific line in the code. Maybe this will help you diagnose the problem:
```
2019-05-08 12:35:18,164 INFO: Fitting: [CARN]
2019-05-08 12:35:18,175 WARNING: frames is empty. [size=3200]
| 2019-05-08 12:35:18 | Epoch: 1/13 | LR: 0.0001 |
0batch [00:00, ?batch/s]
Exception in thread fetch_thread_7:
Traceback (most recent call last):
  File "/user/miniconda3/envs/py36/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/user/miniconda3/envs/py36/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/user/VideoSuperResolution/VSR/DataLoader/Loader.py", line 261, in _prefetch
    frames = self.parser[index * interval:]
  File "/user/VideoSuperResolution/VSR/DataLoader/Parser/custom_pairs.py", line 43, in __getitem__
    ret += self.gen_frames(copy.deepcopy(vf), vf.frames)
  File "/user/VideoSuperResolution/VSR/DataLoader/Parser/custom_pairs.py", line 72, in gen_frames
    lr = [img for img in vf.pair.read_frame(depth)]
  File "/user/VideoSuperResolution/VSR/DataLoader/VirtualFile.py", line 352, in read_frame
    image_bytes = [BytesIO(self.read()) for _ in range(frames)]
  File "/user/VideoSuperResolution/VSR/DataLoader/VirtualFile.py", line 352, in <listcomp>
Test: 100%|###############################################################| 10/10 [00:00<00:00, 30.97it/s] psnr: 11.185015,
```
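For context, the "frames is empty" warning comes from the prefetch path shown in the traceback. A much-simplified sketch of that step (names are illustrative, not the actual Loader.py code):

```python
import logging

logger = logging.getLogger("VSR")

# Simplified sketch of the prefetch step from the traceback above;
# names and the slicing scheme are illustrative, not the repo's code.
def prefetch(parser, index, interval):
    """Slice the parser for this shard; warn when nothing comes back."""
    frames = parser[index * interval:(index + 1) * interval]
    if not frames:
        # A message like "WARNING: frames is empty. [size=N]" means the
        # parser produced no frames for this slice, e.g. bad paths or
        # empty folders.
        logger.warning("frames is empty. [size=%d]", len(parser))
    return frames
```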
Still not able to reproduce the problem. Have you pulled the latest master or dev branch? What's your data exactly? I notice they are TIFF; can you send me a sample image?
The custom_pairs parser needs to support video better. In development...
I started on the master branch, but I have been on the dev branch since you told me to run the tests. I'm not sure I am able to share the images directly, but they come from here: http://xviewdataset.org/ . The frames come from a single image that is downsampled some number of times using a convolutional filter, so the video would look like a sliding window moving through the original image if you viewed every frame in order. The data I am trying to test only has four frames, but I need to be able to work with up to 64 frames for other data that I have yet to create.
The images in the HR dataset are also large: on average they are probably 1000x1000, and the scale I am working with is 2, so the LR data is half that size on average.
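A quick back-of-the-envelope estimate shows why a set like that overruns a 1 GB cap (the image counts and sizes are the rough figures from this thread; 3 channels at 8 bits per channel is an assumption):

```python
# Back-of-the-envelope estimate for the validation set described above
# (assumed: ~200 HR images of ~1000x1000, 3 channels, 8 bits/channel,
# LR pairs at half resolution per side for scale 2 -- rough numbers only).
hr_bytes = 200 * 1000 * 1000 * 3   # ~572 MiB of raw HR pixels
lr_bytes = 200 * 500 * 500 * 3     # ~143 MiB of raw LR pixels
total_gib = (hr_bytes + lr_bytes) / 2 ** 30
# Already ~0.7 GiB raw at 8 bits; 16-bit or float32 TIFFs multiply this
# by 2-4x, which easily overruns a 1 GB validation cap.
print(f"{total_gib:.2f} GiB")
```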
@SandLeg Fixed the related issue. Try my latest dev branch and tell me if anything goes wrong.
It was caused by a memory leak. You will need to use --memory_limit to avoid consuming too much memory.
As for the hardcoded 1 GB for validation, you don't need to change that.
I figured out part of the issue, but I am still not getting it to run. The last folder in my validation and training sets was empty, which is probably why I was getting the "No File in Frame" issue. Now all the models hang at the same point: I get the "WARNING: frames is empty. [size=3200]" message, it hangs for a couple of seconds, and then it runs the test, which never happened before. Here's what's happening now:
```
2019-05-10 09:55:36,456 INFO: Fitting: [VESPCN]
2019-05-10 09:55:36,476 WARNING: frames is empty. [size=3200]
| 2019-05-10 09:55:36 | Epoch: 2/13 | LR: 0.0001 |
0batch [00:00, ?batch/s]
Test: 100%|###############################################################| 10/10 [00:00<00:00, 17.85it/s] psnr: 14.660599,
2019-05-10 09:55:38,422 WARNING: frames is empty. [size=3200]
```
Do you have any suggestions on dealing with the memory issue? Should I try decreasing the size of my dataset to see if that gets it to run?
That's odd. You can try to start with a subset of your dataset, for example, 10 videos with 10 frames each.
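One way to carve out such a subset (a sketch; it assumes the folder-per-video layout described in this thread, and all names are placeholders):

```python
import os
import shutil

# Sketch: copy the first n_videos folders (and the first n_frames files
# of each) into a subset directory. Assumes the folder-per-video layout
# from this thread; all names here are placeholders.
def make_subset(src_root, dst_root, n_videos=10, n_frames=10):
    videos = sorted(os.listdir(src_root))[:n_videos]
    for video in videos:
        src_dir = os.path.join(src_root, video)
        dst_dir = os.path.join(dst_root, video)
        os.makedirs(dst_dir, exist_ok=True)
        for frame in sorted(os.listdir(src_dir))[:n_frames]:
            shutil.copy2(os.path.join(src_dir, frame), dst_dir)
    return videos
```

Run it once for the HR tree and once for the LR tree so the pairs stay aligned.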
I tried it with 10 folders with 4 frames each and it's still doing the same thing. Something must not be loading right. I really don't know. I'll try some more investigation on Monday.
Much beyond my expectation. What if using another dataset? For example, let's use VID4 to train.
Sorry for the late reply; I have been busy and am just getting back to this today. I fixed the issue: my path in dataset.yaml was wrong, which, combined with the earlier missing-data issue, was probably the main source of the problem. Thanks for all of your help.
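For anyone hitting the same thing, a small sanity check over the paths in dataset.yaml would have caught both the wrong path and the empty folders. A sketch (the keys and layout are assumptions, not the repo's actual config handling):

```python
import os

# Sketch: verify every dataset path exists and every video folder
# actually contains frames. Keys and layout are assumptions based on
# this thread, not the repo's actual config handling.
def check_dataset_paths(paths):
    """Return a list of human-readable problems found."""
    problems = []
    for name, root in paths.items():
        if not os.path.isdir(root):
            problems.append(f"{name}: {root} does not exist")
            continue
        for folder in sorted(os.listdir(root)):
            full = os.path.join(root, folder)
            if os.path.isdir(full) and not os.listdir(full):
                problems.append(f"{name}: {full} is empty")
    return problems
```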
And while you're reading this, do you know how to configure training for multiple GPUs?
The existing models don't support multiple GPUs. But if you're familiar with multi-GPU programming, you can code as usual. The entire training process is encapsulated in the Train function, so you can manipulate your training procedure inside it. Moreover, you can override Trainer to control the data path as you like.
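As an illustration of that override pattern (`Trainer` here is a stand-in base class, and the device round-robin is a placeholder; the repo's real Train/Trainer interfaces may differ):

```python
# Sketch of the override pattern described above. `Trainer` is a
# stand-in base class; the repo's real Train/Trainer interfaces differ.
class Trainer:
    def fit(self, data):
        for batch in data:
            self.train_step(batch)

    def train_step(self, batch):
        raise NotImplementedError

class MultiGPUTrainer(Trainer):
    """Override the per-batch step to shard work across devices."""
    def __init__(self, devices):
        self.devices = devices
        self.handled = []

    def train_step(self, batch):
        # Round-robin batches over devices. Real multi-GPU code would
        # move tensors to each device and synchronize gradients here.
        device = self.devices[len(self.handled) % len(self.devices)]
        self.handled.append((device, batch))
```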
Close this thread since the problem is solved.
Tacking on to this, .tif files support 32 bit float values. In the case that the training set was composed of float arrays, rather than image-like integer arrays, .tif seems like a good choice. Would your code need substantial modifications to run with 32 bit float .tif files? Changing the .png output, mentioned in #57, would be one issue, I suspect.
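For float data, one common workaround is to normalize arrays into an integer range before a PNG-style writer sees them. A generic sketch using numpy (not part of this repo, and not its actual I/O path):

```python
import numpy as np

# Sketch: map a float32 array onto uint8 for PNG-style output.
# Generic conversion only -- not the repo's actual output code.
def float_to_uint8(arr):
    """Linearly rescale a float array to the [0, 255] uint8 range."""
    arr = np.asarray(arr, dtype=np.float32)
    lo, hi = float(arr.min()), float(arr.max())
    if hi <= lo:                      # constant image: avoid divide-by-zero
        return np.zeros(arr.shape, dtype=np.uint8)
    scaled = (arr - lo) / (hi - lo)   # now in [0, 1]
    return np.round(scaled * 255).astype(np.uint8)
```

Note that this loses the absolute scale of the floats, so it suits visualization rather than round-tripping training data.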
I'm having some issues loading my own training dataset into the various models. I added the directories of the training data into dataset.yaml with the following:

```
DATA:
  train: /home/users/user/dataset/dataHR/Train/
  test: /home/users/user/dataset/dataHR/Val/
  train_pair: /home/users/user/dataset/dataLR/Train/
  test_pair: /home/users/user/dataset/dataLR/Val/
  param:
    parser: custom_pairs
```
Inside of the data, the folders have the structure:

```
0001:
  0001_00.tif
  0001_01.tif
  ...
0002:
  0002_00.tif
  0002_01.tif
  ...
...
```
When I run VESPCN in Pytorch, the model is able to train but the testing images are not loaded:

```
| 2019-05-02 14:18:59 | Epoch: 5/5 | LR: 0.0001 |
100%|#######################################| 200/200 [01:03<00:00, 3.53batch/s, image=00.00097, flow=00.00035, tv=01.90621]
| Epoch average image = 0.001059 |
| Epoch average flow = 0.000375 |
| Epoch average tv = 2.140414 |
2019-05-02 14:20:02,334 WARNING: frames is empty. [size=10]
Test: 0it [00:00, ?it/s]
```
When I run VESPCN in tensorflow, the same thing happens:

```
INFO:tensorflow:Fitting: VESPCN
| 2019-05-02 14:26:59 | Epoch: 1/10 | LR: 0.0001 |
100%|###################################| 200/200 [00:42<00:00, 4.79batch/s, l2=23615.77148, loss=24443.99414, me=828.22260]
| Epoch average l2 = 35973.230469 |
| Epoch average loss = 36285.953125 |
| Epoch average me = 312.723450 |
frames is empty. [size=10]
Test: 0it [00:00, ?it/s]
```
When I run FRVSR I get the following errors:

```
2019-05-02 14:21:51,380 WARNING: frames is empty. [size=3200]
| 2019-05-02 14:21:51 | Epoch: 8/10 | LR: 0.0001 |
0batch [00:00, ?batch/s]
2019-05-02 14:21:51,387 WARNING: frames is empty. [size=10]
Test: 0it [00:00, ?it/s]
```
Then when I run FRVSR without custom pairs, it seems to train but the testing data does not load:

```
| 2019-05-02 14:44:21 | Epoch: 1/11 | LR: 0.0001 |
100%|#####################| 200/200 [01:23<00:00, 3.14batch/s, total_loss=00.00239, image_loss=00.00236, flow_loss=00.00006]
| Epoch average total_loss = 0.004055 |
| Epoch average image_loss = 0.003931 |
| Epoch average flow_loss = 0.000247 |
2019-05-02 14:45:44,950 WARNING: frames is empty. [size=10]
```
Any type of advice or fix would be appreciated.