CalvinYang0 / CRNet


The code stops after showing ---------- Networks initialized ------------ #6

Open CharisWg opened 3 weeks ago

CharisWg commented 3 weeks ago

After running ./train_track1.sh in the terminal, the code stops after displaying '----Networks initialized....'. I've included the settings from my train_track1.sh script. It appears that it is not progressing to the train.py stage. I am attempting to train CRNet on my personal dataset. Can you suggest where the steps or settings might be wrong?

#!/bin/bash

echo "Start to train the model...."
dataroot="./data/NTIRE_Val/"  # including 'Train' and 'NTIRE_Val' folders
device='0'
name="coca"
build_dir="./ckpt/"$name

if [ ! -d "$build_dir" ]; then
    mkdir $build_dir
fi

# LOG=./ckpt/$name/`date +%Y-%m-%d-%H-%M-%S`.txt

LOG="./ckpt/$name/$(date +%Y-%m-%d-%H-%M-%S).txt"

echo "Using GPU with ID: $device"

python train.py \
    --dataset_name bracketire \
    --model cat \
    --name $name \
    --lr_policy step \
    --patch_size 128 \
    --niter 400 \
    --save_imgs True \
    --lr 1e-4 \
    --dataroot $dataroot \
    --batch_size 36 \
    --print_freq 500 \
    --calc_metrics True \
    --weight_decay 0.01 \
    --gpu_ids $device \
    -j 8 \
    --lr_decay_iters 27 \
    --block Convnext \
    --load_optimizers False \
    | tee $LOG

CalvinYang0 commented 3 weeks ago

First, you may need to change ‘dataroot’ to your own data path. Second, check whether there is any problem with the format of your dataset. Another possibility is a CPU or GPU bottleneck: training may be running so slowly that it has not yet completed enough iterations to print any training information. You can check your CPU and GPU usage. If it’s a CPU bottleneck, you could augment your dataset manually in advance instead of doing it during training. If it’s a GPU bottleneck, you could reduce the number of parameters in your model.
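To tell a slow start from a real hang, it can help to estimate how much work comes before the first log line. This is a rough sketch using the values from the script posted above. It assumes `--print_freq` counts training iterations, which the reply implies but the thread does not confirm.

```shell
#!/bin/bash
# Values copied from the train_track1.sh posted in this thread.
print_freq=500
batch_size=36

# Assumption: one log line is printed every `print_freq` iterations,
# and each iteration consumes one batch of `batch_size` samples.
samples_before_first_log=$((print_freq * batch_size))
echo "Samples processed before the first log line: $samples_before_first_log"
```

With these values the estimate is 18000 samples. Temporarily lowering `--print_freq` (for example to 1) while watching `nvidia-smi` and `top` in a second terminal should show whether training is actually advancing.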

CharisWg commented 3 weeks ago

Thank you for your reply. I am using a GPU and trying to replicate your code with the dataset from your shared link, which is BracketIRE. The code stops after displaying '----Networks initialized....' and does not proceed to the training step. I am wondering if there might be some steps I did wrong when I tried to replicate your code.

CalvinYang0 commented 3 weeks ago

I noticed that the last level of your ‘dataroot’ directory is ‘NTIRE_Val’. I suggest you try changing it to the parent directory, ‘./data’. Under the ‘dataroot’ directory, there should be two subfolders: ‘Train’ and ‘NTIRE_Val’. Since you didn’t receive any error messages and the process stopped at network initialization, it’s hard to diagnose the potential issue. Perhaps you could try using the BracketIRE code framework and then migrate our CRNet to that framework (although my framework should be consistent with theirs). Lastly, I apologize for any confusion. This is my first open-source code, so there are many imperfections in various aspects.
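The layout described above can be checked before launching training. This is a minimal sketch that assumes only what the reply states: a ‘Train’ and an ‘NTIRE_Val’ subfolder directly under ‘dataroot’. `check_dataroot` is a hypothetical helper, not part of CRNet.

```shell
#!/bin/bash
# Hypothetical helper (not part of CRNet): verify that a dataroot contains
# the two subfolders named in this thread. Returns non-zero if any is missing.
check_dataroot() {
    local root="$1"
    local status=0
    local sub
    for sub in Train NTIRE_Val; do
        if [ ! -d "$root/$sub" ]; then
            echo "Missing folder: $root/$sub" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_dataroot "./data" || exit 1` near the top of train_track1.sh would stop the script with a message instead of letting it stall silently. With the original setting, `./data/NTIRE_Val/`, the check would report both subfolders as missing unless they happen to exist under that path as well.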

CharisWg commented 3 weeks ago

Thank you for your answer. I fixed the issue. You are doing well, and you explain things kindly and carefully.