Closed (kwanwoo02 closed this issue 1 year ago)
@kwanwoo02 I think training is running, but it cannot use your GPU properly because of a GPU architecture mismatch. To fix this, try installing the latest PyTorch (2.0.1) + CUDA (11.8) build with toolkit 12.2. It should then be able to use your A100 GPUs to the fullest, and you should also see the progress bar for the training.
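If it helps, here is a rough way to check for that kind of mismatch from Python. It is only a sketch: `arch_supported` and `report` are helper names made up for this example, and the exact-match test is a simplification (a build can sometimes still run on a GPU through a compatible architecture or PTX). The `torch.cuda` calls are standard PyTorch ones, and an A100 reports compute capability (8, 0), i.e. `sm_80`.

```python
def arch_supported(capability, arch_list):
    """Return True if the build's arch list has an exact sm_XY entry for the device.

    Simplified check (assumption): exact match only, ignores PTX fallback.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def report():
    """Print what the installed PyTorch build supports versus what each GPU needs."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    if not torch.cuda.is_available():
        print("CUDA is not available to PyTorch")
        return
    archs = torch.cuda.get_arch_list()  # architectures this build was compiled for
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)  # (8, 0) on an A100
        name = torch.cuda.get_device_name(i)
        print(i, name, cap, "supported:", arch_supported(cap, archs))


if __name__ == "__main__":
    report()
```

If any GPU shows `supported: False`, that would be consistent with the mismatch described above, and reinstalling a matching PyTorch build is the first thing to try.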
Also, if you are unsure whether training is running, you can always watch your CPU, RAM, or GPU usage.
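For the GPU part, something like the sketch below can poll `nvidia-smi` and summarise per-GPU load. The helper names (`parse_gpu_usage`, `looks_busy`) and the 10% threshold are assumptions chosen for this example, not part of the training code. It also assumes every queried field comes back as a plain integer.

```python
import subprocess

# Standard nvidia-smi query: one CSV row per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_usage(text):
    """Parse the CSV rows produced by QUERY into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def looks_busy(rows, min_util_pct=10):
    """Heuristic (assumption): any GPU at or above the threshold counts as busy."""
    return any(row["util_pct"] >= min_util_pct for row in rows)


if __name__ == "__main__":
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        rows = parse_gpu_usage(result.stdout)
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
        print("Could not read GPU usage from nvidia-smi")
    else:
        for row in rows:
            print(row)
        print("GPUs look busy:", looks_busy(rows))
```

Run it a few times while the job is active. If utilization and memory stay near zero on all four GPUs, training is probably not using them.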
TY @amangupta2303 !
Hello, I'm not sure whether training is in progress. Is there a way to check this?
I followed the steps in readme_create_mydataset.
After preparing the dataset that way, I ran this command:
python3 bin/train.py -cn big-lama location=my_dataset data.batch_size=10
My server has 4 A100 40GB GPUs. The log looks like this and nothing else changes. I want to know whether training is currently running.
The logs so far are as follows: