atnikos / teach

Official PyTorch implementation of the paper "TEACH: Temporal Action Compositions for 3D Humans" [3DV 2022]
https://teach.is.tue.mpg.de

Training Process #27

Closed XinandYu closed 1 year ago

XinandYu commented 1 year ago

"You are using a SMPL+H model, with only 10 shape coefficients." the waring comes when I process the dataset. Is that warning normal?

And when I retrained the model, the loss was always NaN. Is there something wrong with the dataset?

XinandYu commented 1 year ago

The training log is here:

[13/12/22 10:22:49][teach.callback.progress][INFO] - Training started
[13/12/22 10:23:42][teach.callback.progress][INFO] - Epoch 0: Train_rf 4.855e-01 Memory 9.7%
[13/12/22 10:24:34][teach.callback.progress][INFO] - Epoch 1: Train_rf 4.854e-01 Memory 9.8%
[13/12/22 10:25:27][teach.callback.progress][INFO] - Epoch 2: Train_rf 4.855e-01 Memory 9.7%
[13/12/22 10:26:19][teach.callback.progress][INFO] - Epoch 3: Train_rf 4.855e-01 Memory 9.7%
[13/12/22 10:27:12][teach.callback.progress][INFO] - Epoch 4: Train_rf 4.855e-01 Memory 9.8%
[13/12/22 10:28:04][teach.callback.progress][INFO] - Epoch 5: Train_rf 4.855e-01 Memory 10.0%
[13/12/22 10:28:59][teach.callback.progress][INFO] - Epoch 6: Train_rf 4.855e-01 Memory 9.8%
[13/12/22 10:29:51][teach.callback.progress][INFO] - Epoch 7: Train_rf 4.855e-01 Memory 9.8%
[13/12/22 10:30:44][teach.callback.progress][INFO] - Epoch 8: Train_rf 4.855e-01 Memory 9.8%
[13/12/22 10:31:36][teach.callback.progress][INFO] - Epoch 9: Train_rf 4.855e-01 Memory 9.8%
[13/12/22 10:32:29][teach.callback.progress][INFO] - Epoch 10: Train_rf 4.855e-01 Memory 9.8%
[13/12/22 10:33:22][teach.callback.progress][INFO] - Epoch 11: Train_rf 4.855e-01 Memory 9.8%
[13/12/22 10:34:15][teach.callback.progress][INFO] - Epoch 12: Train_rf 4.856e-01 Memory 9.8%
[13/12/22 10:35:07][teach.callback.progress][INFO] - Epoch 13: Train_rf 4.856e-01 Memory 9.8%
[13/12/22 10:36:00][teach.callback.progress][INFO] - Epoch 14: Train_rf 4.855e-01 Memory 9.8%
[13/12/22 10:36:53][teach.callback.progress][INFO] - Epoch 15: Train_rf 4.856e-01 Memory 9.8%
[13/12/22 10:37:46][teach.callback.progress][INFO] - Epoch 16: Train_rf 4.856e-01 Memory 9.8%
[13/12/22 10:38:39][teach.callback.progress][INFO] - Epoch 17: Train_rf 4.855e-01 Memory 9.9%
[13/12/22 10:39:32][teach.callback.progress][INFO] - Epoch 18: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:40:25][teach.callback.progress][INFO] - Epoch 19: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:41:18][teach.callback.progress][INFO] - Epoch 20: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:42:11][teach.callback.progress][INFO] - Epoch 21: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:43:04][teach.callback.progress][INFO] - Epoch 22: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:43:56][teach.callback.progress][INFO] - Epoch 23: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:44:49][teach.callback.progress][INFO] - Epoch 24: Train_rf 4.856e-01 Memory 9.9%
[13/12/22 10:45:42][teach.callback.progress][INFO] - Epoch 25: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:46:35][teach.callback.progress][INFO] - Epoch 26: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:47:28][teach.callback.progress][INFO] - Epoch 27: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:48:21][teach.callback.progress][INFO] - Epoch 28: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:49:14][teach.callback.progress][INFO] - Epoch 29: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:50:07][teach.callback.progress][INFO] - Epoch 30: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:51:00][teach.callback.progress][INFO] - Epoch 31: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:51:53][teach.callback.progress][INFO] - Epoch 32: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:52:46][teach.callback.progress][INFO] - Epoch 33: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:53:39][teach.callback.progress][INFO] - Epoch 34: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:54:32][teach.callback.progress][INFO] - Epoch 35: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:55:25][teach.callback.progress][INFO] - Epoch 36: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:56:18][teach.callback.progress][INFO] - Epoch 37: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:57:11][teach.callback.progress][INFO] - Epoch 38: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:58:04][teach.callback.progress][INFO] - Epoch 39: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:58:56][teach.callback.progress][INFO] - Epoch 40: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 10:59:49][teach.callback.progress][INFO] - Epoch 41: Train_rf 4.856e-01 Memory 9.9%
[13/12/22 11:00:42][teach.callback.progress][INFO] - Epoch 42: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:01:35][teach.callback.progress][INFO] - Epoch 43: Train_rf 4.856e-01 Memory 9.9%
[13/12/22 11:02:28][teach.callback.progress][INFO] - Epoch 44: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:03:21][teach.callback.progress][INFO] - Epoch 45: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:04:14][teach.callback.progress][INFO] - Epoch 46: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:05:07][teach.callback.progress][INFO] - Epoch 47: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:06:00][teach.callback.progress][INFO] - Epoch 48: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:06:53][teach.callback.progress][INFO] - Epoch 49: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:07:46][teach.callback.progress][INFO] - Epoch 50: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:08:39][teach.callback.progress][INFO] - Epoch 51: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:09:32][teach.callback.progress][INFO] - Epoch 52: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:10:24][teach.callback.progress][INFO] - Epoch 53: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:11:17][teach.callback.progress][INFO] - Epoch 54: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:12:10][teach.callback.progress][INFO] - Epoch 55: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:13:03][teach.callback.progress][INFO] - Epoch 56: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:13:56][teach.callback.progress][INFO] - Epoch 57: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:14:49][teach.callback.progress][INFO] - Epoch 58: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:15:42][teach.callback.progress][INFO] - Epoch 59: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:16:35][teach.callback.progress][INFO] - Epoch 60: Train_rf 4.856e-01 Memory 9.9%
[13/12/22 11:17:28][teach.callback.progress][INFO] - Epoch 61: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:18:21][teach.callback.progress][INFO] - Epoch 62: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:19:14][teach.callback.progress][INFO] - Epoch 63: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:20:07][teach.callback.progress][INFO] - Epoch 64: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:21:00][teach.callback.progress][INFO] - Epoch 65: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:21:53][teach.callback.progress][INFO] - Epoch 66: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:22:46][teach.callback.progress][INFO] - Epoch 67: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:23:39][teach.callback.progress][INFO] - Epoch 68: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:24:32][teach.callback.progress][INFO] - Epoch 69: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:25:24][teach.callback.progress][INFO] - Epoch 70: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:26:18][teach.callback.progress][INFO] - Epoch 71: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:27:11][teach.callback.progress][INFO] - Epoch 72: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:28:03][teach.callback.progress][INFO] - Epoch 73: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:28:56][teach.callback.progress][INFO] - Epoch 74: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:29:49][teach.callback.progress][INFO] - Epoch 75: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:30:42][teach.callback.progress][INFO] - Epoch 76: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:31:35][teach.callback.progress][INFO] - Epoch 77: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:32:28][teach.callback.progress][INFO] - Epoch 78: Train_rf 4.856e-01 Memory 9.9%
[13/12/22 11:33:21][teach.callback.progress][INFO] - Epoch 79: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:34:14][teach.callback.progress][INFO] - Epoch 80: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:35:07][teach.callback.progress][INFO] - Epoch 81: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:36:00][teach.callback.progress][INFO] - Epoch 82: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:36:54][teach.callback.progress][INFO] - Epoch 83: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:37:47][teach.callback.progress][INFO] - Epoch 84: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:38:39][teach.callback.progress][INFO] - Epoch 85: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:39:32][teach.callback.progress][INFO] - Epoch 86: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:40:25][teach.callback.progress][INFO] - Epoch 87: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:41:18][teach.callback.progress][INFO] - Epoch 88: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:42:11][teach.callback.progress][INFO] - Epoch 89: Train_rf 4.856e-01 Memory 9.9%
[13/12/22 11:43:04][teach.callback.progress][INFO] - Epoch 90: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:43:57][teach.callback.progress][INFO] - Epoch 91: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:44:50][teach.callback.progress][INFO] - Epoch 92: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:45:43][teach.callback.progress][INFO] - Epoch 93: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:46:36][teach.callback.progress][INFO] - Epoch 94: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:47:29][teach.callback.progress][INFO] - Epoch 95: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:48:22][teach.callback.progress][INFO] - Epoch 96: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:49:16][teach.callback.progress][INFO] - Epoch 97: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:50:09][teach.callback.progress][INFO] - Epoch 98: Train_rf 4.856e-01 Memory 10.0%
[13/12/22 11:51:18][teach.callback.progress][INFO] - Epoch 99: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 11:52:11][teach.callback.progress][INFO] - Epoch 100: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 11:53:05][teach.callback.progress][INFO] - Epoch 101: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 11:53:57][teach.callback.progress][INFO] - Epoch 102: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 11:54:50][teach.callback.progress][INFO] - Epoch 103: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 11:55:43][teach.callback.progress][INFO] - Epoch 104: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 11:56:36][teach.callback.progress][INFO] - Epoch 105: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 11:57:30][teach.callback.progress][INFO] - Epoch 106: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.3%
[13/12/22 11:58:23][teach.callback.progress][INFO] - Epoch 107: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 11:59:16][teach.callback.progress][INFO] - Epoch 108: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 12:00:09][teach.callback.progress][INFO] - Epoch 109: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 12:01:02][teach.callback.progress][INFO] - Epoch 110: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 12:01:55][teach.callback.progress][INFO] - Epoch 111: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%
[13/12/22 12:02:48][teach.callback.progress][INFO] - Epoch 112: Train_rf 4.856e-01 Val_rf 4.784e-01 Memory 11.2%

The log seems strange.

atnikos commented 1 year ago

"You are using a SMPL+H model, with only 10 shape coefficients." the waring comes when I process the dataset. Is that warning normal?

And when I retrained model, the loss was always Nan. Is there something wrong with the dataset?

This warning comes from the official SMPL package, so there is nothing to worry about there. I cannot see the NaN you are describing, but the loss is not decreasing, which is indeed strange. Can you post the command you use for training? I will try to check it locally and get back to you.
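For context, here is a rough sketch of where such a warning typically originates, assuming the loader is the smplx package (the path and gender below are placeholders, not the repository's actual config); the message is printed when the model file only ships 10 shape components and is purely informational:

import smplx

# Loading a SMPL+H model file that only contains 10 shape components makes the
# smplx loader print "You are using a SMPL+H model, with only 10 shape
# coefficients." This does not affect training.
body_model = smplx.create(
    model_path="path/to/body_models",  # placeholder path
    model_type="smplh",
    gender="male",                     # placeholder gender
    num_betas=10,
)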

XinandYu commented 1 year ago

I run the training process with:

python train.py

XinandYu commented 1 year ago

The NaN is the loss shown at the end of the progress bar; it seems it is not printed in the log. Anyway, there must be something wrong with the training.
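As a side note, a minimal sketch (not part of the TEACH code) for surfacing such a NaN immediately instead of only seeing it in the progress bar:

import torch

# Slower, but pinpoints the operation that first produces NaN/Inf gradients.
torch.autograd.set_detect_anomaly(True)

def check_finite(loss: torch.Tensor) -> torch.Tensor:
    # Call this on the loss inside the training step, before backward().
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss encountered: {loss.item()}")
    return loss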

And thanks for your response!

atnikos commented 1 year ago
[screenshot: training logs attached]

Can you post a screenshot of the full logs so I can check that everything is loaded and done properly, including the initial data loading etc.? I attached my logs; the loss is decreasing and training works normally.

XinandYu commented 1 year ago

Yeah, it seems something went wrong with my code. I use a larger batch size (16). The log looks like this: [screenshot: 2022-12-14 15-57-34]

atnikos commented 1 year ago

Seems like you have a very good GPU. 🔥 A bigger batch size should work (potentially even better), but you should scale the learning rate by a factor of 2 since you doubled the batch size. Try that and let me know.
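To illustrate the linear-scaling rule in code (a sketch; the base values below are placeholders, not the repository's actual defaults):

# Linear scaling rule: scale the learning rate by the same factor as the batch size.
base_batch_size = 8     # assumed original batch size
base_lr = 1.0e-4        # placeholder base learning rate
new_batch_size = 16

new_lr = base_lr * (new_batch_size / base_batch_size)
print(new_lr)  # 2e-4, i.e. doubled along with the batch size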

atnikos commented 1 year ago

Apart from the batch size, did you change anything else? Are you using the same PyTorch, Lightning, and torchmetrics versions? This is critical for the loss.
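A quick way to compare environments is to print the installed versions (standard __version__ attributes, nothing project-specific) and check them against the versions pinned by the repository's environment file:

import torch
import pytorch_lightning
import torchmetrics

# Print the versions that matter for the loss computation.
print("torch:", torch.__version__)
print("pytorch_lightning:", pytorch_lightning.__version__)
print("torchmetrics:", torchmetrics.__version__)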

XinandYu commented 1 year ago

Yeah, it was an error caused by the torchmetrics version. The correct torchmetrics version makes it work now (even though the torch version is 1.15, it still works). Thanks for your help.