Why is no trained model saved to best_model after running run-vnet.sh?

Lipanw commented 2 years ago

Why is no trained model saved to best_model after running run-vnet.sh? train.log is also completely empty.

Lipanw commented 2 years ago

I am running this on Windows 10.

shiyutang commented 2 years ago

How long did you let it run? Is there any other information you can share?

Lipanw commented 2 years ago

Traceback (most recent call last):
  File "train.py", line 204, in <module>
    main(args)
  File "train.py", line 198, in main
    to_static_training=cfg.to_static_training)
  File "F:\fuxianCode\MedicalSeg-develop\medicalseg\core\train.py", line 233, in train
    save_dir=save_dir)
  File "F:\fuxianCode\MedicalSeg-develop\medicalseg\core\val.py", line 151, in evaluate
    'format': "xyz"
  File "F:\fuxianCode\MedicalSeg-develop\medicalseg\utils\utils.py", line 244, in save_array
    img_itk_new = sitk.GetImageFromArray(val)
  File "D:\mysoftware\Anaconda\lib\site-packages\SimpleITK\extra.py", line 292, in GetImageFromArray
    id = _get_sitk_pixelid(z)
  File "D:\mysoftware\Anaconda\lib\site-packages\SimpleITK\extra.py", line 189, in _get_sitk_pixelid
    raise TypeError('dtype: {0} is not supported.'.format(numpy_array_type.dtype))
TypeError: dtype: int32 is not supported.

Lipanw commented 2 years ago

2022-04-27 17:47:02 [INFO] [TRAIN] epoch: 0, iter: 100/15000, loss: 2.4868, DSC: 4.1360, lr: 0.009941, batch_cost: 0.7021, reader_cost: 0.00082, ips: 1.4244 samples/sec | ETA 02:54:20
2022-04-27 17:48:13 [INFO] [TRAIN] epoch: 1, iter: 200/15000, loss: 1.1843, DSC: 4.3465, lr: 0.009881, batch_cost: 0.7081, reader_cost: 0.00062, ips: 1.4123 samples/sec | ETA 02:54:39
2022-04-27 17:49:24 [INFO] [TRAIN] epoch: 2, iter: 300/15000, loss: 1.1282, DSC: 4.3768, lr: 0.009820, batch_cost: 0.7096, reader_cost: 0.00016, ips: 1.4092 samples/sec | ETA 02:53:51
2022-04-27 17:50:35 [INFO] [TRAIN] epoch: 2, iter: 400/15000, loss: 1.1043, DSC: 4.3364, lr: 0.009760, batch_cost: 0.7107, reader_cost: 0.00047, ips: 1.4071 samples/sec | ETA 02:52:56
2022-04-27 17:51:46 [INFO] [TRAIN] epoch: 3, iter: 500/15000, loss: 1.0901, DSC: 4.3506, lr: 0.009700, batch_cost: 0.7109, reader_cost: 0.00047, ips: 1.4066 samples/sec | ETA 02:51:48
2022-04-27 17:51:46 [INFO] Start evaluating (total_samples: 5, total_iters: 5)...

Lipanw commented 2 years ago

Every time, training runs to iteration 500 and then stops as soon as model evaluation is about to begin.

linhandev commented 2 years ago

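It sounds like something is going wrong during validation. We have updated the code since this issue was opened, so you could pull the latest version and try again with a smaller save_interval.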

shiyutang commented 2 years ago

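The problem here is in the saving step during evaluation. For now you can comment out the save_array part and start training, then attach a link to the complete reproducible code, or a description of what you modified, in this issue.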
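Editorial note: as an alternative to commenting out save_array, one could try casting the array before it reaches sitk.GetImageFromArray. The sketch below is only an assumption-based workaround: it supposes the TypeError is caused by the installed SimpleITK build on Windows not recognizing the array's 32-bit integer dtype, which is not confirmed anywhere in this thread. The helper name `to_unambiguous_int` is hypothetical and not part of MedicalSeg or SimpleITK.

```python
import numpy as np


def to_unambiguous_int(arr):
    """Widen 32-bit signed integer arrays to int64; return other arrays unchanged.

    Assumption: the failure is specific to 32-bit signed integers, so a
    lossless widening cast sidesteps it. Float and 8/16/64-bit arrays are
    left as they are.
    """
    arr = np.asarray(arr)
    if arr.dtype.kind == "i" and arr.dtype.itemsize == 4:
        return arr.astype(np.int64)
    return arr


if __name__ == "__main__":
    pred = np.array([[0, 1], [2, 1]], dtype=np.int32)  # toy label map
    print(to_unambiguous_int(pred).dtype)
```

In save_array, the result of such a cast would be passed to sitk.GetImageFromArray in place of val. Whether int64 is the right target type for the saved volumes depends on how the files are consumed later.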

Lipanw commented 2 years ago

2022-05-06 14:56:46 [INFO] [TRAIN] epoch: 4, iter: 100/15000, loss: 4.4847, DSC: 3.7124, lr: 0.000994, batch_cost: 6.5770, reader_cost: 2.26782, ips: 0.9123 samples/sec | ETA 27:13:17
Hello, the previous problem has been solved. But compared with the lr=0.001 example on your home page, why is the DSC so low? The loss is also very high.

Lipanw commented 2 years ago

Here is my configuration:
------------Environment Information-------------
platform: Linux-4.15.0-158-generic-x86_64-with-debian-stretch-sid
Python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Paddle compiled with cuda: True
NVCC: Build cuda_11.2.r11.2/compiler.29618528_0
cudnn: 8.2
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: A100-SXM4-40GB (UUID:']
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~16.04) 7.5.0
PaddlePaddle: 2.2.2

2022-05-06 14:45:45 [INFO]
---------------Config Information---------------
batch_size: 6
data_root: tools/data
iters: 15000
loss:
  coef:
  - 1
  types:
  - coef:
    - 1
    - 1
    losses:
    - type: CrossEntropyLoss
      weight: null
    - type: DiceLoss
    type: MixedLoss
lr_scheduler:
  decay_steps: 15000
  end_lr: 0
  learning_rate: 0.001
  power: 0.9
  type: PolynomialDecay
model:
  elu: false
  in_channels: 1
  num_classes: 3
  pretrained: null
  type: VNet
optimizer:
  momentum: 0.9
  type: sgd
  weight_decay: 0.0001
train_dataset:
  dataset_root: lung_coronavirus/lung_coronavirus_phase0
  mode: train
  num_classes: 3
  result_dir: lung_coronavirus/lung_coronavirus_phase1
  transforms:
  - scale:
    - 0.8
    - 1.2
    size: 128
    type: RandomResizedCrop3D
  - degrees: 90
    type: RandomRotation3D
  - type: RandomFlip3D
  type: LungCoronavirus
val_dataset:
  dataset_json_path: lung_coronavirus/lung_coronavirus_raw/dataset.json
  dataset_root: lung_coronavirus/lung_coronavirus_phase0
  mode: val
  num_classes: 3
  result_dir: lung_coronavirus/lung_coronavirus_phase1
  transforms: []
  type: LungCoronavirus

2022-05-06 14:56:46 [INFO] [TRAIN] epoch: 4, iter: 100/15000, loss: 4.4847, DSC: 3.7124, lr: 0.000994, batch_cost: 6.5770, reader_cost: 2.26782, ips: 0.9123 samples/sec | ETA 27:13:17
2022-05-06 15:07:41 [INFO] [TRAIN] epoch: 8, iter: 200/15000, loss: 3.5398, DSC: 3.8685, lr: 0.000988, batch_cost: 6.5488, reader_cost: 2.25564, ips: 0.9162 samples/sec | ETA 26:55:22
2022-05-06 15:18:36 [INFO] [TRAIN] epoch: 12, iter: 300/15000, loss: 2.8668, DSC: 3.9746, lr: 0.000982, batch_cost: 6.5445, reader_cost: 2.25206, ips: 0.9168 samples/sec | ETA 26:43:24

linhandev commented 2 years ago

The lr could probably be set somewhat larger.
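Editorial note: to see how the base learning rate maps onto the lr values printed in the logs, here is a minimal sketch of the usual polynomial decay formula, using decay_steps, end_lr and power from the posted config. The function name `polynomial_decay_lr` is hypothetical, and the step offset (the rate logged at iter 100 corresponding to scheduler step 99) is an inference from the numbers, not something stated in the thread. Under those assumptions, a base rate of 0.001 gives roughly 0.000994, 0.000988 and 0.000982 at the three logged iterations, matching the second log.

```python
def polynomial_decay_lr(base_lr, step, decay_steps, end_lr=0.0, power=0.9):
    """Learning rate after `step` scheduler steps under polynomial decay.

    lr = (base_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr
    """
    step = min(step, decay_steps)  # stay at end_lr once decay_steps is reached
    frac = 1.0 - step / decay_steps
    return (base_lr - end_lr) * frac ** power + end_lr


if __name__ == "__main__":
    # Compare two candidate base rates over the first few logged iterations.
    for base_lr in (0.001, 0.01):
        for step in (99, 199, 299):
            lr = polynomial_decay_lr(base_lr, step, decay_steps=15000)
            print(f"base_lr={base_lr}, step={step}: lr={lr:.6f}")
```

Raising learning_rate in the lr_scheduler section scales the whole schedule by the same factor, since end_lr is 0 in this config.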

shiyutang commented 2 years ago

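Please keep each issue to a single problem. Also, this looks like a data problem. Did you modify any of the data-processing code? Or could you list all the changes you made?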

PaddleCV-SIG / MedicalSeg

Why is no trained model saved to best_model after running run-vnet.sh? #71
