ML-TANGO / TANGO

public repo for TANGO (Target Aware No-code neural network Generation and Operation framework)

API status check loops indefinitely during the AutoNN stage #128

Closed sejeong98105 closed 7 months ago

sejeong98105 commented 7 months ago

When I run the AutoNN service, the API status check just keeps repeating and the AutoNN service never terminates.

Below is the log produced when AutoNN runs.

yoloe- status_request response : running
2023-12-04T04:30:11.885048073Z _________GET /start_____________
2023-12-04T04:30:11.890739950Z new user or project
2023-12-04T04:30:11.896907894Z /shared/datasets/coco/dataset.yaml /shared/common/user01/8/project_info.yaml
2023-12-04T04:30:11.896946907Z 1-th process is starting
2023-12-04T04:30:11.929382642Z PROCESS ID TYPE CHECK(before save):  <class 'str'> 589509
2023-12-04T04:30:11.933765364Z PROCESS ID TYPE CHECK(after save) :  <class 'str'> 589509
2023-12-04T04:30:11.940405699Z [04/Dec/2023 04:30:11] "GET /start?user_id=user01&project_id=8 HTTP/1.1" 200 9
2023-12-04T04:30:11.945199650Z {'acc': 'cuda', 'batchsize': 255, 'cpu': 'x86', 'dataset': 'coco', 'engine': 'pytorch', 'input_source': 0, 'lightweight_level': 5, 'memory': 78, 'nfs_ip': None, 'nfs_path': None, 'os': 'ubuntu', 'output_method': 0, 'precision_level': 5, 'target_hostip': None, 'target_hostport': None, 'target_info': 'PC', 'target_serviceport': None, 'task_type': 'detection', 'user_editing': False}
2023-12-04T04:30:12.054042615Z {'acc': 'cuda', 'batchsize': 255, 'cpu': 'x86', 'dataset': 'coco', 'engine': 'pytorch', 'input_source': 0, 'lightweight_level': 5, 'memory': 78, 'nfs_ip': None, 'nfs_path': None, 'os': 'ubuntu', 'output_method': 0, 'precision_level': 5, 'target_hostip': None, 'target_hostport': None, 'target_info': 'PC', 'target_serviceport': None, 'task_type': 'detection', 'user_editing': False}
2023-12-04T04:30:16.781902142Z _________GET /status_request_____________
2023-12-04T04:30:16.787420420Z found thread running yoloe
2023-12-04T04:30:16.792924597Z [04/Dec/2023 04:30:16] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:30:18.668003768Z YOLOR 🚀 2023-11-30 torch 1.13.1+cu117 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)
2023-12-04T04:30:18.668050396Z                                       CUDA:1 (Tesla V100-SXM2-16GB, 16160.5MB)
2023-12-04T04:30:18.668060785Z                                       CUDA:2 (Tesla V100-SXM2-16GB, 16160.5MB)
2023-12-04T04:30:18.668069169Z                                       CUDA:3 (Tesla V100-SXM2-16GB, 16160.5MB)
2023-12-04T04:30:18.668077116Z                                       CUDA:4 (Tesla V100-SXM2-16GB, 16160.5MB)
2023-12-04T04:30:18.668085119Z 
2023-12-04T04:30:18.668803687Z Namespace(adam=False, artifact_alias='latest', batch_size=255, bbox_interval=-1, bucket='', cache_images=False, cfg='/shared/common/user01/8/basemodel.yaml', data='/shared/datasets/coco/dataset.yaml', device='0,1,2,3,4', entity='None', epochs=1, evolve=False, exist_ok=False, freeze=[0], global_rank=-1, hyp=PosixPath('/source/yoloe_core/yolov7_utils/data/hyp.scratch.p5.yaml'), image_weights=False, img_size=[640, 640], label_smoothing=0.0, linear_lr=False, local_rank=-1, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp7', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=255, upload_dataset=False, v5_metric=False, weights='', workers=8, world_size=1)
2023-12-04T04:30:18.669013651Z tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
2023-12-04T04:30:20.107816176Z hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.3, cls_pw=1.0, obj=0.7, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.2, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.0, paste_in=0.15, loss_ota=1
2023-12-04T04:30:20.138384346Z wandb: Install Weights & Biases for YOLOR logging with 'pip install wandb' (recommended)
2023-12-04T04:30:20.180976025Z 
2023-12-04T04:30:20.181011033Z                  from  n    params  module                                  arguments                     
2023-12-04T04:30:20.184765637Z   0                -1  1       928  models.common.Conv                      [3, 32, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.185222343Z   1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.185592537Z   2                -1  1      2112  models.common.Conv                      [64, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.186016556Z   3                -2  1      2112  models.common.Conv                      [64, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.186350871Z   4                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.186746301Z   5                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.186830332Z   6  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.187247721Z   7                -1  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.187373415Z   8                -1  1         0  models.common.MP                        []                            
2023-12-04T04:30:20.187722544Z   9                -1  1      4224  models.common.Conv                      [64, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.188091556Z  10                -2  1      4224  models.common.Conv                      [64, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.188867921Z  11                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.189312434Z  12                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.189383665Z  13  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.189965611Z  14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.190075692Z  15                -1  1         0  models.common.MP                        []                            
2023-12-04T04:30:20.190552270Z  16                -1  1     16640  models.common.Conv                      [128, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.190994072Z  17                -2  1     16640  models.common.Conv                      [128, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.192643843Z  18                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.194073721Z  19                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.194118986Z  20  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.195398465Z  21                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.195433741Z  22                -1  1         0  models.common.MP                        []                            
2023-12-04T04:30:20.196320372Z  23                -1  1     66048  models.common.Conv                      [256, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.197136229Z  24                -2  1     66048  models.common.Conv                      [256, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.201475014Z  25                -1  1    590336  models.common.Conv                      [256, 256, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.206228152Z  26                -1  1    590336  models.common.Conv                      [256, 256, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.206263727Z  27  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.210513626Z  28                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.211910965Z  29                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.213092290Z  30                -2  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.213216113Z  31                -1  1         0  models.common.SP                        [5]                           
2023-12-04T04:30:20.213286446Z  32                -2  1         0  models.common.SP                        [9]                           
2023-12-04T04:30:20.213380577Z  33                -3  1         0  models.common.SP                        [13]                          
2023-12-04T04:30:20.213442578Z  34  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.215733352Z  35                -1  1    262656  models.common.Conv                      [1024, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.215828801Z  36          [-1, -7]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.217223393Z  37                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.217708372Z  38                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.217846905Z  39                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
2023-12-04T04:30:20.218431041Z  40                21  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.218504837Z  41          [-1, -2]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.218973571Z  42                -1  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.219436841Z  43                -2  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.220065448Z  44                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.220677262Z  45                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.220758357Z  46  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.221357585Z  47                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.221748433Z  48                -1  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.221865897Z  49                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
2023-12-04T04:30:20.222310134Z  50                14  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.222388880Z  51          [-1, -2]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.222756352Z  52                -1  1      4160  models.common.Conv                      [128, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.223131172Z  53                -2  1      4160  models.common.Conv                      [128, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.223532576Z  54                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.223940778Z  55                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.224022153Z  56  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.224414039Z  57                -1  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.225293184Z  58                -1  1     73984  models.common.Conv                      [64, 128, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.225372996Z  59          [-1, 47]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.225835363Z  60                -1  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.226305796Z  61                -2  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.226913290Z  62                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.227518082Z  63                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.227599219Z  64  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.228232902Z  65                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.230702082Z  66                -1  1    295424  models.common.Conv                      [128, 256, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.230799268Z  67          [-1, 37]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.231606883Z  68                -1  1     65792  models.common.Conv                      [512, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.232461108Z  69                -2  1     65792  models.common.Conv                      [512, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.233869962Z  70                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.235296791Z  71                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.235373488Z  72  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
2023-12-04T04:30:20.236795540Z  73                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.237571946Z  74                57  1     73984  models.common.Conv                      [64, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.240049691Z  75                65  1    295424  models.common.Conv                      [128, 256, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.248057005Z  76                73  1   1180672  models.common.Conv                      [256, 512, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
2023-12-04T04:30:20.251063506Z  77      [74, 75, 76]  1    230906  yoloe_core.yolov7_utils.models.yolo.IDetect[80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
2023-12-04T04:30:20.344889161Z Model Summary: 263 layers, 6228762 parameters, 6228762 gradients
2023-12-04T04:30:20.344917314Z 
2023-12-04T04:30:21.797998540Z _________GET /status_request_____________
2023-12-04T04:30:21.802806193Z found thread running yoloe
2023-12-04T04:30:21.806000703Z [04/Dec/2023 04:30:21] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:30:22.146079155Z Scaled weight_decay = 0.0019921875
2023-12-04T04:30:22.149485405Z Optimizer groups: 58 .bias, 58 conv.weight, 61 other
2023-12-04T04:30:22.182514113Z /shared/datasets/coco/images/train2017
2023-12-04T04:30:22.183535247Z /shared/datasets/coco/labels/train2017.cache
2023-12-04T04:30:22.187035725Z 
train: Scanning '/shared/datasets/coco/labels/train2017.cache' images and labels... 126 found, 2 missing, 0 empty, 0 corrupted: 100%|██████████| 128/128 [00:00<?, ?it/s]
train: Scanning '/shared/datasets/coco/labels/train2017.cache' images and labels... 126 found, 2 missing, 0 empty, 0 corrupted: 100%|██████████| 128/128 [00:00<?, ?it/s]
2023-12-04T04:30:22.450767264Z /shared/datasets/coco/labels/val2017.cache
2023-12-04T04:30:22.452156906Z 
val: Scanning '/shared/datasets/coco/labels/val2017.cache' images and labels... 1 found, 0 missing, 0 empty, 0 corrupted: 100%|██████████| 1/1 [00:00<?, ?it/s]
val: Scanning '/shared/datasets/coco/labels/val2017.cache' images and labels... 1 found, 0 missing, 0 empty, 0 corrupted: 100%|██████████| 1/1 [00:00<?, ?it/s]
2023-12-04T04:30:22.456168777Z 
2023-12-04T04:30:22.485197186Z autoanchor: Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
2023-12-04T04:30:22.503216155Z Image sizes 640 train, 640 test
2023-12-04T04:30:22.503243058Z Using 8 dataloader workers
2023-12-04T04:30:22.503247355Z Logging results to runs/train/exp7
2023-12-04T04:30:22.503250574Z Starting training for 1 epochs...
2023-12-04T04:30:22.567480107Z 
2023-12-04T04:30:22.567509080Z      Epoch   gpu_mem       box       obj       cls     total    labels  img_size
2023-12-04T04:30:27.517549410Z _________GET /status_request_____________
2023-12-04T04:30:27.522335032Z found thread running yoloe
2023-12-04T04:30:27.525959869Z 
  0%|          | 0/1 [00:00<?, ?it/s][04/Dec/2023 04:30:27] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:30:32.503983202Z _________GET /status_request_____________
2023-12-04T04:30:32.507280049Z found thread running yoloe
2023-12-04T04:30:32.511110141Z [04/Dec/2023 04:30:32] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:30:37.519344316Z _________GET /status_request_____________
2023-12-04T04:30:37.524103860Z found thread running yoloe
2023-12-04T04:30:37.527315411Z [04/Dec/2023 04:30:37] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:30:42.528501565Z _________GET /status_request_____________
2023-12-04T04:30:42.533544841Z found thread running yoloe
2023-12-04T04:30:42.536408212Z [04/Dec/2023 04:30:42] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:30:47.523453825Z _________GET /status_request_____________
2023-12-04T04:30:47.528396995Z found thread running yoloe
2023-12-04T04:30:47.532118022Z [04/Dec/2023 04:30:47] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:30:52.525611947Z _________GET /status_request_____________
2023-12-04T04:30:52.530483399Z found thread running yoloe
2023-12-04T04:30:52.533713550Z [04/Dec/2023 04:30:52] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:30:57.515998936Z _________GET /status_request_____________
2023-12-04T04:30:57.521377835Z found thread running yoloe
2023-12-04T04:30:57.524758515Z [04/Dec/2023 04:30:57] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:02.574110126Z _________GET /status_request_____________
2023-12-04T04:31:02.580010030Z found thread running yoloe
2023-12-04T04:31:02.583730767Z [04/Dec/2023 04:31:02] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:07.523316001Z _________GET /status_request_____________
2023-12-04T04:31:07.528081198Z found thread running yoloe
2023-12-04T04:31:07.531503393Z [04/Dec/2023 04:31:07] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:12.543751611Z _________GET /status_request_____________
2023-12-04T04:31:12.548536315Z found thread running yoloe
2023-12-04T04:31:12.552105626Z [04/Dec/2023 04:31:12] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:16.795070220Z _________GET /status_request_____________
2023-12-04T04:31:16.799783400Z found thread running yoloe
2023-12-04T04:31:16.803267779Z [04/Dec/2023 04:31:16] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:21.798406730Z _________GET /status_request_____________
2023-12-04T04:31:21.804842581Z found thread running yoloe
2023-12-04T04:31:21.807498948Z [04/Dec/2023 04:31:21] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:26.800366116Z _________GET /status_request_____________
2023-12-04T04:31:26.807110703Z found thread running yoloe
2023-12-04T04:31:26.811152087Z [04/Dec/2023 04:31:26] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:31.808306711Z _________GET /status_request_____________
2023-12-04T04:31:31.813065274Z found thread running yoloe
2023-12-04T04:31:31.816523792Z [04/Dec/2023 04:31:31] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:36.795474085Z _________GET /status_request_____________
2023-12-04T04:31:36.800358895Z found thread running yoloe
2023-12-04T04:31:36.803633572Z [04/Dec/2023 04:31:36] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:41.796601340Z _________GET /status_request_____________
2023-12-04T04:31:41.801367495Z found thread running yoloe
2023-12-04T04:31:41.804911790Z [04/Dec/2023 04:31:41] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:46.797726802Z _________GET /status_request_____________
2023-12-04T04:31:46.802546371Z found thread running yoloe
2023-12-04T04:31:46.806782167Z [04/Dec/2023 04:31:46] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:51.803608412Z _________GET /status_request_____________
2023-12-04T04:31:51.808255326Z found thread running yoloe
2023-12-04T04:31:51.811666379Z [04/Dec/2023 04:31:51] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:31:56.795605913Z _________GET /status_request_____________
2023-12-04T04:31:56.800279993Z found thread running yoloe
2023-12-04T04:31:56.803731470Z [04/Dec/2023 04:31:56] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:32:01.798397371Z _________GET /status_request_____________
2023-12-04T04:32:01.803072857Z found thread running yoloe
2023-12-04T04:32:01.806488474Z [04/Dec/2023 04:32:01] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:32:06.796752103Z _________GET /status_request_____________
2023-12-04T04:32:06.802290005Z found thread running yoloe
2023-12-04T04:32:06.805316259Z [04/Dec/2023 04:32:06] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:32:12.522723576Z _________GET /status_request_____________
2023-12-04T04:32:12.527493318Z found thread running yoloe
2023-12-04T04:32:12.531449519Z [04/Dec/2023 04:32:12] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
yoloe- status_request response : running
2023-12-04T04:32:17.522074741Z _________GET /status_request_____________
2023-12-04T04:32:17.526749307Z found thread running yoloe
2023-12-04T04:32:17.530105057Z [04/Dec/2023 04:32:17] "GET /status_request?user_id=user01&project_id=8 HTTP/1.1" 200 9
iksooman commented 7 months ago

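There is currently no UI for stopping training; that feature is planned. For now, to stop training, run the following in a terminal from the TANGO folder:

docker-compose down

This brings the containers down, which also stops the training process.

The status is currently reported as running, but it would be worth checking with the nvidia-smi command whether the process is actually still alive.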
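As a complement to nvidia-smi, the PID printed in the log (589509 in the "PROCESS ID TYPE CHECK" lines) can be probed directly. The following is a minimal sketch, not TANGO code: the helper name `is_process_alive` is hypothetical, it assumes a POSIX system, and it has to run in the same PID namespace as the training process (for example, inside the AutoNN container).

```python
import os


def is_process_alive(pid) -> bool:
    """Return True if a process with this PID exists (POSIX only).

    Hypothetical helper for illustration; it is not part of TANGO.
    """
    pid = int(pid)  # the log shows the PID being stored as a str
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    try:
        # Signal 0 delivers nothing; it only checks existence and permission.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

For example, `is_process_alive("589509")` returning False while /status_request still answers "running" would suggest that the reported status is stale. Note that a zombie process that has not been reaped yet still counts as existing under this check.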

sejeong98105 commented 7 months ago

I confirmed with nvidia-smi that the process has died, but I don't know what caused it.

iksooman commented 7 months ago

My current guess is that the error occurs because the actual number of data samples is small relative to the batch size that BMS set automatically. If possible, please test with a single GPU: change NVIDIA_VISIBLE_DEVICES to 0 on lines 100 and 198 of docker-compose.yml as shown below, then run docker-compose up -d.

NVIDIA_VISIBLE_DEVICES=0
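To make the guess above concrete: the log shows batch_size=255, 128 training images in the cache scan, and five visible GPUs (device='0,1,2,3,4'). The sketch below shows one way such a mismatch could be guarded against before training starts. It is only an illustration under stated assumptions: the function name `safe_batch_size` is hypothetical, it is not how BMS or AutoNN actually pick the batch size, and the rule of rounding down to a multiple of the GPU count is an assumption, not a confirmed requirement of the training code.

```python
def safe_batch_size(requested: int, num_images: int, num_gpus: int = 1) -> int:
    """Cap a requested batch size by the dataset size.

    Hypothetical helper for illustration; it is not part of TANGO.
    The result is at most num_images and a multiple of num_gpus.
    """
    if requested < 1 or num_gpus < 1:
        raise ValueError("requested and num_gpus must be positive")
    if num_images < num_gpus:
        raise ValueError("dataset has fewer images than GPUs")
    capped = min(requested, num_images)  # never ask for more than exists
    capped -= capped % num_gpus          # round down to a multiple of num_gpus
    return max(capped, num_gpus)         # keep at least one sample per GPU


# Values taken from the log in this issue.
print(safe_batch_size(255, 128, 5))  # 125
print(safe_batch_size(255, 128, 1))  # 128
```

With a single GPU, as suggested in the comment above, the cap reduces to the dataset size alone.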