Closed xipengshen closed 2 years ago
@EthanGuan Can you take a look?
I found the cause of that error: the custom AI training script was written against an out-of-date version of our standard. After I updated the script, this error went away, but a new error appeared. I'll close this issue and open a new one for the new error.
Problem
When I ran the compatibility test of the latest XGen (the build without the expiration problem) on a custom AI model, I encountered a model loading error.
Files to reproduce
The custom AI model is the mnist model that Pu posted: https://github.com/CoCoPIE-Group/XGen-Report/files/9693404/xgen_mnist.zip
Log
Please choose your device(s):
Please choose the base model (and the default dataset) to start with:
Image classification I: EfficientNet (ImageNet)
Image classification II: ResNet (ImageNet)
Image classification III: ResNet (Cifar)
Image classification IV: MobileNet (ImageNet)
Object detection: Yolov5 (CoCo)
Segmentation: UNet (ISBI-2012)
Video classification: R2+1d (UCF101)
Video super resolution: WDSR (DIV2K)
Have you set up the environment needed to train your model in this container?
Have you revised your training script for XGen by following the XGen manual?
If you haven't run the compatibility testing, you are recommended to select that option in the following questions.
XGen config file (absolute path): /root/Projects/xgen_mnist/xgen.json
Training script folder (absolute path): /root/Projects/xgen_mnist/
What do you want to do?
You chose "Compatibility test". We will create a temporary workplace for you, and will delete it after the testing.
2022-10-03 14:00:15,181 - root - INFO - AIMET
Your current workplace is /tmp/b3cc6ec24c4646129820d4d72401239f
A new search is started!
****config summary****
xgen-config-path: /tmp/b3cc6ec24c4646129820d4d72401239f/xgen_config.json
xgen-workplace: /tmp/b3cc6ec24c4646129820d4d72401239f
xgen-resume: False
xgen-mode: compatible_testing
xgen-pretrained-model-path: ./checkpoint/ckpt.pth
Detail args: {'origin': {'common_train_epochs': 200, 'root_path': './workplace/', 'pretrain_model_weights_path': None, 'train_data_path': './data', 'train_label_path': None, 'eval_data_path': './data', 'eval_label_path': None, 'scaling_factor': 2, 'num_classes': 10, 'batch_size': 128}, 'general': {'user_id': 'test', 'work_place': '/tmp/b3cc6ec24c4646129820d4d72401239f', 'random_seed': None, 'enable_ddp': False, 'CUDA_VISIBLE_DEVICES': '0', 'tran_scripts_path': None}, 'prune': {'sp_store_weights': None, 'sp_lars': False, 'sp_lars_trust_coef': 0.001, 'sp_backbone': False, 'sp_retrain': False, 'sp_admm': False, 'sp_admm_multi': False, 'sp_retrain_multi': False, 'sp_config_file': None, 'sp_subset_progressive': False, 'sp_admm_fixed_params': False, 'sp_no_harden': False, 'nv_sparse': False, 'sp_load_prune_params': None, 'sp_store_prune_params': None, 'generate_rand_seq_gap_yaml': False, 'sp_admm_update_epoch': 5, 'sp_admm_update_batch': None, 'sp_admm_rho': 0.001, 'sparsity_type': 'block_punched', 'sp_admm_lr': 0.01, 'admm_debug': False, 'sp_global_weight_sparsity': False, 'sp_prune_threshold': -1.0, 'sp_block_irregular_sparsity': '(0,0)', 'sp_block_permute_multiplier': 2, 'sp_admm_block': '(8,4)', 'sp_admm_buckets_num': 16, 'sp_admm_elem_per_row': 1, 'sp_admm_tile': None, 'sp_admm_select_number': 4, 'sp_admm_pattern_row_sub': 1, 'sp_admm_pattern_col_sub': 4, 'sp_admm_data_format': None, 'sp_admm_do_not_permute_conv': False, 'sp_gs_output_v': None, 'sp_gs_output_ptr': None,
'sp_load_frozen_weights': None, 'retrain_mask_pattern': 'weight', 'sp_update_init_method': 'weight', 'sp_mask_update_freq': 10, 'retrain_mask_sparsity': -1.0, 'retrain_mask_seed': None, 'sp_prune_before_retrain': False, 'output_compressed_format': False, 'sp_grad_update': False, 'sp_grad_decay': 0.98, 'sp_grad_restore_threshold': -1, 'sp_global_magnitude': False, 'sp_pre_defined_mask_dir': None, 'sp_prune_ratios': 0, 'admm_sparsity_type': 'block_punched', 'admm_block': '(8,4)', 'prune_threshold': -1.0}, 'quantization': {'qt_aimet': False, 'qat': True, 'fold_layers': True, 'cross_layer_equalization': False, 'bias_correction': True, 'rounding_mode': 'nearest', 'num_quant_samples': 1000, 'num_bias_correct_samples': 1000, 'weight_bw': 8, 'act_bw': 8, 'quant_scheme': 'tf_enhanced', 'layers_to_ignore': [], 'auto_add_bias': True, 'perform_only_empirical_bias_corr': True}, 'task': {'specific_scenarios': 'BasicTest', 'pretrained_model_path': './checkpoint/ckpt.pth', 'state': {'stage': 0, 'cycles': 0}, 'max_searching': 10}, 'user_requirements': {'power': None, 'accuracy': None, 'accuracy_reverse_yn': 0, 'model_size': None, 'memory_size': None, 'latency': 0.1, 'margin': 0.1, 'primary_type': 'latency', 'primary_range': '>', 'secondary_type': 'accuracy', 'secondary_range': '<', 'searching_variable': 'scaling_factor', 'searching_range': [1, 23], 'searching_step_size': 1, 'target_type': 'latency'}, 'train': {'common_save_best_yn': 1, 'trained_yn': False, 'larger_better': True}, 'compiler': {'input_shape': '(1, 1, 28, 28)', 'opset_version': 11, 'devices': ['RF8M428R6AK']}, 'distillation': {'distillation_method': None, 'enable_ddp': False, 'enable_dp': False, 'input_shape': None, 'original_loss_weights': 0.1, 'tag_loss_weights': 0.9, 'tag_loss': 'kl', 'tag_temperature': 4, 'tag_loss_combination_method': 'avg', 'feature_loss_weights': 0.9, 'feature_default_temperature': 1, 'advance_feature_mapping': {}, 'regularization_loss_weights': 1, 'regularization_loss_types': [], 
'discriminator_lr': 0.0001}}
Current search has 1 stages
Stage: 1 Max search cycles: 1
****config summary****
Current state: stage: 1/1| cycles: 1/1
Total jobs:1
processing job 1/1
Training...
MKL_THREADING_LAYER=GNU CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 python train_script_main.py
2022-10-03 14:00:16,288 - root - INFO - AIMET
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
9913344it [00:00, 33112378.88it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
29696it [00:00, 19239118.26it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
1649664it [00:00, 10896211.41it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
5120it [00:00, 16869470.92it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
/root/miniconda3/envs/xgen/lib/python3.7/site-packages/torchvision/datasets/mnist.py:498: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:180.)
return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Traceback (most recent call last):
File "train_script_main.py", line 151, in <module>
training_main(args_ai)
File "train_script_main.py", line 130, in training_main
xgen_load(model,args_ai=args_ai)
File "/root/miniconda3/envs/xgen/lib/python3.7/site-packages/xgen_tools-1.0.1-py3.7.egg/xgen_tools/xgen_load.py", line 263, in xgen_load
Exception: args_ai or path can not be both none
Traceback (most recent call last):
File "xgen_scripts.py", line 147, in main
xgen(training_main, run, training_script_path=training_script_path, log_path=log_path)
File "/root/miniconda3/envs/xgen/lib/python3.7/site-packages/xgen_main-1.0.17-py3.7.egg/xgen/xgen_run.py", line 364, in xgen
internal_data = train_module(job, training_main)
File "/root/miniconda3/envs/xgen/lib/python3.7/site-packages/xgen_main-1.0.17-py3.7.egg/xgen/train_module.py", line 169, in train_module
args_ai = model_train_main(job, training_main)
File "/root/miniconda3/envs/xgen/lib/python3.7/site-packages/xgen_main-1.0.17-py3.7.egg/xgen/train_module.py", line 148, in model_train_main
raise Exception('Training failed')
Exception: Training failed
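For anyone reading the first traceback: the message "args_ai or path can not be both none" suggests that `xgen_load` needs at least one of `args_ai` or `path`, and that `args_ai` was `None` when the training script called it. The snippet below is only a rough sketch of that kind of guard, not the actual `xgen_tools` source (which is not shown in this issue). The function name `resolve_checkpoint_source`, the precedence of `path` over `args_ai`, and the `args_ai["task"]["pretrained_model_path"]` lookup are all assumptions made for illustration; the key names are borrowed from the "Detail args" dump in the log.

```python
from typing import Any, Dict, Optional


def resolve_checkpoint_source(
    args_ai: Optional[Dict[str, Any]] = None,
    path: Optional[str] = None,
) -> str:
    """Hypothetical stand-in for the argument check seen in the traceback.

    Returns the checkpoint path that a loader would read from.
    """
    # Mirror the failure in the log: at least one source must be provided.
    if args_ai is None and path is None:
        raise ValueError("args_ai or path can not be both none")

    # Assumption: an explicit path takes precedence over the config dict.
    if path is not None:
        return path

    # Assumption: the config layout follows the 'task' section of the
    # "Detail args" dump printed in the log.
    return args_ai["task"]["pretrained_model_path"]


if __name__ == "__main__":
    # A script that forwards a missing config hits the same kind of error.
    try:
        resolve_checkpoint_source(args_ai=None)
    except ValueError as err:
        print(f"load failed: {err}")

    # Passing the config (or an explicit path) avoids it.
    config = {"task": {"pretrained_model_path": "./checkpoint/ckpt.pth"}}
    print(resolve_checkpoint_source(args_ai=config))
```

If the real loader works along these lines, a training script that still follows the older standard would hit this error whenever it passes an unset `args_ai` through to `xgen_load`.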