xipengshen commented 2 years ago

Problem

When I ran the compatibility test of the latest XGen (the one without the expiration problem) on a custom AI model, I encountered a model loading error.

Files to reproduce

The custom AI model is the mnist model that Pu posted: https://github.com/CoCoPIE-Group/XGen-Report/files/9693404/xgen_mnist.zip

Log

Please choose your device(s):

[*] RF8M428R6AK model:SM_G973U
Press or for multi-selection, and or letter key and to move, to accept.

Please choose the base model (and the default dataset) to start with:
Image classification I: EfficientNet (ImageNet)
Image classification II: ResNet (ImageNet)
Image classification III: ResNet (Cifar)
Image classification IV: MobileNet (ImageNet)
Object detection: Yolov5 (CoCo)
Segmentation: UNet (ISBI-2012)
Video classification: R2+1d (UCF101)
Video super resolution: WDSR (DIV2K)

Your own model

Have you set up the environment needed to train your model in this container?

Yes
No

Have you revised your training script for XGen by following the XGen manual?

Yes
No

If you haven't run the compatibility testing, you are recommended to select that option in the following questions.
XGen config file (absolute path): /root/Projects/xgen_mnist/xgen.json
Training script folder (absolute path): /root/Projects/xgen_mnist/
What do you want to do?

Compatibility test
Pruning
Scaling
Customized operation

You chose "Compatibility test". We will create a temporary workplace for you, and will delete it after the testing.
2022-10-03 14:00:15,181 - root - INFO - AIMET
Your current workplace is /tmp/b3cc6ec24c4646129820d4d72401239f
A new search is started!
****config summary****
xgen-config-path: /tmp/b3cc6ec24c4646129820d4d72401239f/xgen_config.json
xgen-workplace: /tmp/b3cc6ec24c4646129820d4d72401239f
xgen-resume: False
xgen-mode: compatible_testing
xgen-pretrained-model-path: ./checkpoint/ckpt.pth
Detail args: {'origin': {'common_train_epochs': 200, 'root_path': './workplace/', 'pretrain_model_weights_path': None, 'train_data_path': './data', 'train_label_path': None, 'eval_data_path': './data', 'eval_label_path': None, 'scaling_factor': 2, 'num_classes': 10, 'batch_size': 128}, 'general': {'user_id': 'test', 'work_place': '/tmp/b3cc6ec24c4646129820d4d72401239f', 'random_seed': None, 'enable_ddp': False, 'CUDA_VISIBLE_DEVICES': '0', 'tran_scripts_path': None}, 'prune': {'sp_store_weights': None, 'sp_lars': False, 'sp_lars_trust_coef': 0.001, 'sp_backbone': False, 'sp_retrain': False, 'sp_admm': False, 'sp_admm_multi': False, 'sp_retrain_multi': False, 'sp_config_file': None, 'sp_subset_progressive': False, 'sp_admm_fixed_params': False, 'sp_no_harden': False, 'nv_sparse': False, 'sp_load_prune_params': None, 'sp_store_prune_params': None, 'generate_rand_seq_gap_yaml': False, 'sp_admm_update_epoch': 5, 'sp_admm_update_batch': None, 'sp_admm_rho': 0.001, 'sparsity_type': 'block_punched', 'sp_admm_lr': 0.01, 'admm_debug': False, 'sp_global_weight_sparsity': False, 'sp_prune_threshold': -1.0, 'sp_block_irregular_sparsity': '(0,0)', 'sp_block_permute_multiplier': 2, 'sp_admm_block': '(8,4)', 'sp_admm_buckets_num': 16, 'sp_admm_elem_per_row': 1, 'sp_admm_tile': None, 'sp_admm_select_number': 4, 'sp_admm_pattern_row_sub': 1, 'sp_admm_pattern_col_sub': 4, 'sp_admm_data_format': None, 'sp_admm_do_not_permute_conv': False, 'sp_gs_output_v': None, 'sp_gs_output_ptr': None, 
'sp_load_frozen_weights': None, 'retrain_mask_pattern': 'weight', 'sp_update_init_method': 'weight', 'sp_mask_update_freq': 10, 'retrain_mask_sparsity': -1.0, 'retrain_mask_seed': None, 'sp_prune_before_retrain': False, 'output_compressed_format': False, 'sp_grad_update': False, 'sp_grad_decay': 0.98, 'sp_grad_restore_threshold': -1, 'sp_global_magnitude': False, 'sp_pre_defined_mask_dir': None, 'sp_prune_ratios': 0, 'admm_sparsity_type': 'block_punched', 'admm_block': '(8,4)', 'prune_threshold': -1.0}, 'quantization': {'qt_aimet': False, 'qat': True, 'fold_layers': True, 'cross_layer_equalization': False, 'bias_correction': True, 'rounding_mode': 'nearest', 'num_quant_samples': 1000, 'num_bias_correct_samples': 1000, 'weight_bw': 8, 'act_bw': 8, 'quant_scheme': 'tf_enhanced', 'layers_to_ignore': [], 'auto_add_bias': True, 'perform_only_empirical_bias_corr': True}, 'task': {'specific_scenarios': 'BasicTest', 'pretrained_model_path': './checkpoint/ckpt.pth', 'state': {'stage': 0, 'cycles': 0}, 'max_searching': 10}, 'user_requirements': {'power': None, 'accuracy': None, 'accuracy_reverse_yn': 0, 'model_size': None, 'memory_size': None, 'latency': 0.1, 'margin': 0.1, 'primary_type': 'latency', 'primary_range': '>', 'secondary_type': 'accuracy', 'secondary_range': '<', 'searching_variable': 'scaling_factor', 'searching_range': [1, 23], 'searching_step_size': 1, 'target_type': 'latency'}, 'train': {'common_save_best_yn': 1, 'trained_yn': False, 'larger_better': True}, 'compiler': {'input_shape': '(1, 1, 28, 28)', 'opset_version': 11, 'devices': ['RF8M428R6AK']}, 'distillation': {'distillation_method': None, 'enable_ddp': False, 'enable_dp': False, 'input_shape': None, 'original_loss_weights': 0.1, 'tag_loss_weights': 0.9, 'tag_loss': 'kl', 'tag_temperature': 4, 'tag_loss_combination_method': 'avg', 'feature_loss_weights': 0.9, 'feature_default_temperature': 1, 'advance_feature_mapping': {}, 'regularization_loss_weights': 1, 'regularization_loss_types': [], 
'discriminator_lr': 0.0001}}
Current search has 1 stages
Stage: 1 Max search cycles: 1
****config summary****
Current state: stage: 1/1| cycles: 1/1
Total jobs:1
processing job 1/1
Training...
MKL_THREADING_LAYER=GNU CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 python train_script_main.py
2022-10-03 14:00:16,288 - root - INFO - AIMET
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
9913344it [00:00, 33112378.88it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
29696it [00:00, 19239118.26it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
1649664it [00:00, 10896211.41it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
5120it [00:00, 16869470.92it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw

/root/miniconda3/envs/xgen/lib/python3.7/site-packages/torchvision/datasets/mnist.py:498: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:180.)
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Traceback (most recent call last):
  File "train_script_main.py", line 151, in <module>
    training_main(args_ai)
  File "train_script_main.py", line 130, in training_main
    xgen_load(model,args_ai=args_ai)
  File "/root/miniconda3/envs/xgen/lib/python3.7/site-packages/xgen_tools-1.0.1-py3.7.egg/xgen_tools/xgen_load.py", line 263, in xgen_load
Exception: args_ai or path can not be both none
Traceback (most recent call last):
  File "xgen_scripts.py", line 147, in main
    xgen(training_main, run, training_script_path=training_script_path, log_path=log_path)
  File "/root/miniconda3/envs/xgen/lib/python3.7/site-packages/xgen_main-1.0.17-py3.7.egg/xgen/xgen_run.py", line 364, in xgen
    internal_data = train_module(job, training_main)
  File "/root/miniconda3/envs/xgen/lib/python3.7/site-packages/xgen_main-1.0.17-py3.7.egg/xgen/train_module.py", line 169, in train_module
    args_ai = model_train_main(job, training_main)
  File "/root/miniconda3/envs/xgen/lib/python3.7/site-packages/xgen_main-1.0.17-py3.7.egg/xgen/train_module.py", line 148, in model_train_main
    raise Exception('Training failed')
Exception: Training failed

xipengshen commented 2 years ago

@EthanGuan Can you take a look?

xipengshen commented 2 years ago

I found the cause of the error: the custom AI training script follows our out-of-date standard. After I updated the script, this error went away, but a new error appeared. I'll close this issue and open a new one for the new error.
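
For anyone who hits the same message: the traceback shows the exception being raised inside xgen_load when neither args_ai nor a path is available. The sketch below is not XGen's code. check_load_inputs is a hypothetical helper that mirrors only the condition named in the error message, so a training script could fail early with a clearer hint before calling the real loader.

```python
from typing import Any, Mapping, Optional


def check_load_inputs(args_ai: Optional[Mapping[str, Any]] = None,
                      path: Optional[str] = None) -> str:
    """Hypothetical pre-flight check (an assumption, not part of XGen).

    Mirrors the condition in the logged error: at least one of
    `args_ai` or `path` must be provided. Returns the name of the
    source a loader could use: "path" if given, otherwise "args_ai".
    """
    if args_ai is None and path is None:
        raise ValueError(
            "args_ai or path can not be both none: "
            "the training script passed neither a config nor a checkpoint path"
        )
    return "path" if path is not None else "args_ai"


if __name__ == "__main__":
    # A config dict alone is enough to pass the check.
    print(check_load_inputs(args_ai={"origin": {"batch_size": 128}}))
    # With neither argument, the check fails before any model loading starts.
    try:
        check_load_inputs()
    except ValueError as err:
        print("caught:", err)
```

Calling a check like this right before xgen_load(model, args_ai=args_ai) would point at the training script rather than at the library when args_ai arrives as None.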

CoCoPIE-Group / XGen-Report

custom AI model loading error #21
