PaddlePaddle / continuous_evaluation

Macro Continuous Evaluation Platform for Paddle.

CE model alignment #45

Open guochaorong opened 6 years ago

guochaorong commented 6 years ago

Add multi-GPU support to the CE models. The multi-GPU speedup-ratio metric for Model CE still needs to be verified.
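
The speedup ratio mentioned above could be computed from the durations of a single-GPU run and a multi-GPU run of the same workload. This is a minimal sketch, assuming both runs execute the same number of passes over the same data; the function names and sample durations are hypothetical and not part of the CE code base.

```python
def speedup_ratio(single_gpu_duration, multi_gpu_duration):
    """Return how many times faster the multi-GPU run is than the single-GPU run."""
    if single_gpu_duration <= 0 or multi_gpu_duration <= 0:
        raise ValueError("durations must be positive")
    return single_gpu_duration / multi_gpu_duration


def scaling_efficiency(single_gpu_duration, multi_gpu_duration, num_gpus):
    """Return speedup divided by GPU count; 1.0 would mean linear scaling."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return speedup_ratio(single_gpu_duration, multi_gpu_duration) / num_gpus


# Hypothetical durations in seconds for one model on 1 GPU and on 4 GPUs.
ratio = speedup_ratio(120.0, 40.0)
efficiency = scaling_efficiency(120.0, 40.0, num_gpus=4)
print("speedup: %.2fx, efficiency: %.0f%%" % (ratio, efficiency * 100))
```

Reporting efficiency next to the raw ratio makes it easier to compare models that run on different numbers of GPUs.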

guochaorong commented 6 years ago

I went through the models in CE (see the table attached below).
The models are: image_classification, vgg16, mnist, object_detection, resnet30, resnet50, seq2seq, sequence_tagging_for_ner, text_classification, transformer, language_model, lstm.

The following items need to be added or aligned:

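  1. Switch all models to multi-GPU runs (4 GPUs). (Later I will move the GPU selection outside the model scripts, so that each model runs once on a single GPU and once on multiple GPUs.)

  2. Each model's evaluation metrics must include these four values: acc/ppl, cost, mem, and duration.

  3. Currently only the diff of the four metrics above is monitored. I observed two unexpected situations: (1) the run is very short and acc is very low (0.1); (2) the run goes through many passes and acc is still very low (0.1, meaning the model itself has a problem). Interim plan: lengthen the runs that have very few passes (to about 30 min) and bring acc above 0.5 for all models. (Later I will add an alert based on an acc baseline threshold.)

  4. Use the datasets that are already available locally (instead of downloading them on every run), placed in the default directory /root/.cache/paddle/dataset.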
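
Items 2 and 3 above could be combined into one check per model run: verify that all four metrics are present, compare each against its baseline, and alert when acc falls below an absolute floor. The sketch below is only an illustration. The function name, the dictionary layout, and the 5% diff tolerance are assumptions; only the 0.5 acc floor comes from the plan above.

```python
def check_run(current, baseline, max_rel_diff=0.05, acc_floor=0.5):
    """Return a list of alert strings for one model run (empty if all is well).

    current and baseline map metric names to numbers, for example
    {"acc": 0.59, "cost": 0.71, "mem": 2048, "duration": 63.9}.
    """
    alerts = []
    # Language models report perplexity (ppl) instead of accuracy.
    quality_key = "ppl" if "ppl" in current else "acc"
    for key in (quality_key, "cost", "mem", "duration"):
        if key not in current:
            alerts.append("missing metric: %s" % key)
            continue
        base = baseline.get(key)
        if base:  # no diff check without a non-zero baseline value
            rel_diff = abs(current[key] - base) / abs(base)
            if rel_diff > max_rel_diff:
                alerts.append("%s changed by %.1f%% vs. baseline" % (key, rel_diff * 100))
    # Absolute floor: catches a model that is stable but was never good.
    if quality_key == "acc" and current.get("acc", acc_floor) < acc_floor:
        alerts.append("acc %.3f is below the floor %.2f" % (current["acc"], acc_floor))
    return alerts


# Hypothetical run: metrics match the baseline, but acc is stuck at 0.1.
run = {"acc": 0.1, "cost": 2.3, "mem": 1000, "duration": 60.0}
for alert in check_run(run, baseline=dict(run)):
    print(alert)
```

The floor check is kept separate from the diff check because a model whose acc stays at 0.1 from run to run produces no diff and would otherwise never raise an alert.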

| Model | Dataset | Passes | Current run status | Evaluation metrics | Parameters |
|---|---|---|---|---|---|
| Lstm | Movie reviews; Layers:words DynamicRNN; paddle.dataset.imdb as imdb; http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz | 1 | Pass = 0, Iter = 49, Loss = 0.713064, Accuracy = 0.593750; nvidia-smi --id=%s --query-compute-apps=used_memory --format=csv -lms 1 > memory.txt | imdb_32_train_speed, imdb_32_gpu_memory | batch_size: 32, device: GPU, emb_dim: 512, gpu_id: 0, hidden_dim: 512, iterations: 50, skip_batch_num: 5 |
| object_detection | pascalvoc and coco; expected under /data/, but not present there | 2 | IOError: [Errno 2] No such file or directory: '/data/pascalvoc/label_list'; the data needs to be placed under /data | train_cost_kpi, train_speed_kpi | batch_size: 64, is_toy: 0, iterations: 120, learning_rate: 0.001, num_passes: 2, parallel: True, use_gpu: True |
| Resnet50 | Flowers, cifar; http://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz | 29 (does not converge) | Pass:2, Loss:3.229035, Train Accuray:0.247656, Test Accuray:0.176471, Handle Images Duration: 63.949636; Pass:29, Loss:0.026319, Train Accuray:0.993359, Test Accuray:0.559400, Handle Images Duration: 22.501337 | cifar10_128_train_acc_kpi, cifar10_128_train_speed_kpi, cifar10_128_gpu_memory_kpi, flowers_64_train_speed_kpi, flowers_64_gpu_memory_kpi; a thread is started to collect mem info, and acc etc. are not evaluated | batch_size: 64, data_format: NCHW, data_set: flowers, device: GPU, infer_only: False, iterations: 80, model: resnet_imagenet, pass_num: 3, skip_batch_num: 5 |
| language_model | /root/.cache/paddle/dataset/imikolov/simple-examples.tgz | | ppl:61.667 time_cost(s):18.544248 | | |
| sequence_tagging_for_ner | http://cs224d.stanford.edu/assignment2/assignment2.zip | 22 | download data error! OK after adding the data directory. [TestSet] pass_id:2200 (pass num increases by 100 each time) pass_precision:[0.18181819] pass_recall:[0.125] pass_f1_score:[0.14814815] | train_acc_kpi, pass_duration_kpi | |
| text_classification | Imdb; http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz | 14 | avg_acc: 0.999800, avg_cost: 0.002255 | | |
| Vgg16 | flowers/imagelabels.mat; http://www.robots.ox.ac.uk/~vgg/data/flowers/102/imagelabels.mat | 1 | cifar10 Pass: 1, Loss: 1.810090, Train Accuray: 0.234375; Pass: 49, Loss: 3.561218, Train Accuray: 0.093750 | cifar10_128_train_speed_kpi, cifar10_128_gpu_memory_kpi, flowers_32_train_speed_kpi, flowers_32_gpu_memory_kpi; a thread is started to collect mem info, and acc etc. are not evaluated | |
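
The `nvidia-smi ... > memory.txt` command in the Lstm row samples GPU memory into a file, which then has to be reduced to a single mem value. The sketch below is one possible way to take the peak. It assumes the file contains lines such as `1234 MiB` plus header lines, which is how the CSV output is expected to look; the exact layout may differ by driver version. The function name is hypothetical.

```python
import re

# Matches a sample line such as "1234 MiB"; header lines do not match.
_MIB_LINE = re.compile(r"^\s*(\d+)\s*MiB\s*$")


def peak_memory_mib(lines):
    """Return the largest memory sample in MiB, or None if no sample was found."""
    peak = None
    for line in lines:
        match = _MIB_LINE.match(line)
        if match:
            value = int(match.group(1))
            if peak is None or value > peak:
                peak = value
    return peak


# Hypothetical excerpt of memory.txt.
sample = ["used_gpu_memory [MiB]", "512 MiB", "2048 MiB", "1024 MiB"]
print(peak_memory_mib(sample))
```

Because the function accepts any iterable of lines, an open file handle for memory.txt can be passed to it directly.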