ctuning / ck-mlperf

This repository is outdated! Join the open MLPerf workgroup to participate in the development of the next generation of automation workflows for MLPerf benchmarks:
https://bit.ly/mlperf-edu-wg
BSD 3-Clause "New" or "Revised" License

Implement a script for cross-checking two image classification experiments #5

Closed psyhtest closed 5 years ago

psyhtest commented 5 years ago

The script can search for all experiments with specific tags (e.g. mlperf,image-classification,mobilenet-v1-1.0-224) and offer to select a pair of them for comparison (e.g. one from a reference implementation and another from a vendor-optimised one; or one from a floating-point implementation and another from a quantised one).
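A minimal sketch of the selection step, assuming the experiment entries have already been retrieved from the CK repository as a list of dictionaries. The `name` and `tags` fields, the entry names, the `armnn` tag and both helper functions are hypothetical and only illustrate the idea; they are not part of CK.

```python
from itertools import combinations


def filter_by_tags(entries, required_tags):
    """Keep only the entries whose tags include every required tag."""
    required = set(required_tags)
    return [entry for entry in entries if required.issubset(entry["tags"])]


def candidate_pairs(entries):
    """List every unordered pair of experiment names to offer for comparison."""
    return list(combinations([entry["name"] for entry in entries], 2))


# Hypothetical entries; real ones would come from the experiment repository.
entries = [
    {"name": "mobilenet-tflite-reference",
     "tags": {"mlperf", "image-classification", "mobilenet-v1-1.0-224", "tflite-0.1.7"}},
    {"name": "mobilenet-armnn-optimised",
     "tags": {"mlperf", "image-classification", "mobilenet-v1-1.0-224", "armnn"}},
    {"name": "resnet50-tflite",
     "tags": {"mlperf", "image-classification", "resnet50"}},
]

selected = filter_by_tags(entries, ["mlperf", "image-classification", "mobilenet-v1-1.0-224"])
pairs = candidate_pairs(selected)
for index, (first, second) in enumerate(pairs):
    print(f"{index}) {first} vs {second}")
```

With more than two matching experiments, the same loop would print a numbered menu from which the user picks one pair.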

For example, the reference MobileNet implementation can be benchmarked as follows:

$ ck benchmark program:image-classification-tflite \
--repetitions=10 --env.CK_BATCH_SIZE=1 --env.CK_BATCH_COUNT=2 \
--record --record_repo=local --record_uoa=mlperf-mobilenet-v1-1.00-224-tflite-0.1.7-performance \
--tags=mlperf,image-classification,mobilenet-v1-1.0-224,tflite-0.1.7,performance \
--skip_print_timers --skip_stat_analysis --process_multi_keys
psyhtest commented 5 years ago

To make it most useful for debugging, we should compare at least the 5 topmost predictions for each image (frame). Right now, the postprocessing script used for program:image-classification-tf* records a list of frame_predictions as follows:

        "frame_predictions": [
          {
            "accuracy_top1": "yes",
            "accuracy_top5": "yes",
            "class_correct": 65,
            "class_topmost": 65,
            "file_name": "ILSVRC2012_val_00000001.JPEG"
          },
          {
            "accuracy_top1": "yes",
            "accuracy_top5": "yes",
            "class_correct": 970,
            "class_topmost": 970,
            "file_name": "ILSVRC2012_val_00000002.JPEG"
          },

Perhaps we should convert it into a dictionary where the key is file_name and the value is an augmented dictionary.
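A minimal sketch of that conversion, assuming each record carries a unique file_name as in the excerpt above. The helper name `index_by_file_name` is hypothetical and is not taken from the postprocessing script.

```python
def index_by_file_name(frame_predictions):
    """Convert a list of per-frame records into a dict keyed by file_name."""
    indexed = {}
    for record in frame_predictions:
        file_name = record["file_name"]
        if file_name in indexed:
            # A repeated frame would silently overwrite data, so fail loudly.
            raise ValueError(f"duplicate frame: {file_name}")
        indexed[file_name] = record
    return indexed


# The two records shown in the excerpt above.
frame_predictions = [
    {"accuracy_top1": "yes", "accuracy_top5": "yes", "class_correct": 65,
     "class_topmost": 65, "file_name": "ILSVRC2012_val_00000001.JPEG"},
    {"accuracy_top1": "yes", "accuracy_top5": "yes", "class_correct": 970,
     "class_topmost": 970, "file_name": "ILSVRC2012_val_00000002.JPEG"},
]

by_file = index_by_file_name(frame_predictions)
print(by_file["ILSVRC2012_val_00000002.JPEG"]["class_topmost"])
```

Keying by file_name makes it possible to look up the same frame in two experiments directly, without relying on both lists being in the same order.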

psyhtest commented 5 years ago

By an augmented dictionary, I meant a dictionary that retains the accuracy_top1/5 and class_correct keys but converts class_topmost into an ordered list of (class, probability) tuples, with the highest-probability class first. This will allow us to flag discrepancies between experiments when either the classes mismatch or the probabilities differ by more than a configurable delta (e.g. 0.01 by default). The number of topmost classes to include can also be made configurable (e.g. 5 by default).
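A minimal sketch of such a cross-check, assuming both experiments have already been loaded as dictionaries keyed by file_name, with class_topmost stored as an ordered list of (class, probability) pairs. The function name `compare_frames`, its parameters and the probabilities in the sample data are illustrative assumptions, not the implemented script or measured results.

```python
def compare_frames(ref, other, delta=0.01, top_n=5):
    """Return a dict mapping file_name to a list of discrepancy messages."""
    discrepancies = {}
    for file_name in sorted(set(ref) | set(other)):
        if file_name not in ref or file_name not in other:
            discrepancies[file_name] = ["frame missing from one experiment"]
            continue
        ref_top = ref[file_name]["class_topmost"][:top_n]
        other_top = other[file_name]["class_topmost"][:top_n]
        issues = []
        if len(ref_top) != len(other_top):
            issues.append("different number of recorded predictions")
        # Pairs loaded from JSON arrive as two-element lists; unpacking handles both.
        for rank, ((ref_class, ref_prob), (other_class, other_prob)) in enumerate(
                zip(ref_top, other_top), start=1):
            if ref_class != other_class:
                issues.append(f"rank {rank}: class {ref_class} != {other_class}")
            elif abs(ref_prob - other_prob) > delta:
                diff = abs(ref_prob - other_prob)
                issues.append(f"rank {rank}: class {ref_class} probability differs by {diff:.4f}")
        if issues:
            discrepancies[file_name] = issues
    return discrepancies


# Illustrative data only: the probabilities are made up for this example.
ref = {
    "ILSVRC2012_val_00000001.JPEG": {"class_correct": 65,
                                     "class_topmost": [(65, 0.91), (62, 0.05)]},
    "ILSVRC2012_val_00000002.JPEG": {"class_correct": 970,
                                     "class_topmost": [(970, 0.60), (795, 0.30)]},
}
other = {
    "ILSVRC2012_val_00000001.JPEG": {"class_correct": 65,
                                     "class_topmost": [(65, 0.905), (62, 0.052)]},
    "ILSVRC2012_val_00000002.JPEG": {"class_correct": 970,
                                     "class_topmost": [(795, 0.55), (970, 0.35)]},
}

result = compare_frames(ref, other)
for file_name, issues in result.items():
    print(file_name, issues)
```

In this sketch a class mismatch at a given rank takes precedence over the probability check, since comparing probabilities of two different classes would not be meaningful.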

psyhtest commented 5 years ago

I've started by copying ck-tensorflow:script:cached-benchmark to ck-mlperf:script:image-classification.

$ ck find script:b98ee24399ef4c3a
/home/anton/CK_REPOS/ck-mlperf/script/image-classification
psyhtest commented 5 years ago

To use the new script in a client program, locally replace 689867d1939a781d with b98ee24399ef4c3a in its metadata, e.g.:

$ ck find program:image-classification-tflite
/home/anton/CK_REPOS/ck-tensorflow/program/image-classification-tflite
$ cd /home/anton/CK_REPOS/ck-tensorflow/program/image-classification-tflite/
$ vim .cm/meta.json
:%s/689867d1939a781d/b98ee24399ef4c3a
:w
$ git diff
diff --git a/program/image-classification-tflite/.cm/meta.json b/program/image-classification-tflite/.cm/meta.json
index 9cf5b9e..299fb5d 100644
--- a/program/image-classification-tflite/.cm/meta.json
+++ b/program/image-classification-tflite/.cm/meta.json
@@ -47,11 +47,11 @@
       "run_time": {
         "fine_grain_timer_file": "tmp-ck-timer.json",
         "post_process_cmds": [
-          "python $#ck_take_from_{script:689867d1939a781d}#$postprocess.py"
+          "python $#ck_take_from_{script:b98ee24399ef4c3a}#$postprocess.py"
         ],
         "post_process_via_ck": "yes",
         "pre_process_via_ck": {
-          "data_uoa": "689867d1939a781d",
+          "data_uoa": "b98ee24399ef4c3a",
           "module_uoa": "script",
           "script_name": "preprocess"
         },

NB: Since the new script is experimental, I'd rather not update any client programs globally yet.

psyhtest commented 5 years ago

The first change updates frame_predictions from a list to a dictionary:

$ ck benchmark program:image-classification-tflite \
--repetitions=1 --env.CK_BATCH_SIZE=1 --env.CK_BATCH_COUNT=5 \
--record --record_repo=local --record_uoa=mlperf-mobilenet-v1-1.00-224-tflite-0.1.7-accuracy-test \
--tags=mlperf,image-classification,mobilenet-v1-1.0-224,tflite-0.1.7,accuracy,test \
--skip_print_timers --skip_stat_analysis --process_multi_keys
...
***************************************************************************************
Pipeline executed successfully!
***************************************************************************************
Recording experiment ...

  Adding/updating entry ...
  Loading and locking entry (mlperf-mobilenet-v1-1.00-224-tflite-0.1.7-accuracy-test) ...
  Loaded and locked successfully (lock UID=6233c9897a4326b9) ...
  Prepared new point 0dd418e927da1d9e ...
        Processing characteristic point 1 out of 1 ...
      Subpoint: 1
  Updating entry and unlocking ...
Recorded successfully in 0.23 secs.
***************************************************************************************
Done!
...
$ ck find experiment:mlperf-mobilenet-v1-1.00-224-tflite-0.1.7-accuracy-test
/home/anton/CK_REPOS/local/experiment/mlperf-mobilenet-v1-1.00-224-tflite-0.1.7-accuracy-test
$ cd `ck find experiment:mlperf-mobilenet-v1-1.00-224-tflite-0.1.7-accuracy-test`
$ vim ckp-0dd418e927da1d9e.0001.json
...
        "frame_predictions": {
          "ILSVRC2012_val_00000001.JPEG": {
            "accuracy_top1": "yes",
            "accuracy_top5": "yes",
            "class_correct": 65,
            "class_topmost": 65,
            "file_name": "ILSVRC2012_val_00000001.JPEG"
          },
          "ILSVRC2012_val_00000002.JPEG": {
            "accuracy_top1": "no",
            "accuracy_top5": "yes",
            "class_correct": 970,
            "class_topmost": 795,
            "file_name": "ILSVRC2012_val_00000002.JPEG"
          },
...
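As a quick sanity check on the dictionary form, the per-frame flags can be aggregated into top-1 and top-5 accuracy. This is a minimal sketch that assumes frame_predictions has already been extracted from the recorded JSON file; the helper name `accuracy_summary` is hypothetical.

```python
def accuracy_summary(frame_predictions):
    """Aggregate the per-frame "yes"/"no" flags into accuracy fractions."""
    total = len(frame_predictions)
    top1 = sum(1 for p in frame_predictions.values() if p["accuracy_top1"] == "yes")
    top5 = sum(1 for p in frame_predictions.values() if p["accuracy_top5"] == "yes")
    return {
        "frames": total,
        "accuracy_top1": top1 / total if total else 0.0,
        "accuracy_top5": top5 / total if total else 0.0,
    }


# The two frames shown in the excerpt above.
frame_predictions = {
    "ILSVRC2012_val_00000001.JPEG": {
        "accuracy_top1": "yes", "accuracy_top5": "yes",
        "class_correct": 65, "class_topmost": 65,
        "file_name": "ILSVRC2012_val_00000001.JPEG"},
    "ILSVRC2012_val_00000002.JPEG": {
        "accuracy_top1": "no", "accuracy_top5": "yes",
        "class_correct": 970, "class_topmost": 795,
        "file_name": "ILSVRC2012_val_00000002.JPEG"},
}

summary = accuracy_summary(frame_predictions)
print(summary)
```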
psyhtest commented 5 years ago

The script has been implemented and successfully used to validate ArmNN ports of MLPerf workloads.