IBM / mini-era

Mini-ERA is a simplified but still representative version of the main ERA workload.

Missing path in setup_paths.sh #15

Closed wohlbier closed 2 years ago

wohlbier commented 2 years ago

In setup_paths.sh on the docker-hpvm-yolo branch, APPROXHPVM_DIR is set to a path that does not exist in the build. This line in particular:

export APPROXHPVM_DIR=/home/espuser/hpvm-release/approxhpvm-nvdla
(hpvm) [espuser@cca607aa590e mini-era]$ ls /home/espuser/hpvm-release/approxhpvm-nvdla
ls: cannot access /home/espuser/hpvm-release/approxhpvm-nvdla: No such file or directory

https://github.com/IBM/mini-era/blob/docker-hpvm-yolo/setup_paths.sh
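A quick way to catch this kind of problem is to check every exported directory variable after sourcing the script. The snippet below is only a sketch: `check_dirs` is a hypothetical helper, not something that ships with mini-era, and it assumes the variables of interest are passed in by name.

```shell
# check_dirs: hypothetical helper (not part of mini-era).
# For each variable NAME given as an argument, report it if its value
# is unset, empty, or not an existing directory.
# Returns 0 when every directory exists, 1 otherwise.
# Note: written for plain POSIX sh, so the loop variables are not local.
check_dirs() {
    missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ ! -d "$value" ]; then
            echo "missing: $name=${value:-<unset>}"
            missing=1
        fi
    done
    return "$missing"
}
```

After `source setup_paths.sh`, running `check_dirs APPROXHPVM_DIR` would then print a `missing:` line for the path shown above instead of failing later in the build.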

wohlbier commented 2 years ago

Quick heads-up.
While doing some testing we discovered a bug in the Dockerfile that prevents the container from running the mini-era application. The application should run under the base conda environment instead of the hpvm one: $ conda activate base. The Dockerfile has been patched accordingly; the patch affects only the last layer, so there is no need to regenerate the entire image.

Regarding the APPROXHPVM_DIR path: this is part of the mini-era directory, which is in the Docker image for legacy reasons and is not intended to be run or compiled. In Step 2 of the previous email, we will produce a clean image with only the necessary files. For now, this version of the Docker image runs the scheduler version of mini-era.

cd scheduler-library/examples/mini-era && ./test-scheduler-S-P3V0F0N0 -t traces/tt00.new -o -G config_files/base_me_p2.config

OK, so I should not cd into mini-era and run 'make'. Those are the builds that are failing, but it sounds like they are not expected to run.

I'll inspect the output of the scheduler version of mini-era to try to get a handle on the yolo model.

I did notice that the example dumps core at the end.

Accelerator Usage Statistics:

Per-Accelerator allocation/usage statistics:
 Acc_Type 0 CPU-Acc : Accel  0 Allocated  16088 times
 Acc_Type 0 CPU-Acc : Accel  1 Allocated   3867 times
 Acc_Type 0 CPU-Acc : Accel  2 Allocated     45 times

Per-Accelerator-Type allocation/usage statistics:
 Acc_Type 0 CPU-Acc Allocated  20000 times
 Acc_Type 1 FFT-HW-Acc Allocated      0 times
 Acc_Type 2 VIT-HW-Acc Allocated      0 times
 Acc_Type 3 CV-HW-Acc Allocated      0 times

Per-Meta-Block Accelerator allocation/usage statistics:
 Per-MB Acc_Type 0 CPU-Acc : Accel  0 Allocated   3024 times for MB29
 Per-MB Acc_Type 0 CPU-Acc : Accel  0 Allocated   3064 times for MB30
 Per-MB Acc_Type 0 CPU-Acc : Accel  0 Allocated  10000 times for MB31
 Per-MB Acc_Type 0 CPU-Acc : Accel  1 Allocated   1948 times for MB29
 Per-MB Acc_Type 0 CPU-Acc : Accel  1 Allocated   1919 times for MB30
 Per-MB Acc_Type 0 CPU-Acc : Accel  2 Allocated     28 times for MB29
 Per-MB Acc_Type 0 CPU-Acc : Accel  2 Allocated     17 times for MB30

Done.
In the cleanup-state routine...
Doing accelerator type closeout for 4 accelerators
Segmentation fault (core dumped)

dtrilla commented 2 years ago

Thanks for bringing this up.

The segmentation fault has been patched in commit ceb11ff of the scheduler library.

You will need to regenerate the image, at least from the layer where the scheduler library is cloned. We also made some updates to the last layer of the Dockerfile and how it should run. The README has been updated accordingly.

To run the container, use the following command (the path of yolo-data has changed):

docker run -uespuser -v "$(pwd)"/yolo-data:/home/espuser/mini-era/yolo-data:ro --rm -it <name>:<tag> /bin/bash