GreenWaves-Technologies / gap_sdk

SDK for Greenwaves Technologies' GAP8 IoT Application Processor
https://greenwaves-technologies.com/en/gap8-the-internet-of-things-iot-application-processor/
Apache License 2.0

Performance counters for each core during multicore processing #157

Closed MoIbs-tech closed 4 years ago

MoIbs-tech commented 4 years ago

I'm trying to measure the performance counters (PCs) of each core while running a CNN model generated with nntool. I set the number of cores with gap_ncore(), and the expected performance improvement appears. But when I measure a counter such as PERF_LD, in order to observe load instructions being processed concurrently, I get a value for only one core; the other 7 of the 8 active cores report zero. The PC measurement code is based on this example: https://github.com/GreenWaves-Technologies/gap_sdk/blob/master/examples/pmsis/test_periph/perf/perf.c

This is my main application, which runs the CNN and measures the PCs, in case there are any bugs in it:

`

#include "pmsis.h"

/* RT for GPIO settings */
#include "rt/rt_api.h"
#include "Gap.h"

/* Autotiler includes. */
#include "cifar10Kernels.h"
#include "gaplib/ImgIO.h"

/* To include voltage, FC_FREQ, CL_FREQ and GPIO def */
#include "/home/tp19021/gap_sdk/gap_volt_fc_cl_header.h"

#define AT_INPUT_WIDTH 32
#define AT_INPUT_HEIGHT 32
#define AT_INPUT_COLORS 3
#define AT_INPUT_SIZE (AT_INPUT_WIDTH*AT_INPUT_HEIGHT*AT_INPUT_COLORS)

typedef signed char IMAGE_IN_T;

// #define AT_INPUT_SIZE 64*64*3
signed char Input_1[AT_INPUT_SIZE] = {0};
signed short int Output_1[10];

// #ifndef STACK_SIZE
// values are from CLUSTER_STACK_SIZE AND CLUSTER_SLAVE_STACK_SIZE
// from makefile in visual_wake
#ifndef STACK_SIZE
#define STACK_SIZE 2048
#endif
// #endif

// #define PERF
#undef PERF

AT_HYPERFLASH_FS_EXT_ADDR_TYPE cifar10_L3_Flash = 0;

// structure for performance counters
// rt_perf_t *perf;
// uint32_t perf_ld;
PI_L2 uint32_t perf_values[ARCHI_CLUSTER_NB_PE] = {0};

static void RunNetwork(void *arg)
{
    pi_perf_conf(1<<RT_PERF_LD);

    printf("Running on cluster\n");

    uint32_t core_id;
    // for (int i = 0; i < 1; ++i) // 2000, 1000, 500, 100, 40
    // {
        /* code */
        pi_perf_start();
        core_id = pi_core_id();
        __PREFIX(CNN)(Input_1, Output_1);
        pi_perf_stop();
    // }
    perf_values[core_id] = pi_perf_read(PI_PERF_LD);
    printf("Runner completed\n");

    printf("\n");
}

int start()
{
    printf("Entering main controller\n");

    char *ImageName = "/home/tp19021/gap_sdk/examples/nntool/cifar10/images/cifar10_image_3196.ppm";

    struct pi_device cluster_dev;
    struct pi_cluster_task *task;
    struct pi_cluster_conf conf;

    pi_cluster_conf_init(&conf);
    pi_open_from_conf(&cluster_dev, (void *)&conf);
    pi_cluster_open(&cluster_dev);

    task = pmsis_l2_malloc(sizeof(struct pi_cluster_task));
    if (!task) {
        printf("failed to allocate memory for task\n");
        return 1; // do not fall through to memset() with a NULL pointer
    }
    memset(task, 0, sizeof(struct pi_cluster_task));
    task->entry = &RunNetwork;
    task->stack_size = STACK_SIZE;
    task->slave_stack_size = SLAVE_STACK_SIZE;
    task->arg = NULL;

    printf("Constructor\n");

    // IMPORTANT - MUST BE CALLED AFTER THE CLUSTER IS SWITCHED ON!!!!
    if (__PREFIX(CNN_Construct)())
    {
        printf("Graph constructor exited with an error\n");
        return 1;
    }

    printf("Reading image\n");
    // Reading Image from Bridge
    // Note: this read image function can only read ppm (portable pixel map images for color images)
    // and pgm (portable graymap images for grayscale images)
    if (ReadImageFromFile(ImageName, AT_INPUT_WIDTH, AT_INPUT_HEIGHT, AT_INPUT_COLORS,
                          Input_1, AT_INPUT_SIZE*sizeof(IMAGE_IN_T), IMGIO_OUTPUT_CHAR, 0)) {
        printf("\n Failed to load image %s\n", ImageName);
        return 1;
    }
    printf("Finished reading image\n");

    printf("Initialize Performace Counters \n");
    // perf = rt_alloc(RT_ALLOC_L2_CL_DATA, sizeof(rt_perf_t));
    // rt_perf_init(perf);

    printf("Call cluster\n");

    pi_pad_set_function(PI_PAD_31_B11_TIMER0_CH0, PI_PAD_31_B11_GPIO_A17_FUNC1 ); // for A17 pin
    pi_gpio_pin_configure(NULL, GPIO, PI_GPIO_OUTPUT);
    pi_gpio_pin_write(NULL, GPIO, 1);
    // Execute the function "RunNetwork" on the cluster.
    pi_cluster_send_task_to_cl(&cluster_dev, task);
    pi_gpio_pin_write(NULL, GPIO, 0);

    for (uint32_t i = 0; i < (uint32_t) ARCHI_CLUSTER_NB_PE; i++)
    {
        printf("[%d %d] PERF_LD : %d \n", 0, i, perf_values[i]);
    }

    for (int i = 0; i < 10; ++i)
    {
        printf("Output %d: %d \n", i+1, Output_1[i]);
    }

    __PREFIX(CNN_Destruct)();

    pmsis_exit(0);
    printf("Ended\n");
    return 0;
}

int main(void)
{
    set_gap8_state(); // my function for setting input voltage, CL FREQ and FC FREQ
    printf(" NNTOOL cifar10 Deeper application \n");
    return pmsis_kickoff((void *) start);
}
`

MoIbs-tech commented 4 years ago

To add to the above comment: when I switch gap_ncore() between 1 and 8, the one core whose value prints successfully reports a lower PERF_LD with gap_ncore() = 8 than with gap_ncore() = 1, while the remaining cores still show 0, as shown below:

For gap_ncore() = 1:
`
Initialize Performace Counters
Call cluster
Running on cluster
Runner completed

[0 0] PERF_LD_STALL : 11681583
[0 1] PERF_LD_STALL : 0
[0 2] PERF_LD_STALL : 0
[0 3] PERF_LD_STALL : 0
[0 4] PERF_LD_STALL : 0
[0 5] PERF_LD_STALL : 0
[0 6] PERF_LD_STALL : 0
[0 7] PERF_LD_STALL : 0

For gap_ncore() = 8:
Initialize Performace Counters
Call cluster
Running on cluster
Runner completed

[0 0] PERF_LD_STALL : 1474391
[0 1] PERF_LD_STALL : 0
[0 2] PERF_LD_STALL : 0
[0 3] PERF_LD_STALL : 0
[0 4] PERF_LD_STALL : 0
[0 5] PERF_LD_STALL : 0
[0 6] PERF_LD_STALL : 0
[0 7] PERF_LD_STALL : 0
`

Note: the PERF_LD_STALL label in the printout is outdated; the values above are for PERF_LD, i.e. the number of load instructions executed.
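A quick sanity check on the two core 0 values, as a sketch: if the loads were split evenly across the team, core 0 would execute about one eighth of them with gap_ncore() = 8, so the ratio between the two readings should be close to 8. The function name `core0_load_ratio` and the constant names are hypothetical; only the two numbers come from the printouts.

```c
#include <stdio.h>

/* Core 0 readings copied from the two printouts (PERF_LD values). */
static const unsigned long LD_NCORE_1 = 11681583UL; /* gap_ncore() = 1 */
static const unsigned long LD_NCORE_8 = 1474391UL;  /* gap_ncore() = 8 */

/* Ratio of core 0's load count with one core to its count with eight cores.
 * A value near 8 would suggest the work is divided across the cores and
 * that only core 0's share is being captured. */
double core0_load_ratio(void)
{
    return (double)LD_NCORE_1 / (double)LD_NCORE_8;
}

/* Hypothetical helper: print the ratio for inspection. */
void print_core0_load_ratio(void)
{
    printf("core 0 load ratio (1 core / 8 cores): %.3f\n", core0_load_ratio());
}
```

The ratio comes out a little under 8, which fits the reading that the CNN is being parallelized and that the zeros are a measurement problem rather than idle cores.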

haugoug commented 4 years ago

The problem is that RunNetwork is only executed by core 0. You have to do a fork so that all cores execute the same function for starting the performance counter and also for measuring it.

MoIbs-tech commented 4 years ago

I was hoping to measure each core while the function is parallelized across all cores, rather than run the whole function on each core. But I'm guessing that's not possible, and not really a valid way of measuring performance counters in a multi-core processing scenario?


(Sent by email in reply to haugoug's comment of 03 August 2020, 20:02.)


sousoux commented 4 years ago

The fork is in the runner __PREFIX(CNN)(Input_1, Output_1);

If you change gap_ncore(), you change the number of cores the fork occurs on.

(Sent by email in reply to MoIbs-tech's comment of Tue, Aug 4, 2020, 12:23 AM.)


haugoug commented 4 years ago

I mean that you need to init, start and read the counters from each core. To do this, you can do a fork that executes a function on all cores to init and start each core's counters, and then another fork with another function to read the counter values.
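A minimal sketch of this two-fork approach, assuming the PMSIS calls `pi_cl_team_fork`, `pi_perf_conf`, `pi_perf_reset`, `pi_perf_start`, `pi_perf_stop` and `pi_perf_read`, and assuming each core's counter keeps running between the two forks. The names `USE_PMSIS`, `run_workload`, `start_counters`, `read_counters`, `run_network_profiled` and everything prefixed `sim_` are hypothetical. The `#else` branch is not SDK code; it only provides host-side stand-ins so the control flow can be tried off-target.

```c
#include <stdint.h>
#include <stdio.h>

#ifdef USE_PMSIS /* hypothetical switch: define it when building with the GAP SDK */
#include "pmsis.h"
#include "Gap.h"
/* Hypothetical wrapper around __PREFIX(CNN)(Input_1, Output_1), provided by the application. */
extern void run_workload(void);
#else
/* Host-only stand-ins, NOT SDK code: the "cores" run one after another
 * and the load counter is bumped by hand. */
#define ARCHI_CLUSTER_NB_PE 8
#define PI_PERF_LD 0
#define PI_L2
static uint32_t sim_core, sim_count[ARCHI_CLUSTER_NB_PE], sim_on[ARCHI_CLUSTER_NB_PE];
static uint32_t pi_core_id(void) { return sim_core; }
static int gap_ncore(void) { return ARCHI_CLUSTER_NB_PE; }
static void pi_perf_conf(unsigned events) { (void)events; }
static void pi_perf_reset(void) { sim_count[sim_core] = 0; }
static void pi_perf_start(void) { sim_on[sim_core] = 1; }
static void pi_perf_stop(void) { sim_on[sim_core] = 0; }
static uint32_t pi_perf_read(int id) { (void)id; return sim_count[sim_core]; }
static void pi_cl_team_fork(int nb_cores, void (*entry)(void *), void *arg)
{
    for (sim_core = 0; sim_core < (uint32_t)nb_cores; sim_core++) entry(arg);
    sim_core = 0;
}
/* Pretend each core executes 100 + core_id loads while its counter is on. */
static void sim_worker(void *arg)
{
    (void)arg;
    if (sim_on[sim_core]) sim_count[sim_core] += 100 + sim_core;
}
static void run_workload(void) { pi_cl_team_fork(gap_ncore(), sim_worker, NULL); }
#endif

PI_L2 uint32_t perf_values[ARCHI_CLUSTER_NB_PE] = {0};

/* Runs on every core of the team: configure and start that core's counter. */
static void start_counters(void *arg)
{
    (void)arg;
    pi_perf_conf(1 << PI_PERF_LD);
    pi_perf_reset();
    pi_perf_start();
}

/* Runs on every core of the team: stop and store that core's own count. */
static void read_counters(void *arg)
{
    (void)arg;
    pi_perf_stop();
    perf_values[pi_core_id()] = pi_perf_read(PI_PERF_LD);
}

/* Cluster entry, executed by core 0 only; the forks reach the other cores. */
void run_network_profiled(void *arg)
{
    (void)arg;
    pi_cl_team_fork(gap_ncore(), start_counters, NULL); /* fork 1: init + start */
    run_workload();                                     /* the network forks internally */
    pi_cl_team_fork(gap_ncore(), read_counters, NULL);  /* fork 2: stop + read */
}
```

With this layout the printing loop on the fabric controller side can stay as it is, since `perf_values` is written once per core index instead of only by core 0.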

Yaooooo commented 4 years ago

I will close this issue. Please feel free to reopen it if you have further related questions. Thanks.