To maximize parallel operations on a computer with an M3 processor that contains 11 CPUs and 11 threads, it is essential to optimize the workload distribution and ensure efficient resource utilization.
Implementation details:
Inputs: file with input matrix (you choose the size) and kernel (fixed size 4x4)
The goal is to first have a working sequential code for the four operations. Then, parallelize the
operations that can be efficiently parallelized. Pay special attention to data races (more threads
requesting the same input) and to concurrency/conflicts (multiple threads updating the same
output). As a general suggestion, to achieve the best performance you should group in one
thread (or in multiple threads executed in the same CPU) all the operations that work on the
same input data. This avoids costly data copy to multiple locations.
There are multiple ways to parallelize the code. You can parallelize the single convolution or you
can parallelize the convolutions (each thread executes a 4x4 convolution). Please discuss the
benefit of each solution and evaluate the performance of both.
Suggestion: when you parallelize convolutions pay attention that if multiple threads take
subsequent sliding convolutions they all will need the same part of the input data, thus....
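A rough sketch of the two strategies mentioned above, in C with OpenMP pragmas. Everything named here (conv_rows_parallel, conv_inner_parallel, window_sum, row-major float storage, a window anchored at its top-left corner, an unflipped kernel) is an assumption for illustration, not something the assignment specifies. The first function gives each thread a contiguous block of output rows, so neighbouring windows that share input data stay in the same thread. The second splits the 16 multiply-adds of a single window across threads.

```c
#include <assert.h>

#define K 4  /* fixed 4x4 kernel, per the assignment */

/* One output element: 4x4 window anchored at (i, j), clipped at the
 * matrix border. Row-major storage and an unflipped kernel are
 * assumptions made for this sketch. */
static float window_sum(const float *in, const float *kernel,
                        int rows, int cols, int i, int j)
{
    int kr_max = (rows - i < K) ? rows - i : K;
    int kc_max = (cols - j < K) ? cols - j : K;
    float acc = 0.0f;
    for (int kr = 0; kr < kr_max; ++kr)
        for (int kc = 0; kc < kc_max; ++kc)
            acc += in[(i + kr) * cols + (j + kc)] * kernel[kr * K + kc];
    return acc;
}

/* Strategy A: parallelize across convolutions. schedule(static) hands
 * each thread a contiguous block of rows; every output element is
 * written by exactly one thread, so there is no write conflict. */
void conv_rows_parallel(const float *in, const float *kernel, float *out,
                        int rows, int cols)
{
    assert(in != out);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            out[i * cols + j] = window_sum(in, kernel, rows, cols, i, j);
}

/* Strategy B: parallelize inside a single convolution. The 16 products
 * of one window are split across threads and combined by a reduction. */
void conv_inner_parallel(const float *in, const float *kernel, float *out,
                         int rows, int cols)
{
    assert(in != out);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            float acc = 0.0f;
            #pragma omp parallel for reduction(+:acc)
            for (int t = 0; t < K * K; ++t) {
                int r = i + t / K, c = j + t % K;
                if (r < rows && c < cols)  /* outside the matrix counts as zero */
                    acc += in[r * cols + c] * kernel[t];
            }
            out[i * cols + j] = acc;
        }
    }
}
```

With only 16 multiply-adds per window, strategy B will probably spend more time starting and joining threads than computing, so it is likely the slower of the two; measuring both is the point of the exercise. Note that the reduction may add the products in a different order, so floating-point results can differ in the last bits. Compile with your compiler's OpenMP flag (for example -fopenmp on GCC); without it the pragmas are ignored and both functions run sequentially.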
You need to create an OpenMP file with the implementation of the convolution and a main
file for testing the function. The main will:
read the input matrix from a text file (matrix.txt) - randomly generated or static, you choose
read the kernel from a text file (kernel.txt) - fixed 4x4 size, you choose the values
apply convolution and save the result in a file
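For the file handling in the main, something like the following could work. The file layout is my assumption, since the assignment does not define one: a first line with "rows cols", then the values separated by whitespace. The names read_matrix and write_matrix are made up for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads "rows cols" followed by rows*cols whitespace-separated values.
 * Returns a malloc'd row-major array, or NULL on any failure.
 * The file layout is an assumption, not given by the assignment. */
float *read_matrix(FILE *fp, int *rows, int *cols)
{
    assert(fp && rows && cols);
    if (fscanf(fp, "%d %d", rows, cols) != 2 || *rows <= 0 || *cols <= 0)
        return NULL;
    size_t n = (size_t)*rows * (size_t)*cols;
    float *m = malloc(n * sizeof *m);
    if (!m)
        return NULL;
    for (size_t i = 0; i < n; ++i) {
        if (fscanf(fp, "%f", &m[i]) != 1) {
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Writes the matrix in the same layout. Returns 0 on success, -1 on error. */
int write_matrix(FILE *fp, const float *m, int rows, int cols)
{
    assert(fp && m);
    if (fprintf(fp, "%d %d\n", rows, cols) < 0)
        return -1;
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            if (fprintf(fp, "%.9g%c", m[i * cols + j],
                        (j + 1 < cols) ? ' ' : '\n') < 0)
                return -1;
    return 0;
}
```

In the main you would open matrix.txt and kernel.txt with fopen(..., "r"), call read_matrix on each, check that the kernel came back as 4x4, run the convolution, and pass the result to write_matrix.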
You need to present a performance report where you show the measurements of the
execution time of the sequential implementation (for simplicity, just set the number of threads to
1) and of various parallel implementations (degree of parallelism, threads distribution, threads
grouping etc...). Write your consideration in a PDF document to add to the submission.
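For the measurements, a small timing helper along these lines might be enough. It is only a sketch: time_best_of and count_call are hypothetical names, and it uses the C11 function timespec_get, which reads the wall clock, so very short runs will be noisy. Taking the fastest of several repetitions reduces the effect of other processes.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs fn(arg) reps times and returns the fastest run, in seconds. */
double time_best_of(void (*fn)(void *), void *arg, int reps)
{
    assert(fn && reps > 0);
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        double t0 = now_seconds();
        fn(arg);
        double dt = now_seconds() - t0;
        if (dt < 0.0)
            dt = 0.0;  /* guard against a wall-clock adjustment */
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Stand-in workload for illustration: counts how often it was called.
 * In the real program this would wrap the convolution call. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

The thread count can be varied without recompiling through the standard OMP_NUM_THREADS environment variable, for example OMP_NUM_THREADS=1 for the sequential baseline and then 2, 4, 8, 11 for the parallel runs.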
1 - consider the "zero padding", by performing convolution in the whole input matrix, till the last
column and the last row. Please do not add 3 extra rows and 3 extra columns of zeros in the
input matrix but try smarter solutions. The output matrix will have the same size as the input
matrix.
2 - consider bigger input matrix sizes and discuss if/why the performance improves.
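For point 1, one way to avoid the padded copy is to clip the window at the border, since a product with a padding zero contributes nothing to the sum. The sketch below is a sequential version under my own assumptions: the name conv4x4_same, row-major float storage, a window whose top-left corner sits on the output element (which matches the "3 extra rows and 3 extra columns" wording), and an unflipped kernel.

```c
#include <assert.h>
#include <stddef.h>

#define K 4  /* kernel is fixed at 4x4 in the assignment */

/* Same-size sliding-window convolution with implicit zero padding.
 * Assumed for this sketch: row-major float storage, window anchored at
 * its top-left corner on output element (i, j), kernel not flipped. */
void conv4x4_same(const float *in, const float *kernel, float *out,
                  size_t rows, size_t cols)
{
    assert(in != out);
    for (size_t i = 0; i < rows; ++i) {
        /* Near the last rows, use only the kernel rows that still fit. */
        size_t kr_max = (rows - i < K) ? rows - i : K;
        for (size_t j = 0; j < cols; ++j) {
            /* Same clipping for the last columns. */
            size_t kc_max = (cols - j < K) ? cols - j : K;
            float acc = 0.0f;
            for (size_t kr = 0; kr < kr_max; ++kr)
                for (size_t kc = 0; kc < kc_max; ++kc)
                    acc += in[(i + kr) * cols + (j + kc)] * kernel[kr * K + kc];
            out[i * cols + j] = acc;
        }
    }
}
```

For example, on a 5x5 input of ones with a kernel of ones, the top-left output is 16 (full window), the top-right is 4 (one column left), and the bottom-right is 1 (a single element left). The clipping costs two comparisons per output element and no extra memory.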
Does anyone have any ideas?