Parallelize relevant portions of IR2Vec with OMP

IITH-Compilers / IR2Vec

Implementation of IR2Vec, published in ACM TACO


Parallelize relevant portions of IR2Vec with OMP #101

Open svkeerthy opened 3 months ago

m-atalla commented 1 month ago

Hi, this seems like an interesting enhancement that I'd like to help out on.

I think it's important to have a baseline to compare against for any potential improvements. Is the TimeTaken experiment suitable for that? Also, is there a script I could use to generate the timings as in experiments/TimeTaken/TimeTaken_Algos.csv?

I'd be happy to add an additional benchmark as well, the SQLite Amalgamation might be an interesting option.

svkeerthy commented 1 month ago

Hi @m-atalla,

Apologies for the delay in responding. We do not have a script for this yet. It would be great if you could help with this. SQLite Amalgamation is also very interesting and would be a valuable addition.

We have started integrating OMP with IR2Vec (See #105, which is a work in progress).

Please feel free to reach out if you need any inputs or have further questions. Will be happy to help :)

Best, Venkat

m-atalla commented 2 weeks ago

Hi, I wanted to follow up with profiling info on the SQLite benchmark now that it's added!

I used Linux perf to get the profile data using the following commands:

$ perf record -g --call-graph dwarf build/bin/ir2vec --sym -level p ./src/test-suite/PE-benchmarks-llfiles-llvm17/sqlite3.ll -o sqlite.txt
$ perf script > /tmp/sym-perf.out

I then used the Firefox Profiler to analyze and upload the profile data, which can be found here. From the call tree, it seems that about 53% of the time is spent on parsing (not much can be done about that) and 44% is spent in IR2Vec_Symbolic::bb2Vec, which should be a good target for parallelism. Fortunately, it looks like #105 is already making progress on it!
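To make the idea of parallelizing per-basic-block work concrete, here is a minimal sketch of the general pattern with OpenMP. It is not IR2Vec's actual bb2Vec code: `Vector`, `embedBlock`, and `embedFunction` are hypothetical stand-ins, and plain `std::vector<double>` replaces the LLVM types. Each block's vector is computed independently into its own slot, and the final sum is done serially so no two threads write to the same element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an embedding vector (not IR2Vec's own type).
using Vector = std::vector<double>;

// Hypothetical per-block embedding: the element-wise sum of the block's
// instruction vectors. Reads only its arguments, so calls are independent.
Vector embedBlock(const std::vector<Vector> &instVecs, std::size_t dim) {
  Vector out(dim, 0.0);
  for (const Vector &v : instVecs) {
    assert(v.size() == dim && "all instruction vectors must have size dim");
    for (std::size_t i = 0; i < dim; ++i)
      out[i] += v[i];
  }
  return out;
}

// Hypothetical function-level embedding: blocks are processed in parallel,
// each thread writing only to its own perBlock slot.
Vector embedFunction(const std::vector<std::vector<Vector>> &blocks,
                     std::size_t dim) {
  std::vector<Vector> perBlock(blocks.size());
  // A signed loop index keeps the loop valid for older OpenMP versions.
  const long n = static_cast<long>(blocks.size());
#pragma omp parallel for schedule(dynamic)
  for (long b = 0; b < n; ++b)
    perBlock[b] = embedBlock(blocks[b], dim);

  // Serial reduction over the per-block results.
  Vector out(dim, 0.0);
  for (const Vector &v : perBlock)
    for (std::size_t i = 0; i < dim; ++i)
      out[i] += v[i];
  return out;
}
```

Built with `-fopenmp` the loop runs in parallel; without it the pragma is ignored and the code runs serially with the same result. Whether the real bb2Vec can be split this way depends on shared state (caches, LLVM data structures) that this sketch does not model.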

Similarly, I generated a profile for the flow-aware (FA) mode, which can be found here. The call tree shows that the functions IR2Vec_FA::solveInsts and IR2Vec_FA::func2Vec account for 33% and 24% of the time respectively.

I'd be happy to assist further as needed.

Thank you. Mohamed.

svkeerthy commented 2 weeks ago

Hi @m-atalla,

Thanks for the perf report :) It exposes more opportunities for optimizations in addition to parallelization.

Off the top of my head, I have two things:

1. As you also pointed out, one of the major overheads in the FA flow is the solveInsts method, which internally invokes the Eigen solver. We recently made Eigen an optional dependency: if Eigen is not available, we approximate the solution with a handwritten solver. It would be interesting to see whether that reduces the current overhead.
2. 14% of the total time is spent on SmallVector copies in the IR2Vec_FA::func2Vec method. It would be good to eliminate or reduce this overhead by using references or moves.
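As an illustration of the "references or moves" point, here is a small sketch of the two usual ways to avoid copying a vector. It assumes nothing about the actual func2Vec code: `std::vector<double>` stands in for `llvm::SmallVector`, and `addInto` and `cacheVector` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the SmallVector-based embedding type.
using Vector = std::vector<double>;

// Taking the operand by const reference avoids the copy that a by-value
// parameter (Vector v) would make on every call.
void addInto(Vector &acc, const Vector &v) {
  assert(acc.size() == v.size() && "vectors must have the same size");
  for (std::size_t i = 0; i < acc.size(); ++i)
    acc[i] += v[i];
}

// Taking an rvalue reference and moving lets the cache take over the
// vector's storage instead of copying its elements.
void cacheVector(std::map<std::string, Vector> &cache, const std::string &key,
                 Vector &&v) {
  cache[key] = std::move(v);
}
```

A caller that no longer needs its local vector would write `cacheVector(cache, name, std::move(vec))`. Note that moving a SmallVector whose elements live in its inline buffer still copies those elements, so references may help more than moves for small sizes.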

Perhaps I will create separate issues to track these as the objective of these points is a bit different from that of the current issue. Please give me some time. I will have a more detailed look at the perf report and get back with more possible improvements.