Extracting static features programmatically

hrshtv commented 3 years ago

This link has a nice interface for extracting all MILEPOST static program features at the click of a button. Can we do the same thing programmatically in Python? I'm looking for something along the following lines:

path = "example.c"
features = milepost.extract(path) # This would be a dict/list of the extracted features

Is something like this possible?

gfursin commented 3 years ago

Hi @hrshtv,

Thank you for your interest!

At the moment, there is no OO class for MILEPOST. However, I plan to gradually make CK more Pythonic in 2021.

In the meantime, you can extract MILEPOST features for a given CK program as follows:

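import ck.kernel as ck

# Run the 'extract' action of the program.static.features module.
r = ck.access({'action': 'extract', 'module_uoa': 'program.static.features', 'data_uoa': 'cbench-automotive-susan'})
if r['return'] > 0:
    ck.err(r)  # report the error returned by CK

features = r.get('dict', {}).get('features', {})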

You need to have MILEPOST GCC installed via CK.
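
The `milepost.extract(path)` call asked for at the top of this thread does not exist, but a thin wrapper around the snippet above gets reasonably close. The following is only a sketch: `extract_features` and its `access` parameter are hypothetical names introduced here, not part of CK. It assumes that a failing call returns a dict with a positive `return` code and an `error` message, and it takes a CK program name rather than a source path. The `access` parameter lets you substitute a stand-in for `ck.access` when CK is not installed.

```python
from typing import Any, Callable, Dict, Optional

CkCall = Callable[[Dict[str, Any]], Dict[str, Any]]


def extract_features(data_uoa: str, access: Optional[CkCall] = None) -> Dict[str, Any]:
    """Return the MILEPOST static features of a CK program as a dict.

    `access` defaults to ck.access; pass a stand-in callable to try the
    wrapper without CK installed (hypothetical convenience, not a CK API).
    """
    if access is None:
        import ck.kernel as ck  # needs CK and MILEPOST GCC installed via CK
        access = ck.access

    r = access({
        "action": "extract",
        "module_uoa": "program.static.features",
        "data_uoa": data_uoa,
    })
    # Assumption: a positive 'return' code signals failure, with the
    # message stored under 'error'.
    if r.get("return", 0) > 0:
        raise RuntimeError(r.get("error", "unknown CK error"))
    return r.get("dict", {}).get("features", {})
```

With CK set up, `extract_features("cbench-automotive-susan")` would then mirror the snippet above, raising an exception instead of calling `ck.err`.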

If you want to extract features from arbitrary source code, copy an existing CK program to a dummy CK program, add your source file to it, and list the file in the program's CK meta, roughly as follows:

ck cp program:cbench-automotive-susan program:my-dummy-program

ck find program:my-dummy-program

# Add your source code there, and add its file name to .cm/meta.json

ck extract program.static.features:my-dummy-program
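
The manual step in the middle (copying a source file in and listing it in `.cm/meta.json`) could be scripted along the following lines. This is a sketch under assumptions: `add_source_to_program` is a hypothetical helper, `program_dir` is the directory printed by `ck find program:my-dummy-program`, and the meta key `source_files` is an assumption about the program meta layout that should be checked against the copied `meta.json`.

```python
import json
import shutil
from pathlib import Path
from typing import Any, Dict


def add_source_to_program(program_dir: str, source: str,
                          meta_key: str = "source_files") -> Dict[str, Any]:
    """Copy `source` into a CK program directory and list it in .cm/meta.json.

    `meta_key` is an assumption about where the program meta keeps its
    list of source files; adjust it to match the copied meta.json.
    """
    program = Path(program_dir)
    src = Path(source)

    # Put the source file next to the program's existing sources.
    shutil.copy(src, program / src.name)

    # Register the file name in the program meta, avoiding duplicates.
    meta_path = program / ".cm" / "meta.json"
    meta = json.loads(meta_path.read_text())
    files = meta.setdefault(meta_key, [])
    if src.name not in files:
        files.append(src.name)
    meta_path.write_text(json.dumps(meta, indent=2) + "\n")
    return meta
```

After running it, `ck extract program.static.features:my-dummy-program` is still the step that performs the extraction.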

If it sounds useful, I can provide more explanations ...

Also, @ChrisCummins is working on related infrastructure and mentioned that he plans to release it soon. They are using cool deep learning techniques to learn optimization heuristics, so you may be interested in following their projects too!

hrshtv commented 3 years ago

Thanks for the explanation! Is there any documentation that explains the arguments of the functions used? For example, ck.access({...})

gfursin commented 3 years ago

Some limited description is available at https://ck.readthedocs.io/en/latest/src/ck.html#ck.kernel.access .

This function always takes a dict as input, with the following keys:

"module_uoa" : name of the CK module, such as program.static.features
"action" : name of the function inside that module

All other keys are passed to that function inside the given CK module.

You can find the input keys and the output dictionary for a given module and action from the command line as follows:

ck extract program.static.features --help

UOA is an abbreviation for CK UID or alias, i.e. you can use either the user-friendly name such as "program.static.features" or its internal UID (92a02f0445148203).
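
To make the calling convention concrete, here is a toy model of it. This is not CK's implementation: `toy_access`, `fake_extract`, `MODULES` and `ALIASES` are all hypothetical names invented for illustration. It only shows the idea described above: `module_uoa` (alias or UID) selects a module, `action` selects a function in it, and the remaining keys are handed to that function.

```python
from typing import Any, Callable, Dict

Action = Callable[[Dict[str, Any]], Dict[str, Any]]


def toy_access(request: Dict[str, Any],
               modules: Dict[str, Dict[str, Action]],
               aliases: Dict[str, str]) -> Dict[str, Any]:
    """Toy dispatcher mimicking the dict-in, dict-out convention."""
    uoa = request.get("module_uoa")
    action = request.get("action")
    if uoa is None or action is None:
        return {"return": 1, "error": "'module_uoa' and 'action' are required"}

    uid = aliases.get(uoa, uoa)  # an alias resolves to its UID; a UID is used as-is
    actions = modules.get(uid)
    if actions is None or action not in actions:
        return {"return": 1, "error": f"unknown module or action: {uoa}:{action}"}

    # Everything except the two routing keys goes to the selected function.
    params = {k: v for k, v in request.items() if k not in ("module_uoa", "action")}
    return actions[action](params)


def fake_extract(params: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in action that just reports which keys it received."""
    return {"return": 0, "received": sorted(params)}


MODULES = {"92a02f0445148203": {"extract": fake_extract}}
ALIASES = {"program.static.features": "92a02f0445148203"}

request = {"action": "extract", "data_uoa": "cbench-automotive-susan"}
by_alias = toy_access({**request, "module_uoa": "program.static.features"}, MODULES, ALIASES)
by_uid = toy_access({**request, "module_uoa": "92a02f0445148203"}, MODULES, ALIASES)
```

In this model both calls reach the same function, which is the point of accepting either form of the UOA.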

My hope/goal is to update all help pages for major APIs in 2021 ...

ChrisCummins commented 3 years ago

Hi @hrshtv, I'm following up here at Grigori's request with something that might be of interest to you. We just launched CompilerGym, a research platform for compiler autotuning. In particular, it exposes a handful of different program representations through a simple Python interface.

For LLVM, we have a variety of different program representations, though not MILEPOST (I'll look into how much work it would take to add).

The general usage would be:

Compile your program to LLVM-IR:

$ clang-10 -emit-llvm -c myapp.cc

In Python, create an LLVM environment to load your program, then print different observation spaces using:

>>> import gym
>>> import compiler_gym
>>> from compiler_gym.service.proto import Benchmark, File
# load the LLVM-IR file:
>>> path = "/path/to/myapp.bc"
>>> benchmark = Benchmark(uri=f"file:///{path}", program=File(uri=f"file:///{path}"))
# create a compiler session:
>>> env = gym.make("llvm-v0")
>>> env.reset(benchmark)
>>> env.observation["Programl"]
<networkx.classes.multidigraph.MultiDiGraph object at 0x7f9d8050ffa0>
>>> env.observation["Inst2vec"]
array([[-0.26956588,  0.47407162, -0.36637706, ..., -0.49256894,
         0.8016193 ,  0.71160674],
       [-0.59749085,  0.63315004, -0.0308373 , ...,  0.14833118,
         0.86420786,  0.44808227],
       [-0.59749085,  0.63315004, -0.0308373 , ...,  0.14833118,
         0.86420786,  0.44808227],
       ...,
       [-0.37584195,  0.43671703, -0.5360456 , ...,  0.6030259 ,
         0.82574934,  0.6306344 ],
       [-0.59749085,  0.63315004, -0.0308373 , ...,  0.14833118,
         0.86420786,  0.44808227],
       [-0.43074277,  0.8589559 , -0.35770646, ...,  0.28785184,
         0.8492773 ,  0.8914213 ]], dtype=float32)
>>> env.observation["Autophase"]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0])

where ProGraML and inst2vec are two recent state-of-the-art deep learning representations.

Cheers, Chris

Edit: typos, see question below
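
One detail of the snippet above: with an absolute `path`, the pattern `f"file:///{path}"` yields a URI with four slashes after `file:`. Whether that affects CompilerGym is not established in this thread. As a sketch, the standard library can build a conventional three-slash file URI instead; `file_uri` is a hypothetical helper name, and the values in the comments assume a POSIX system.

```python
from pathlib import Path


def file_uri(path: str) -> str:
    """Return a file:// URI for the absolute form of `path`."""
    return Path(path).absolute().as_uri()


path = "/path/to/myapp.bc"
naive = f"file:///{path}"  # 'file:////path/to/myapp.bc' (four slashes)
built = file_uri(path)     # 'file:///path/to/myapp.bc' on POSIX
```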

gfursin commented 3 years ago

Hey Chris,

Thanks for sharing - looks really cool!

I got stuck with the above example on the following line:

env.reset(benchmark="file:////home/gfursin/work/susan.bc")

ValueError: Unknown benchmark "file:////home/gfursin/work/susan.bc"

The example at https://github.com/facebookresearch/CompilerGym worked fine:

...
; Function Attrs: nounwind
declare i32 @sprintf(i8*, i8*, ...) #3

; Function Attrs: nounwind
declare double @pow(double, double) #3

attributes #0 = { nounwind uwtable "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { nounwind readnone speculatable willreturn }
attributes #2 = { "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #3 = { nounwind "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #4 = { nounwind }

[  0   0   7   3   4   7   6   4   1   6   0   0   0  14   0  13  22   5
  19  34   5  12  23   7   2   0   2  21   0   2  12   0  13  23   7   6
   0  32   0   0   0   1   7   0   0  23   0   0   0   0  14 136 106   5
   0  61]
...

Will dig further into your project during vacations.

Thanks again for the update!!! Grigori

gfursin commented 3 years ago

I moved this question here: https://github.com/facebookresearch/CompilerGym/issues/12 .

ctuning / reproduce-milepost-project

Extracting static features programmatically #11