[Question] Executable size

luizperes commented 8 years ago

Hi, thanks for this beautiful tutorial! It has been awesome to learn the basics of LLVM. My question is: do you know a way to minimize the size of the final executable? On my Linux machine the final executable comes out at about 80 MB, although I don't think I would need that much in pure C, for example. I have a Brainfuck interpreter that is 16 KB (small, but still quite big for what it does). I have no idea which libraries I could drop from the Makefile. Brainfuck is a very small language, so I would like to make a small compiler for it as well. I understand the power of LLVM, but Brainfuck may simply be too simple for it.

Anyway, I love your tutorial! Thank you so much! Luiz.

Lisapple commented 8 years ago

Hi Luiz,

Thank you for your interest in my tutorial. I wrote it after a few months of learning LLVM, and I wanted it to be the help I would have liked to have when I started, so I'm really happy to see that it has been useful to you.

LLVM is a really interesting project: you could invest plenty of hours in it and still not have dived deeply into any one of its many aspects. Even a few basic concepts are easily misunderstood, for instance how LLVM actually works when you are building a compiler. When you run the Makefile, the output is not an interpreter or a simple compiler: there is actually a virtual machine embedded in the binary, with all its libraries. In this case these are 'core' for all the basic functions, 'mcjit', a "Just in Time" compiler, and 'native' and 'nativecodegen', to support and generate assembly for the output target. That's why your binary takes dozens of MB (33 MB on my Mac). When you run it, it generates IR (Intermediate Representation, the internal LLVM language), starts LLVM (whose name originally stood for Low Level Virtual Machine) by calling the "InitializeNativeTarget*()" functions, creates an ExecutionEngine (to read the generated IR program and prepare it for execution) and finally runs it.

It's really important to understand that the tutorial only teaches how to write a compiler in the literal sense, meaning a translation from one language to another, in this case from BF to IR. This is what is called a "front-end". LLVM then either runs the IR or generates native code from it using a "back-end", which translates IR to assembly (for x86, ARM, SPARC, Hexagon and all other supported targets) or even to a binary. Once the IR is generated, all of our compiler's work is done and LLVM does the rest: it parses the IR, translates it to an assembly format supported by the system (like x86), builds a binary (like Mach-O on my Mac) and then executes it natively. All these steps ensure that you are ultimately executing your BF program as native code, as you would for C/C++ code.

So, what's the point of bringing in a complete virtual machine for a simple job like interpreting BF?

By writing a C program of a few lines, you can get the same results, thanks to clang (or gcc). But what is clang actually doing when it compiles the BF interpreter/compiler? It reads the C program, translates it to IR, runs optimisation passes (there are actually a lot of passes, not only for optimisation purposes; that's another really interesting subject), generates assembly code and then uses a linker to produce a binary (all of this is done by LLVM), which you can then execute natively on your OS. With our BF front-end, we remove the BF-to-C (done by hand) and C-to-IR parts and replace them with our BF-to-IR front-end. The rest works exactly the same and, once executed, the program runs as fast as a compiled C program. I haven't written about how to let LLVM generate binary code from IR; it's a little more complex than that (on Mac, it must use the system default linker, and it depends on the OS), but you can find documentation about it on the web (maybe I'll write about it one day). Once that is done, you've got a fully working compiler from BF to native binary, not a simple interpreter!
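To make the front-end idea concrete, here is a minimal sketch of a BF-to-IR translator that writes textual IR instead of going through the LLVM C++ API. This is not how the tutorial does it: the function name bfToIR, the 30000-cell tape, the value names and the label names are all assumptions made for this illustration, the ',' input command is left out, and the emitted IR uses the opaque 'ptr' type, so it assumes a recent LLVM (15 or later).

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of the tutorial): translates a BF program
// into textual LLVM IR without linking any LLVM library.
// No bounds checking is done on the tape in this sketch.
std::string bfToIR(const std::string& src) {
    std::ostringstream out;
    int tmp = 0;            // counter for fresh value names (%t0, %t1, ...)
    int loops = 0;          // counter for fresh loop labels
    std::vector<int> open;  // ids of the loops that are still open

    auto fresh = [&]() { return "%t" + std::to_string(tmp++); };
    // Emits the address computation for the current cell, returns its name.
    auto cellPtr = [&]() {
        std::string i = fresh(), p = fresh();
        out << "  " << i << " = load i64, ptr %idx\n";
        out << "  " << p << " = getelementptr [30000 x i8], ptr @tape, i64 0, i64 " << i << "\n";
        return p;
    };
    // Emits a load of the cell at address p, returns the loaded value's name.
    auto cellValue = [&](const std::string& p) {
        std::string v = fresh();
        out << "  " << v << " = load i8, ptr " << p << "\n";
        return v;
    };
    auto addToCell = [&](int delta) {
        std::string p = cellPtr(), v = cellValue(p), r = fresh();
        out << "  " << r << " = add i8 " << v << ", " << delta << "\n";
        out << "  store i8 " << r << ", ptr " << p << "\n";
    };
    auto movePointer = [&](int delta) {
        std::string i = fresh(), r = fresh();
        out << "  " << i << " = load i64, ptr %idx\n";
        out << "  " << r << " = add i64 " << i << ", " << delta << "\n";
        out << "  store i64 " << r << ", ptr %idx\n";
    };

    out << "@tape = internal global [30000 x i8] zeroinitializer\n"
        << "declare i32 @putchar(i32)\n\n"
        << "define i32 @main() {\nentry:\n"
        << "  %idx = alloca i64\n"
        << "  store i64 0, ptr %idx\n";

    for (char c : src) {
        switch (c) {
        case '+': addToCell(1); break;
        case '-': addToCell(-1); break;
        case '>': movePointer(1); break;
        case '<': movePointer(-1); break;
        case '.': {
            std::string v = cellValue(cellPtr()), w = fresh();
            out << "  " << w << " = zext i8 " << v << " to i32\n";
            out << "  " << fresh() << " = call i32 @putchar(i32 " << w << ")\n";
            break;
        }
        case '[': {  // loop header: test the cell, then enter or skip the body
            int id = loops++;
            open.push_back(id);
            out << "  br label %cond" << id << "\ncond" << id << ":\n";
            std::string v = cellValue(cellPtr()), t = fresh();
            out << "  " << t << " = icmp ne i8 " << v << ", 0\n";
            out << "  br i1 " << t << ", label %body" << id
                << ", label %end" << id << "\nbody" << id << ":\n";
            break;
        }
        case ']': {  // jump back to the matching loop header
            if (open.empty()) throw std::runtime_error("unmatched ']'");
            int id = open.back();
            open.pop_back();
            out << "  br label %cond" << id << "\nend" << id << ":\n";
            break;
        }
        default: break;  // ',' and comment characters are skipped in this sketch
        }
    }
    if (!open.empty()) throw std::runtime_error("unmatched '['");
    out << "  ret i32 0\n}\n";
    return out.str();
}
```

Because nothing from LLVM is linked in, the translator itself stays tiny. Saving the returned text to a file such as out.ll (a hypothetical name) and passing it to clang, which accepts .ll files as input, should then give a small native executable, for example with "clang out.ll -o out".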

If you want to learn more, I suggest you look at one of my other projects, SMILLVM, on my GitHub. It's an LLVM front-end for my dummy SMIL language: it reads a program file, generates IR and executes it with command-line arguments. Once the project is built, you run it with an input SMIL program; it generates IR, runs optimisations and executes the binary with the input arguments. In fact, it runs like a native interpreter. It is missing only one thing to be a compiler: the ability to output a native binary that can be executed separately (as clang does), containing only the SMIL program (without LLVM embedded in it), so that you can run it with input arguments as you would a C program. All the work is done; it's only missing a few lines to write out the binary once generated. I should really do that one day ;)

I also made a Python interpreter version, PySMIL, which is much simpler and shorter (a few hundred lines of Python vs a few thousand lines of C++ for the LLVM front-end). And if you really want to be a full LLVM user, you can use Pyston, a Python-to-IR front-end that generates binary code using LLVM to execute Python natively.

I hope this helps you understand that LLVM is a really big thing: complex, but also a useful, even beautiful way to build the advanced programs that you use every day without worrying about the complexity of how they get from code to binary.

Feel free to ask any further questions, but you'll find much more information on the subject on the web (starting with the LLVM documentation). I also recommend that you read the LLVM IR documentation (http://llvm.org/docs/LangRef.html) and experiment with the command "clang -S -emit-llvm {a c program}" to see how LLVM generates IR from a C program, and also from a C++ program ("clang++ -S -emit-llvm {a c++ program}"). You'll be surprised by the difference!
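As a small starting point for that experiment, here is a sketch of a C++ file you could feed to clang++. The file name sample.cpp and both function names are made up for this illustration. Name mangling is only one of the differences you might notice, and the mangled name given in the comment assumes an Itanium-ABI platform such as Linux or macOS.

```cpp
// sample.cpp (hypothetical file name, used only for this experiment)

// A plain C++ function: in the emitted IR its symbol is mangled. On
// Itanium-ABI platforms (Linux, macOS) it appears as @_Z3addii.
int add(int a, int b) { return a + b; }

// extern "C" switches mangling off, so this function keeps the plain
// name @add_c in the IR, as it would when compiled as C.
extern "C" int add_c(int a, int b) { return a + b; }
```

Running "clang++ -S -emit-llvm sample.cpp" should write the IR to sample.ll, where the two "define" lines can be compared side by side.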

Good luck! Max.

luizperes commented 8 years ago

Hi Max, thank you so much for your detailed post! There was indeed a lot of information I was mistaken about and/or didn't know!

Thanks, Luiz.

luizperes commented 8 years ago

Sorry Max, one more question (if you could help me with this one). I've done some research on the Internet about llvm::getGlobalContext(), but I couldn't find any information about it.

I'm getting this error: "error: use of undeclared identifier 'getGlobalContext'" for the line "llvm::LLVMContext &C = getGlobalContext();". In my LLVMContext.h header file I can see that this function is no longer there. However, I don't know how to get or create a valid context. After trying all sorts of ways, it seems that the LLVM code is valid, although it is not executed by the MCJIT.

[Screenshot: 2016-04-25, 4:09 pm]

Do you know anything about that? Thanks a lot!

luizperes commented 8 years ago

I found a solution myself (though I don't know if it is the right way to do it): instead of "LLVMContext &C = getGlobalContext();", it runs with "LLVMContext C;". I'm not sure whether that is right, though.

Thanks!

Lisapple commented 8 years ago

Hi Luiz,

You are using the SVN/Git version (pre-3.8.1). I haven't tested this project against the latest committed work, only against the latest releases (the latest is currently 3.8.0, which still contains the 'getGlobalContext' function). There may be a plan to remove this function in the future; I haven't followed the commits on this. You should use releases only: this will save you from having to modify your code on each commit ;)

Yes, you can use 'LLVMContext C;'. This will use a default context with empty data (and initialised with the current target data). I'm really not sure about doing this on earlier versions, but it could be preferred (if not mandatory) in future LLVM releases.

luizperes commented 8 years ago

I figured it out! Thanks!

Lisapple / BF-Compiler-Tutorial-with-LLVM

[Question] Executable size #2