crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.41k stars 1.62k forks source link

code with static arrays is very very slow to compile with the --release flag #2485

Open alainravet opened 8 years ago

alainravet commented 8 years ago
# Crystal version 15, MacBook Pro (Retina, 13-inch, Early 2015, 16GB RAM)
#0.5 seconds : crystal build  staticarrays.cr ; ./staticarrays
#55 seconds  : crystal build  --release staticarrays.cr ; ./staticarrays

# file: staticarrays.cr
class Wrap
  def initialize
    @elements = StaticArray(String|Nil, 2048).new {}
    @next_elem = 0
  end

  def add(value)
    @elements[@next_elem] = value
    @next_elem += 1
  end

  def values
    @elements.to_a[0..@next_elem-1].compact
  end
end

w = Wrap.new
w.add "un"
w.add "deux"

puts w.values.inspect
asterite commented 8 years ago

I can't reproduce this. It does take a long time to compile because of the big static array (LLVM is slow with big static arrays), but eventually it compiles and runs fine.

When you say "FAIL", what do you mean? What happens?

ozra commented 8 years ago

Could be an issue with running out of RAM or similar? Just thinking loud...

alainravet commented 8 years ago

You're right @asterite, it does work but it's very very slow This sample takes about 0.5 sec without the release flag, and 50 seconds with the flag. (My original code had two of those static arrays and I gave up after 2 minutes.)

Is there a way to improve this behaviour?

asterite commented 8 years ago

@alainravet Yes, use a regular array. @elements = [] of String?. StaticArray has very few use cases, mostly low-level ones.

alainravet commented 8 years ago

@asterite I'm trying to optimize CPU intensive code (a chess engine). A "how to optimize your code" guide/wiki page would be useful.

waterlink commented 8 years ago

Slowness of non-static arrays (vector) usually comes from size extension operations. Therefore significant speed-up should be observable when vectors are created with pre-defined size (that won't change in runtime). It will not be as fast as static array (with all optimizations applied), but it should be faster then blind usage of vectors.

BTW, how does the benchmark of current code with non-static arrays look like?

Would it be possible to share code (if it is open-source), so that we can play around with it and come up with some optimizations? (I would be interested myself in that).

alainravet commented 8 years ago

@waterlink I'm porting a C code project to Crystal and, at first, I refactored to OO as much as possible. The code is better structured and more readable and more ... , but it also creates tons of objects and is slow (relatively speaking). I am now unfactoring the expensive parts, using pools of arrays to reuse instead of creating new ones, unwrapping primitives (as we can't extend the Int* structs 😁 ), ... It looks like you can't have your cake and eat it after all.

BTW, how does the benchmark of current code with non-static arrays look like?

It takes 2 minutes to explore 119_060_324 millions possible moves in 6 plies (Perft test). I'll share it next week, once it's cleaner/less messy. It could be useful to measure performance changes between Crystal versions.

asterite commented 8 years ago

@alainravet I can't say how to optimize the program without profiling it, but another alternative to Array is Slice. The "problem" with Array is that if you do array[0..index], that is, you take a subrange of it, a new array is allocated. With a Slice, a view to the original slice is returned so no memory allocation is involved. However, a Slice cannot be resized. But it seems in your case you probably don't need to resize it.

In any case, once you share you'll code I'm sure we can all optimize together to be in the order of the C program.

asterite commented 8 years ago

There's nothing we can do here. LLVM is known to be slow with big static arrays. These should only be used in very few cases. Array and Slice and most appropriate in most cases. So I'm closing this.

mattrberry commented 4 years ago

Edit: Much of the issue I've described below isn't necessarily related. However, this slow compilation is still 100% an issue.

@asterite I just had a similar problem and spent a couple of hours trying to figure out the cause. The error messages were also unhelpful, which I'll describe below. I simply added a StaticArray(Float32, 4096) as an instance variable, causing my --release compile time jumped from ~5 seconds to ~4 minutes. I think it would be useful to add some more specifics regarding the use-cases of StaticArrays to the API docs.

As for the error messages, I ctrl-c'd the shards build --release the first time through since it was taking so long. Then, when trying to build with shards build --release again, I received 3000+ lines of output, of what appears to be one line of output for every single time an instance variable was referenced in my program. The last 40 lines lines were these, with the 3000+ leading up to this point resembling the ones at the top:

/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2062: undefined reference to `*CPU#a:UInt8'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2062: undefined reference to `*CPU#f_c=<Bool>:UInt8'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2063: undefined reference to `*CPU#f_n=<Bool>:UInt8'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2068: undefined reference to `*CPU#pc=<UInt16>:UInt16'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2069: undefined reference to `*CPU#sp=<UInt16>:UInt16'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2070: undefined reference to `*Memory#[]=<UInt16, UInt16>:Nil'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2071: undefined reference to `*CPU#pc=<UInt16>:UInt16'
_main.o: In function `raise':
/usr/share/crystal/src/raise.cr:191: undefined reference to `*CallStack::new:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*SDL::Error@Exception#callstack=<CallStack>:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*Pointer(LibUnwind::Exception)@Pointer(T)::malloc:Pointer(LibUnwind::Exception)'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
_main.o: In function `->':
/home/matt/code/github/CryBoy/src/cryboy/motherboard.cr:18: undefined reference to `*SDL::quit:Nil'
_main.o: In function `at_exit':
/usr/share/crystal/src/kernel.cr:471: undefined reference to `*Crystal::AtExitHandlers::add<Proc(Int32, (Exception+ | Nil), Nil)>:Array(Proc(Int32, (Exception+ | Nil), Nil))'
_main.o: In function `->':
/usr/share/crystal/src/gc/boehm.cr:164: undefined reference to `*SDL::Window#finalize:Nil'
/usr/share/crystal/src/gc/boehm.cr:164: undefined reference to `*SDL::Renderer#finalize:Nil'
_main.o: In function `->':
/usr/share/crystal/src/callstack.cr:125: undefined reference to `*Pointer(Void)@Comparable(T)#==<Pointer(Void)>:Bool'
/usr/share/crystal/src/callstack.cr:126: undefined reference to `*CallStack::RepeatedFrame#incr:Int32'
/usr/share/crystal/src/callstack.cr:128: undefined reference to `*CallStack::print_frame<CallStack::RepeatedFrame>:Int32'
/usr/share/crystal/src/callstack.cr:129: undefined reference to `*CallStack::RepeatedFrame::new<Pointer(Void)>:CallStack::RepeatedFrame'
_main.o: In function `raise':
/usr/share/crystal/src/raise.cr:191: undefined reference to `*CallStack::new:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*OverflowError@Exception#callstack=<CallStack>:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*Pointer(LibUnwind::Exception)@Pointer(T)::malloc:Pointer(LibUnwind::Exception)'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*CallStack::new:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*Exception+@Exception#callstack=<CallStack>:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*Pointer(LibUnwind::Exception)@Pointer(T)::malloc:Pointer(LibUnwind::Exception)'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
/usr/bin/ld: /home/matt/code/github/CryBoy/bin/cryboy: No symbol version section for versioned symbol `*File@IO::Evented#resume_read<Bool>:Nil'
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
Error: execution of command failed with code: 1: `cc "${@}" -o '/home/matt/code/github/CryBoy/bin/cryboy'  -rdynamic  -lSDL2 -lpcre -lm /usr/bin/../lib/crystal/lib/libgc.a -lpthread /usr/share/crystal/src/ext/libcrystal.a -levent -lrt -ldl -L/usr/bin/../lib/crystal/lib -L/usr/bin/../lib/crystal/lib`
jhass commented 4 years ago

Since you Ctrl+C, the error messages are more due to #5890 / #9314 :)

j8r commented 4 years ago

StaticArray#new(&) code (which causes the slow compilation) reduced:

array = uninitialized Int32[4096]
4096.times do |i|
  array.to_unsafe[i] = 0
end
array

I have also found a work-around that compiles a lot faster:

struct StaticArray(T, N)
  def self.new_fast(& : Int32 -> T)
    array = uninitialized self
    buf = array.to_unsafe
    {% for i in 0...N %}
    buf[{{i.id}}] = yield {{i.id}}
    {% end %}
    array
  end
end
jhass commented 4 years ago

So, we did some digging into this and apparently Crystal in --release mode invokes LLVM in a way that makes it really hard for LLVM to optimize that loop. Doing as little as dumping the unoptimized LLVM IR (with crystal build --emit=llvm-ir) and running llc -O3 on it, is substantially (orders of magnitude) faster. It only takes a couple seconds.

In both cases, llc -O3 and crystal build --release, LLVM ends up completely unrolling that loop. So @j8r macro version above, just pre-unrolls the loop in Crystal land and should lead to equivalent of the optimized code. So it seems that LLVM with the crystal build --release options somehow pays a much higher price during loop unrolling than it does for llc -O3.

The first candidates I could find from a quick glance here seem to be https://github.com/crystal-lang/crystal/blob/68f0f38a22d1470eee722f6c41a2532f3ef2bb78/src/compiler/crystal/compiler.cr#L572-L573 but removing them makes no real difference in my tests.

So, if somebody can figure out what the heck is different in the optimizer configuration between llc and crystal here (it might be easier to see if you can get llc to be slow too, then figure out how to avoid this setting from the compiler), that might give a general compiler speed bump beyond this issue.

If not, we should at least apply @j8r's workaround for now.

waj commented 4 years ago

@jhass you're just running llc which translates LLVM bitcode to machine code, but you need to run opt -O3 first to reproduce the same delay. In my test opt takes just a couple of seconds, but then that optimized output takes forever for llc to translate.

I guess this has something to do with returning the array as value from the function. Note that this cannot be reproduced in C so probably clang never faced this issue. The workaround still returns the array as value, but for some reason that combination trigger some optimization edge case.

I just tried replacing the new method with an initialize instead, that internally receives the struct as a pointer. It should be exactly the same but it doesn't trigger the problem. I'll perform more tests locally and send a PR.

jhass commented 4 years ago

Huh, that's confusing, why does llc take an -O3 option then? Also in my tests the llc -O3 generated assembly looked identical to the optimized LLVM IR, in that it fully unrolled the loop.

waj commented 4 years ago

I know, and for a second I thought we had an opportunity for a huge performance improvement. But the result is not the same. The optimizations they perform are different. I just tested with samples/http_server.cr and running llc optimizations alone improves from debug mode, but not as much as running all the optimizations.

jhass commented 4 years ago

I see, well TIL. Oh dear LLVM, could be so easy...

kefahi commented 3 years ago

I have very similar observation when trying to build a simple test code with relatively large array. I found that code as part of some bench marking exercise that test code has an array with 10,000 entries. Running crystal build --no-debug --progress --release crTest.cr is not slow it actually enters an infinite loop (it never ends) with 100% cpu utilization in the last stages. If the size of the array is reduced, say to 1000 entries instead of 10,000, the code compiles successfully in 30 seconds.

I use Crystal 0.36.1 on Fedora.

jwoertink commented 3 years ago

I'm also getting this using 1.0.0. I tried to create a StaticArray of String with this domain list (about 120,000 strings). The file looked like:

StaticArray[
"...",
"...",
"and so on for another 120k rows",
]

After 23minutes, my terminal died when my MBP hit 25GB usage on Crystal 😂

❯ crystal run --release --stats email.cr
Parse:                             00:00:00.111867105 (  30.85MB)
Semantic (top level):              00:00:01.072054227 ( 310.95MB)
Semantic (new):                    00:00:00.001057807 ( 310.95MB)
Semantic (type declarations):      00:00:00.080901451 ( 310.95MB)
Semantic (abstract def check):     00:00:00.004426140 ( 310.95MB)
Semantic (ivars initializers):     00:00:00.051636747 ( 310.95MB)
Semantic (cvars initializers):     00:00:00.092158086 ( 342.95MB)
Semantic (main):                   00:00:00.578983974 ( 407.00MB)
Semantic (cleanup):                00:00:00.003794497 ( 407.00MB)
Semantic (recursive struct check): 00:00:00.000463927 ( 407.00MB)
Codegen (crystal):                 00:00:01.779074775 ( 487.12MB)
zsh: killed
Crystal 1.0.0 [dd40a2442] (2021-03-22)

LLVM: 6.0.1
Default target: x86_64-apple-macosx
kefahi commented 3 years ago

I, too, confirm that the issue still exists with Crystal 1.0.0.

jwoertink commented 3 years ago

Maybe StaticArray can do what Tuple does with the limit? https://github.com/crystal-lang/crystal/blob/d09d9e75716b35391e3e0ed5a4dbf112661fad05/src/compiler/crystal/semantic/bindings.cr#L617 Or could at least be a "quick fix" until something better comes along. In this case it would at least tank the compiler instead of chilling in the CI (my CI) for an hour 😂

asterite commented 3 years ago

No because it's valid to have static arrays of ten K elements. For example for a buffer in the stack.

The issue is initializing a large static array.

But in general, I would advise simply not using static array at all. Use Array.

straight-shoota commented 3 years ago

Showing an error message early is certainly better than setting the compiler off to run forever. The error could just apply to static array literals over a certain size. That should be fine to do.

I totally agree on the general note to just use array instead. The use case for static array is to allocate stack memory in a performance-sensitive path where heap allocation would be costly. But don't put program data in a static array. It makes not much sense.

asterite commented 3 years ago

The use case for static array is to allocate stack memory in a performance-sensitive path

Yes, actually, the only reason we have StaticArray is to be able to get a buffer of bytes on the stack. Any other usage is pretty contrived in my opinion. And if that's the case, it might be good (for 2.0) to just remove the type and have a specific type for that.

straight-shoota commented 3 years ago

remove the type and have a specific type for that.

What do you mean? I'd actually consider StaticArray already a special type for that. It's just technically possible to use it for other things, too...

asterite commented 3 years ago

I mean, removing StaticArray(T) and replacing it with a non-generic type ByteBuffer. That way people can't use it to create a giant static array of strings, because such a type wouldn't exist.

Fryguy commented 3 years ago

I think part of the issue is the name StaticArray doesn't automatically trigger a "this is on the stack" thought in one's mind. Name-wise it just sounds like a normal Array but defined statically with a fixed size.

kefahi commented 3 years ago

Not sure I understand the argument against StaticArray, but I switched the test code to use it and it solved the problem for me.

speedTest ) time crystal build --stats  --no-debug --progress --warnings all --time --release --ll --verbose crTest.cr
Parse:                             00:00:00.006544409 (   3.27MB)
Semantic (top level):              00:00:00.257241234 (  58.52MB)
Semantic (new):                    00:00:00.000920566 (  58.52MB)
Semantic (type declarations):      00:00:00.018386054 (  74.52MB)
Semantic (abstract def check):     00:00:00.004317707 (  74.52MB)
Semantic (ivars initializers):     00:00:00.014386552 (  74.52MB)
Semantic (cvars initializers):     00:00:00.069815220 (  90.52MB)
Semantic (main):                   00:00:00.196561334 ( 106.58MB)
Semantic (cleanup):                00:00:00.003858058 ( 106.58MB)
Semantic (recursive struct check): 00:00:00.000710583 ( 106.58MB)
Codegen (crystal):                 00:00:00.291614535 ( 122.58MB)
Codegen (bc+obj):                  00:01:11.217718396 ( 122.58MB)
cc _main.o -o /home/kefah/Downloads/youtube-tutorial-scripts/speedTest/crTest  -rdynamic -L/usr/bin/../lib/crystal/lib -lpcre -lm -lgc -lpthread /usr/share/crystal/src/ext/libcrystal.a -levent  -lrt -ldl
Codegen (linking):                 00:00:00.619240754 ( 122.58MB)

Codegen (bc+obj):
 - no previous .o files were reused

________________________________________________________
Executed in   72.87 secs    fish           external
   usr time   70.54 secs    0.20 millis   70.54 secs
   sys time    1.91 secs    1.03 millis    1.91 secs
darkstego commented 2 years ago

I do think StaticArray can be beneficial in cases like #11370. If I simply change to a regular Array performance drops by 33x for small strings to around 2x for long ones, which seems to indicate a significant increase in initialization times.

But I do believe that until a resolution is found it merits at least a compiler warning for StaticArrays over a certain size. I got bit by this one, but luckily it happened just after a single change so I was able to quickly isolate the problem. But I can imagine situations where someone is scratching their head trying to figure out why compile speeds suddenly got out of hand.