LaurentMazare / ocaml-torch

OCaml bindings for PyTorch
Apache License 2.0

Stack overflow from the compiler. #50

Closed pveber closed 3 years ago

pveber commented 3 years ago

When trying to build the current master (corresponding to PyTorch 1.7), I get a stack overflow from the compiler on src/wrapper/torch_bindings_generated.ml. It is indeed a very long file, but I'm still surprised. Here is the backtrace:

Fatal error: exception Stack overflow
Raised by primitive operation at Mach.instr_cons_debug in file "asmcomp/mach.ml", line 137, characters 2-185
Re-raised at Misc.try_finally in file "utils/misc.ml", line 45, characters 10-56
Called from Asmgen.(++) in file "asmcomp/asmgen.ml" (inlined), line 79, characters 15-18
Called from Asmgen.compile_fundecl in file "asmcomp/asmgen.ml", line 84, characters 2-624
Called from Stdlib__list.iter in file "list.ml", line 110, characters 12-15
Called from Misc.try_finally in file "utils/misc.ml", line 31, characters 8-15
Re-raised at Misc.try_finally in file "utils/misc.ml", line 45, characters 10-56
Called from Asmgen.(++) in file "asmcomp/asmgen.ml" (inlined), line 79, characters 15-18
Called from Asmgen.end_gen_implementation in file "asmcomp/asmgen.ml", line 153, characters 2-128
Called from Misc.try_finally in file "utils/misc.ml", line 31, characters 8-15
Re-raised at Misc.try_finally in file "utils/misc.ml", line 45, characters 10-56
Called from Asmgen.compile_unit.(fun) in file "asmcomp/asmgen.ml", line 134, characters 7-231
Called from Misc.try_finally in file "utils/misc.ml", line 31, characters 8-15
Re-raised at Misc.try_finally in file "utils/misc.ml", line 45, characters 10-56
Called from Optcompile.clambda.(fun) in file "driver/optcompile.ml", line 78, characters 7-336
Called from Misc.try_finally in file "utils/misc.ml", line 31, characters 8-15
Re-raised at Misc.try_finally in file "utils/misc.ml", line 45, characters 10-56
Called from Compile_common.implementation.(fun) in file "driver/compile_common.ml", line 121, characters 71-113
Called from Misc.try_finally in file "utils/misc.ml", line 31, characters 8-15
Re-raised at Misc.try_finally in file "utils/misc.ml", line 45, characters 10-56
Called from Misc.try_finally in file "utils/misc.ml", line 31, characters 8-15
Re-raised at Misc.try_finally in file "utils/misc.ml", line 45, characters 10-56
Called from Misc.try_finally in file "utils/misc.ml", line 31, characters 8-15
Re-raised at Misc.try_finally in file "utils/misc.ml", line 45, characters 10-56
Called from Compenv.process_action in file "driver/compenv.ml", line 596, characters 6-59
Called from Stdlib__list.iter in file "list.ml", line 110, characters 12-15
Called from Compenv.process_deferred_actions in file "driver/compenv.ml", line 672, characters 2-61
Called from Optmain.main in file "driver/optmain.ml", line 55, characters 6-163
Re-raised at Location.report_exception.loop in file "parsing/location.ml", line 926, characters 14-25
Called from Optmain.main in file "driver/optmain.ml", line 133, characters 6-37
Called from Optmain in file "driver/optmain.ml", line 137, characters 2-9

I'm a bit puzzled because the backtrace is short, while I'd have expected it to be very long given that this is a stack overflow. Have you also run into this problem at some point? There might be something fishy with my environment, I don't know, but at least if I comment out part of the file it compiles fine.

LaurentMazare commented 3 years ago

I'm also running into this (and so does the CI). I haven't found a good solution, so what I do is run only a single process when compiling the large files (via dune -j 1) and set ulimit -s unlimited so that the stack can grow as large as it needs. The files are certainly large; that said, it does not happen with the Rust version of this library, so maybe it's an issue within the compiler. I haven't reported anything on the compiler's GitHub so far.

pveber commented 3 years ago

Increasing the stack memory helps here:

$ ulimit -s 16384

So it's easy to work around, but more users of the library might stumble on the same problem once the next version is released. I also noticed that this compilation step requires more than 6 GB of RAM on my machine. I think we could derive a useful example to help the compiler team improve the compiler's performance.

pveber commented 3 years ago

Sorry, I just saw your answer, thanks! Right, I think a report would be useful; this is really unusual performance. The thing is, reproducing the problem requires a few dependencies, and I'm not sure what form of report would be most useful to the compiler team.

LaurentMazare commented 3 years ago

Right, reporting it is likely to be useful but probably requires a bit of work to derive a "standalone" example. If you want to take a stab at it, please go ahead, otherwise I'll try to have a go at it when I find some time but it's unlikely to be in the next few weeks.

pveber commented 3 years ago

Done! It's not pretty, but it's standalone :). I'm filing a report now.

LaurentMazare commented 3 years ago

Thanks, this was super fast!

LaurentMazare commented 3 years ago

Following up on the discussion in https://github.com/ocaml/ocaml/issues/10072, I've tweaked the code generation to generate multiple functors (each with fewer than 100 included functions) rather than a single one, and to combine them at the end. This seems to get rid of the stack overflow issue for me.
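
For readers landing here later, the split described above can be sketched roughly as follows. This is only an illustration of the general shape, not the actual generated code: the FOREIGN signature, the module names C0, C1 and C, the binding names f0, f1 and f2, and the Fake implementation are all hypothetical stand-ins so that the snippet compiles without any dependency.

```ocaml
(* Hypothetical stand-in for a ctypes-style "foreign" signature.
   It only exists so that this sketch is self-contained. *)
module type FOREIGN = sig
  val foreign : string -> int -> int
end

(* Instead of one huge functor holding every binding, the generator
   emits several small ones. Each would hold fewer than 100 bindings;
   only a couple are shown here. *)
module C0 (F : FOREIGN) = struct
  let f0 = F.foreign "f0"
  let f1 = F.foreign "f1"
end

module C1 (F : FOREIGN) = struct
  let f2 = F.foreign "f2"
end

(* The small functors are combined at the end, so users still see a
   single functor exposing all the bindings. *)
module C (F : FOREIGN) = struct
  include C0 (F)
  include C1 (F)
end

(* Fake implementation, used only to show that C can be applied. *)
module Fake : FOREIGN = struct
  let foreign name x = x + String.length name
end

module Bindings = C (Fake)
```

With this layout each functor body stays small, which is presumably what keeps the compiler's stack usage bounded, while callers keep using a single combined functor as before.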

LaurentMazare commented 3 years ago

Closing this as it should hopefully be fixed now; feel free to re-open should you run into any further issues.

pveber commented 3 years ago

I can confirm the problem is gone, thanks a lot!