Open trzecieu opened 3 years ago
This isn't a known issue. It should fail to allocate (and halt), or it should hit the limit in the code you quote and avoid running the analysis. That it runs and then segfaults suggests we are missing a check somehow. But reading the code for copies, it is allocated properly AFAICT.
Do these assertions catch anything?
diff --git a/src/cfg/liveness-traversal.h b/src/cfg/liveness-traversal.h
index 6deab2fd6..7eec02682 100644
--- a/src/cfg/liveness-traversal.h
+++ b/src/cfg/liveness-traversal.h
@@ -277,6 +277,7 @@ struct LivenessWalker : public CFGWalker<SubType, VisitorType, Liveness> {
   }
   void addCopy(Index i, Index j) {
+    assert(i < numLocals && j < numLocals);
     auto k = std::min(i, j) * numLocals + std::max(i, j);
     copies[k] = std::min(copies[k], uint8_t(254)) + 1;
     totalCopies[i]++;
@@ -284,6 +285,7 @@ struct LivenessWalker : public CFGWalker<SubType, VisitorType, Liveness> {
   }
   uint8_t getCopies(Index i, Index j) {
+    assert(i < numLocals && j < numLocals);
     return copies[std::min(i, j) * numLocals + std::max(i, j)];
   }
 };
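For anyone who wants to poke at the indexing outside of binaryen, here is a minimal standalone sketch of what those two methods do: a symmetric lookup into a dense numLocals x numLocals byte matrix with a counter that saturates at 255. CopyTable and slot are hypothetical names, and the totalCopies[j] update and the size_t widening are assumptions on my part, not part of the diff above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Index = uint32_t;

// Hypothetical stand-in for the copy-tracking part of LivenessWalker.
struct CopyTable {
  Index numLocals;
  std::vector<uint8_t> copies;       // numLocals * numLocals one-byte counters
  std::vector<uint32_t> totalCopies; // one running total per local

  explicit CopyTable(Index n)
    : numLocals(n), copies(size_t(n) * n, 0), totalCopies(n, 0) {}

  // (i, j) and (j, i) map to the same slot, so the matrix is symmetric.
  size_t slot(Index i, Index j) const {
    assert(i < numLocals && j < numLocals);
    // Widen before multiplying so the product cannot wrap in 32 bits.
    return size_t(std::min(i, j)) * numLocals + std::max(i, j);
  }

  void addCopy(Index i, Index j) {
    auto k = slot(i, j);
    // Saturate at 255 instead of wrapping back to 0.
    copies[k] = std::min(copies[k], uint8_t(254)) + 1;
    totalCopies[i]++;
    totalCopies[j]++; // assumed; this line is outside the quoted hunk
  }

  uint8_t getCopies(Index i, Index j) const { return copies[slot(i, j)]; }
};
```

With in-range indices this can never touch memory outside copies, which is why the assertions are only useful if something passes an index at or above numLocals.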
Looking at the line numbers in the traces, they don't match anything reasonable in current main
or in the hash mentioned in the link. Which specific version was this compiled from?
This time, testing the modified binaryen, I got a segmentation fault.
Interestingly, wasm2js didn't fail on my other machine, which suggested that a hardware limitation could be the problem. And indeed: wasm2js ran out of RAM (probably not a memory leak, but the growing totalCopies vector).
While tracking memory consumption on another configuration I found that processing a 90MiB wasm file requires 40GiB of RAM at peak, which exceeds the original machine's system memory (32GiB); the crash there happened when the last measurement showed 28GiB consumed by wasm2js.
Here is a chart of memory usage over time for a system with 64GiB of RAM
Ok, definitely sounds like running out of memory then. I'm surprised it doesn't report an error, but perhaps while things run in parallel it runs out of memory in another pass and hits an unhandled code path somehow.
We should probably reduce the limit here: https://github.com/WebAssembly/binaryen/blob/760a51bd15a51f02bc1c75087a9cd9e11b9f27bb/src/cfg/liveness-traversal.h#L179-L182
Right now that stops at 2^16 locals, which means a matrix of 2^32 entries of 1 byte each, so 4GB. When running in parallel we can use that much per core, and so with 10 cores you can reach 40GB as in that screenshot.
To confirm that's the issue, does running with say BINARYEN_CORES=4 in the environment fix things? (then it should use just 4 cores, and less than 16GB)
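As a rough sanity check of that estimate, the arithmetic can be written down directly. This is only a sketch: copiesMatrixBytes and peakBytes are hypothetical helper names, and it counts nothing but the dense one-byte-per-entry copies matrix, assuming every worker happens to be analysing a maximally large function at the same moment.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a dense numLocals x numLocals matrix of 1-byte entries.
constexpr uint64_t copiesMatrixBytes(uint64_t numLocals) {
  return numLocals * numLocals * sizeof(uint8_t);
}

// Worst case if each of `cores` workers holds such a matrix at once.
constexpr uint64_t peakBytes(uint64_t numLocals, uint64_t cores) {
  return copiesMatrixBytes(numLocals) * cores;
}
```

For a function right at the 2^16 limit, copiesMatrixBytes gives 2^32 bytes (4 GiB), and peakBytes with 4 cores gives 16 GiB. Real usage will normally be lower, since few functions are anywhere near the limit.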
Hi, the original test was performed on a machine with 4 cores with HT, so 8 logical cores.
When I limited execution with BINARYEN_CORES=4 and made sure that only 4 workers were spawned, I still hit the memory limit, just a little later. In general it looks like one particular section of code is problematic for wasm2js to handle, as memory spikes pretty drastically there:
The build is pretty unusual, and I suspect that ASAN adds a large number of locals to the wasm binary, which wasm2js later struggles to handle. I will try to process the *.wat file to look for outlier functions. Is there anything in particular that I should look at closely?
Hmm, it may be due to flattening, which increases the number of locals. Looking at the local numbers might be interesting.
Running with BINARYEN_PASS_DEBUG=1 in the env might help narrow down where the oom happens too.
It took significantly more time this way (4h30m) and also ended in termination, with the following log:
[PassRunner] running passes...
[PassRunner] running pass: autodrop... 46.8855 seconds.
[PassRunner] (validating)
[PassRunner] running pass: legalize-js-interface... 1.79304 seconds.
[PassRunner] (validating)
[PassRunner] running pass: remove-non-js-ops... 9.32078 seconds.
[PassRunner] (validating)
[PassRunner] running pass: flatten... 242.145 seconds.
[PassRunner] (validating)
[PassRunner] running pass: i64-to-i32-lowering... 247.078 seconds.
[PassRunner] (validating)
[PassRunner] running pass: alignment-lowering... 25.0054 seconds.
[PassRunner] (validating)
[PassRunner] running pass: simplify-locals-nonesting... 571.282 seconds.
[PassRunner] (validating)
[PassRunner] running pass: precompute-propagate... 1598.1 seconds.
[PassRunner] (validating)
[PassRunner] running pass: avoid-reinterprets... 684.118 seconds.
[PassRunner] (validating)
[PassRunner] running pass: duplicate-function-elimination... 31.4696 seconds.
[PassRunner] (validating)
[PassRunner] running pass: ssa-nomerge... 772.552 seconds.
[PassRunner] (validating)
[PassRunner] running pass: dce... 149.285 seconds.
[PassRunner] (validating)
[PassRunner] running pass: remove-unused-brs... 196.556 seconds.
[PassRunner] (validating)
[PassRunner] running pass: remove-unused-names... 22.9558 seconds.
[PassRunner] (validating)
[PassRunner] running pass: optimize-instructions... 69.3187 seconds.
[PassRunner] (validating)
[PassRunner] running pass: pick-load-signs... 47.6139 seconds.
[PassRunner] (validating)
[PassRunner] running pass: precompute... 52.1255 seconds.
[PassRunner] (validating)
[PassRunner] running pass: code-pushing... 95.9126 seconds.
[PassRunner] (validating)
[PassRunner] running pass: simplify-locals-nostructure... 578.949 seconds.
[PassRunner] (validating)
[PassRunner] running pass: vacuum... 201.927 seconds.
[PassRunner] (validating)
[PassRunner] running pass: reorder-locals... 404.637 seconds.
[PassRunner] (validating)
[PassRunner] running pass: remove-unused-brs... 65.9696 seconds.
[PassRunner] (validating)
[PassRunner] running pass: coalesce-locals... 283.747 seconds.
[PassRunner] (validating)
[PassRunner] running pass: simplify-locals... 297.147 seconds.
[PassRunner] (validating)
[PassRunner] running pass: vacuum... 98.7764 seconds.
[PassRunner] (validating)
[PassRunner] running pass: reorder-locals... 25.6809 seconds.
[PassRunner] (validating)
[PassRunner] running pass: coalesce-locals... 145.851 seconds.
[PassRunner] (validating)
[PassRunner] running pass: reorder-locals... 25.5365 seconds.
[PassRunner] (validating)
[PassRunner] running pass: vacuum... 82.2849 seconds.
[PassRunner] (validating)
[PassRunner] running pass: code-folding... 64.0878 seconds.
[PassRunner] (validating)
[PassRunner] running pass: merge-blocks... 78.4132 seconds.
[PassRunner] (validating)
[PassRunner] running pass: remove-unused-brs... 59.0452 seconds.
[PassRunner] (validating)
[PassRunner] running pass: remove-unused-names... 10.5026 seconds.
[PassRunner] (validating)
[PassRunner] running pass: merge-blocks... 52.6485 seconds.
[PassRunner] (validating)
[PassRunner] running pass: precompute... 24.882 seconds.
[PassRunner] (validating)
[PassRunner] running pass: optimize-instructions... 48.116 seconds.
[PassRunner] (validating)
[PassRunner] running pass: rse... 74.6309 seconds.
[PassRunner] (validating)
[PassRunner] running pass: vacuum... 81.3507 seconds.
[PassRunner] (validating)
[PassRunner] running pass: dae-optimizing... 19.1415 seconds.
[PassRunner] (validating)
[PassRunner] running pass: inlining-optimizing... 16.7154 seconds.
[PassRunner] (validating)
[PassRunner] running pass: duplicate-function-elimination... 23.8512 seconds.
[PassRunner] (validating)
[PassRunner] running pass: duplicate-import-elimination... 0.0578341 seconds.
[PassRunner] (validating)
[PassRunner] running pass: simplify-globals-optimizing... 6.84558 seconds.
[PassRunner] (validating)
[PassRunner] running pass: remove-unused-module-elements... 21.3974 seconds.
[PassRunner] (validating)
[PassRunner] running pass: memory-packing... 0.0418483 seconds.
[PassRunner] (validating)
[PassRunner] running pass: directize... 0.0561223 seconds.
[PassRunner] (validating)
[PassRunner] running pass: generate-stack-ir... 5.07717 seconds.
[PassRunner] (validating)
[PassRunner] running pass: optimize-stack-ir... 303.577 seconds.
[PassRunner] (validating)
[PassRunner] running pass: avoid-reinterprets... 193.407 seconds.
[PassRunner] (validating)
[PassRunner] running pass: flatten... 227.296 seconds.
[PassRunner] (validating)
[PassRunner] running pass: simplify-locals-notee-nostructure... 452.76 seconds.
[PassRunner] (validating)
[PassRunner] running pass: remove-unused-names... 20.9014 seconds.
[PassRunner] (validating)
[PassRunner] running pass: merge-blocks... %
I suspect it is much slower in that mode due to much heavier validation. --no-validation might make it a lot faster while still showing the bug, assuming the bug is not related to validation.
Interesting, so it fails in merge-blocks? Is that consistent each time you run it, or maybe it is random?
Looks random, this time:
[PassRunner] running passes...
[PassRunner] running pass: autodrop... 46.0485 seconds.
[PassRunner] running pass: legalize-js-interface... 2.28625 seconds.
[PassRunner] running pass: remove-non-js-ops... 9.87816 seconds.
[PassRunner] running pass: flatten... 259.985 seconds.
[PassRunner] running pass: i64-to-i32-lowering... 247.905 seconds.
[PassRunner] running pass: alignment-lowering... 24.9692 seconds.
[PassRunner] running pass: simplify-locals-nonesting... 567.913 seconds.
[PassRunner] running pass: precompute-propagate... 1555.96 seconds.
[PassRunner] running pass: avoid-reinterprets... 712.264 seconds.
[PassRunner] running pass: duplicate-function-elimination... 31.4762 seconds.
[PassRunner] running pass: ssa-nomerge... 771.898 seconds.
[PassRunner] running pass: dce... 149.456 seconds.
[PassRunner] running pass: remove-unused-brs... 185.211 seconds.
[PassRunner] running pass: remove-unused-names... 22.9299 seconds.
[PassRunner] running pass: optimize-instructions... 69.0268 seconds.
[PassRunner] running pass: pick-load-signs... 47.5824 seconds.
[PassRunner] running pass: precompute... 51.9369 seconds.
[PassRunner] running pass: code-pushing... 95.9054 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 576.945 seconds.
[PassRunner] running pass: vacuum... 201.944 seconds.
[PassRunner] running pass: reorder-locals... 404.3 seconds.
[PassRunner] running pass: remove-unused-brs... 65.8322 seconds.
[PassRunner] running pass: coalesce-locals... 255.804 seconds.
[PassRunner] running pass: simplify-locals... 304.56 seconds.
[PassRunner] running pass: vacuum... 113.623 seconds.
[PassRunner] running pass: reorder-locals... 28.1115 seconds.
[PassRunner] running pass: coalesce-locals... 151.807 seconds.
[PassRunner] running pass: reorder-locals... 25.449 seconds.
[PassRunner] running pass: vacuum... 82.1876 seconds.
[PassRunner] running pass: code-folding... 63.9698 seconds.
[PassRunner] running pass: merge-blocks... 79.9715 seconds.
[PassRunner] running pass: remove-unused-brs... 58.7302 seconds.
[PassRunner] running pass: remove-unused-names... 10.4656 seconds.
[PassRunner] running pass: merge-blocks... 52.3901 seconds.
[PassRunner] running pass: precompute... 24.8123 seconds.
[PassRunner] running pass: optimize-instructions... 47.8441 seconds.
[PassRunner] running pass: rse... 74.478 seconds.
[PassRunner] running pass: vacuum... 81.0584 seconds.
[PassRunner] running pass: dae-optimizing... 19.1527 seconds.
[PassRunner] running pass: inlining-optimizing... 16.7058 seconds.
[PassRunner] running pass: duplicate-function-elimination... 23.8856 seconds.
[PassRunner] running pass: duplicate-import-elimination... 0.0457912 seconds.
[PassRunner] running pass: simplify-globals-optimizing... 6.81031 seconds.
[PassRunner] running pass: remove-unused-module-elements... 21.4296 seconds.
[PassRunner] running pass: memory-packing... 0.0207571 seconds.
[PassRunner] running pass: directize... 0.0281063 seconds.
[PassRunner] running pass: generate-stack-ir... 4.5748 seconds.
[PassRunner] running pass: optimize-stack-ir... 293.079 seconds.
[PassRunner] running pass: avoid-reinterprets... 193.185 seconds.
[PassRunner] running pass: flatten... 230.451 seconds.
[PassRunner] running pass: simplify-locals-notee-nostructure... 452.479 seconds.
[PassRunner] running pass: remove-unused-names... 20.5806 seconds.
[PassRunner] running pass: merge-blocks... 232.503 seconds.
[PassRunner] running pass: coalesce-locals... %
Sounds like a general memory usage issue then. In general I'm not sure what we can do, as this combination of options plus your input seems to just create incredibly large output. The only thing might be to add lower limits for certain things, but it's hard to decide where to put those limits.
Note that -O3 is a very high optimization level. Does this work with -O1 perhaps? That gets most of the optimizations with a lot less work.
Yes, this is not a blocker issue, more like problem exploration. The build is indeed not common (-O3 -g2 -fsanitize=address), and lowering the optimization level or removing ASAN reduces memory consumption.
At some point there is code that resizes the copies vector to numLocals^2 entries; this is the place that is potentially problematic. Is there any room to optimize that?
Nonetheless, thank you kindly for helping debug this. I'm leaving it to you whether this issue can be closed or not.
Cheers,
That N^2 matrix could be a sparse matrix. But that would bring its own tradeoffs with slower access times.
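To make the sparse idea concrete, here is one possible shape for it, purely as an illustration and not what binaryen does: a hash map keyed on the ordered pair of locals, so memory grows with the number of distinct copied pairs rather than with N^2. SparseCopies and key are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Index = uint32_t;

// Hypothetical sparse replacement for the dense copies matrix.
struct SparseCopies {
  std::unordered_map<uint64_t, uint8_t> counts;

  // Pack the ordered pair (min, max) into one 64-bit key, so that
  // (i, j) and (j, i) share an entry.
  static uint64_t key(Index i, Index j) {
    return (uint64_t(std::min(i, j)) << 32) | std::max(i, j);
  }

  void addCopy(Index i, Index j) {
    auto& c = counts[key(i, j)]; // a new entry starts at 0
    c = std::min(c, uint8_t(254)) + 1; // saturate at 255
  }

  uint8_t getCopies(Index i, Index j) const {
    auto it = counts.find(key(i, j));
    return it == counts.end() ? 0 : it->second;
  }
};
```

The tradeoff mentioned above shows up directly: every lookup is now a hash and a probe instead of a single indexed load, and each stored pair costs far more than one byte.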
The entries there are 8 bit, allowing us to count up to 255 copies between locals. That is probably too high for most use cases. So we might save a factor of 2 or 3 in memory there, perhaps, but not more, so that won't fully solve this.
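One way to get the factor of 2 would be to pack two 4-bit counters into each byte. The following is a sketch under that assumption (PackedCopies is a hypothetical name, and a cap of 15 copies is my own choice for illustration, not something measured against real inputs):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Index = uint32_t;

// Hypothetical dense matrix holding two 4-bit saturating counters per byte.
struct PackedCopies {
  Index numLocals;
  std::vector<uint8_t> data;

  explicit PackedCopies(Index n)
    : numLocals(n), data((size_t(n) * n + 1) / 2, 0) {}

  size_t slot(Index i, Index j) const {
    assert(i < numLocals && j < numLocals);
    return size_t(std::min(i, j)) * numLocals + std::max(i, j);
  }

  uint8_t getCopies(Index i, Index j) const {
    auto k = slot(i, j);
    uint8_t byte = data[k / 2];
    // Even slots live in the low nibble, odd slots in the high nibble.
    return (k % 2) ? uint8_t(byte >> 4) : uint8_t(byte & 0x0F);
  }

  void addCopy(Index i, Index j) {
    auto k = slot(i, j);
    uint8_t v = getCopies(i, j);
    if (v == 15) {
      return; // already saturated
    }
    v++;
    uint8_t& byte = data[k / 2];
    byte = (k % 2) ? uint8_t((byte & 0x0F) | (v << 4))
                   : uint8_t((byte & 0xF0) | v);
  }
};
```

This halves the footprint at the cost of a shift and mask per access, and it still scales as N^2, so as noted it would not fully solve the problem.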
Or we could limit N to something even lower than the 2^16 which we have now. But such arbitrary limits are troubling, as there will be users that would prefer to let their machine with huge memory process their huge project. And letting users apply BINARYEN_CORES to limit memory usage is a workaround for the current situation for people with huge projects but not huge machines (though, it is not obvious...).
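To illustrate why any fixed N is awkward, here is a small sketch that goes the other way: given a memory budget and a core count, it computes the largest N whose dense one-byte matrix would still fit per core. maxLocalsForBudget is a hypothetical helper, not an existing binaryen option, and it considers only the copies matrix.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Largest N such that `cores` simultaneous N x N byte matrices fit in
// budgetBytes. Only the copies matrix is counted.
inline uint64_t maxLocalsForBudget(uint64_t budgetBytes, uint64_t cores) {
  assert(cores > 0);
  uint64_t perCore = budgetBytes / cores;
  // Start from a floating-point estimate, then correct any rounding error.
  uint64_t n = uint64_t(std::sqrt((long double)perCore));
  while (n * n > perCore) {
    n--;
  }
  while ((n + 1) * (n + 1) <= perCore) {
    n++;
  }
  return n;
}
```

Under these assumptions a 32 GiB machine with 8 workers comes out at exactly 2^16 locals, while an 8 GiB machine with the same worker count would need a limit of 2^15, so the right limit depends on the machine and not only on the input.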
Overall I'm not sure how to improve this. Open to ideas!
And thanks for filing @trzecieu, hopefully we can find a good improvement. Leaving open for that.
Hi,
I've found that wasm2js is getting terminated during execution. After re-compiling wasm2js with address sanitizer support I got the following report. The original wasm was compiled with -O3 -g2 -fsanitize=address (so, wasm with ASAN). The iteration speed on this issue is pretty slow: getting the ASAN report took 90min of execution, and now I'm trying to get it to run under a debugger. The wat file is 1GiB in size, which makes it challenging to process.
While digging through code I've found that this might be already a suspected flaw: https://github.com/WebAssembly/binaryen/blob/760a51bd15a51f02bc1c75087a9cd9e11b9f27bb/src/cfg/liveness-traversal.h#L174-L176