Open Quuxplusone opened 3 years ago
Attached file_50197.txt
(351 bytes, text/plain): Sample code
Current Codegen: https://godbolt.org/z/bWvGK5vhT
I 'd like to fix this issue if there are no objections.
LLVM mid-end generates comparison of j
with 1<<60 for loop's condition and following DAG node is created for it:
t19: i1 = setcc t7, Constant:i128<1152921504606846976>, setult:ch
Then it is combined into following nodes:
t33: i128 = srl t7, Constant:i64<60>
t40: i1 = setcc t33, Constant:i128<0>, setne:ch
That are legalized into:
t45: i64 = srl t41, Constant:i64<60>
t44: i64 = shl t42, Constant:i64<4>
t46: i64 = or t45, t44
t47: i64 = srl t42, Constant:i64<60>
t49: i64 = or t46, t47
t48: i8 = setcc t49, Constant:i64<0>, setne:ch
And finally combined into:
t53: i64 = fshl t42, t41, Constant:i64<4>
t47: i64 = srl t42, Constant:i64<60>
t49: i64 = or t53, t47
t48: i8 = setcc t49, Constant:i64<0>, setne:ch
I see several ways to get rid of FSHL: 1) Hook into DAGTypeLegalizer::ExpandShiftByConstant, check that SRL has only one use and it's SetCC Eq/Ne 0, and then skip unnecessary shifts; 2) Explicitly match expanded SRL and combine it into (setcc (or (srl a), b, C), 0, setne); 3) Add several peepholes into DAGCombiner to iteratively combine: a) (or (fshl a, b, C1), (srl a, C2)) => (or (rotl a), (srl b)) // it is profitable disregard setcc optimization because rotations are usually faster than funnel shifts b) (setcc (or (rotl a), b), 0, setne) => (setcc (or a, b), 0, setne)
Personally I more into 3rd approach as it will work not only for the single case with setcc, but I'm not quite sure if patterns that could be combined by it arise frequently enough to make that approach reasonable to implement.
Note that the issue is not only about 128-bit variables and the same suboptimal code will be emitted for 64-bit variable on 32-bit architectures (for Aarch32 in particular).
Review: https://reviews.llvm.org/D111530
file_50197.txt
(351 bytes, text/plain)